Paperid:1
Authors:Aaditya Ramdas
Abstract:
Sequential anytime-valid inference (SAVI) provides measures of statistical evidence and uncertainty --- e-values and e-processes for testing and confidence sequences for estimation --- that remain valid at all stopping times. These allow for continuous monitoring and analysis of accumulating data and optional stopping for any reason. These methods crucially rely on nonnegative martingales, which are wealth processes of a player in a betting game, thus yielding the area of "game-theoretic statistics". This tutorial will present the game-theoretic philosophy, intuition, language and mathematics behind SAVI, summarized in a recent book https://arxiv.org/pdf/2410.23614, to be published before ICML as the first edition of the new book series, Foundations and Trends in Statistics.
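As a toy illustration of the betting view described above (not taken from the tutorial itself), the sketch below simulates a player's wealth when betting against a fair coin; the betting fraction `lam` and null probability `p0` are illustrative assumptions. Under the null, the wealth is a nonnegative martingale, so it stays valid as evidence at any stopping time.

```python
import random

def wealth_process(flips, lam=0.25, p0=0.5):
    """Betting wealth process for testing H0: P(heads) = p0.

    Each round the player stakes a fraction lam of wealth on heads.
    Under H0 the expected multiplier is exactly 1, so the wealth is a
    nonnegative martingale (an e-process); large wealth at any stopping
    time is quantifiable evidence against H0.
    """
    wealth = [1.0]
    for x in flips:
        # Multiplier is 1 + lam when heads (x=1), 1 - lam when tails (x=0).
        wealth.append(wealth[-1] * (1.0 + lam * (x - p0) / p0))
    return wealth

random.seed(0)
null_flips = [int(random.random() < 0.5) for _ in range(1000)]
w = wealth_process(null_flips)
```

Because the multipliers are bounded below by `1 - lam > 0`, the wealth can never go negative, which is the structural property SAVI methods exploit.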
Paperid:2
Authors:Mark Tygert
Abstract:
There are many different notions of bias and fairness. When comparing subpopulations, an especially important dichotomy is between (1) equal or equitable average outcomes and (2) equal or equitable treatment. In the particular context considered here, "equal treatment" and "equal opportunity" are not too different. However, comparing the average outcome of one subpopulation to another is different and sometimes less desirable than comparing the outcomes of pairs of individuals (one individual from each subpopulation) for which the individuals in each pair are similar. The latter requires comparing outcomes via "conditioning on" or "controlling for" confounding covariates.
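A minimal sketch of the paired-comparison idea above, using hypothetical data and a made-up `matched_pair_gap` helper: raw subpopulation averages can differ even when similar individuals (matched on a confounding covariate) receive identical outcomes.

```python
def matched_pair_gap(group_a, group_b, max_dist=0.5):
    """Average outcome gap over pairs of similar individuals.

    Each individual is a (covariate, outcome) tuple. For every member of
    group_a, find the closest member of group_b by covariate; if it is
    within max_dist, compare their outcomes. This approximates comparing
    outcomes while 'conditioning on' the confounding covariate.
    """
    gaps = []
    for xa, ya in group_a:
        xb, yb = min(group_b, key=lambda p: abs(p[0] - xa))
        if abs(xb - xa) <= max_dist:
            gaps.append(ya - yb)
    return sum(gaps) / len(gaps)

# Hypothetical data: outcome equals covariate in both groups, but the
# groups have different covariate distributions.
group_a = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
group_b = [(1.0, 1.0), (2.0, 2.0), (4.0, 4.0)]
raw_gap = (sum(y for _, y in group_a) / 3) - (sum(y for _, y in group_b) / 3)
```

Here the raw average gap is nonzero purely because of the covariate mismatch, while the matched-pair gap is zero: treatment is equal even though average outcomes are not.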
Paperid:3
Authors:Parikshit Ram · Benjamin Hoover · Dmitry Krotov
Abstract:
Associative Memories like the famous Hopfield Networks are elegant models for describing fully recurrent neural networks whose fundamental job is to store and retrieve information. In the past few years they experienced a surge of interest due to novel theoretical results pertaining to their information storage capabilities, and their relationship with SOTA AI architectures, such as Transformers and Diffusion Models. These connections open up possibilities for interpreting the computation of traditional AI networks through the theoretical lens of Associative Memories. Additionally, novel Lagrangian formulations of these networks make it possible to design powerful distributed models that learn useful representations and inform the design of novel architectures. This tutorial provides an approachable introduction to Associative Memories, emphasizing the modern language and methods used in this area of research, with practical handson mathematical derivations and coding notebooks.
Paperid:4
Authors:Mingzhi Wang · Chengdong Ma · Yaodong Yang
Abstract:
Large Language Model (LLM) alignment has become an increasingly critical topic in contemporary AI research, especially as LLMs continue to scale and integrate into real-world applications. Ensuring that LLMs generate outputs aligned with human values, preferences, and ethical considerations is essential for their safe and effective deployment. This tutorial aims to provide a comprehensive introduction to LLM alignment methods, offering a structured and accessible entry point for researchers and practitioners interested in the field. It will present key concepts and challenges, introduce fundamental approaches such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), and, building on these foundations, review a spectrum of refinements and variants. In addition, it will cover recent advancements in game-theoretic approaches to alignment and theoretical frameworks that provide a deeper understanding of alignment methodologies. Beyond theoretical insights, the tutorial will emphasize the practical aspects of LLM alignment, illustrating how these techniques are applied in real-world scenarios and guiding participants in building intuition about alignment strategies. By the end of the tutorial, attendees will gain a solid foundation in LLM alignment, equipping them with the knowledge needed to critically engage with the field, understand current research trends, and explore future directions.
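To make the DPO objective mentioned above concrete, here is a sketch of the per-pair DPO loss on toy log-probabilities. The function name and the example values are illustrative; the formula follows the published DPO objective, where `beta` scales the policy-vs-reference log-ratio margin between the chosen and rejected responses.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_w / logp_l are the policy's log-probs of the chosen (w) and
    rejected (l) responses; ref_logp_* come from the frozen reference
    model. The loss is -log sigmoid(beta * margin), where the margin is
    the difference of policy-vs-reference log-ratios.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference exactly, the margin is zero and the loss equals log 2; as the policy raises the chosen response's likelihood relative to the rejected one, the loss decreases.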
Paperid:5
Authors:Amy Zhang · Benjamin Eysenbach
Abstract:
This tutorial explores the intersection of generative AI and reinforcement learning, demonstrating how generative models can be understood as RL agents and environments, and conversely, how RL can be viewed as generative modeling. It aims to bridge the gap between these fields, showing how insights from each can enhance the other. The tutorial will cover topics such as reinterpreting generative AI training through an RL lens, adapting generative AI to build new RL algorithms, and understanding how AI agents interacting with tools and humans create a new generative model. It will also discuss future directions and open problems, focusing on how RL can shape the future of foundation model training and enable generative AI systems to construct their own knowledge.
Paperid:6
Authors:Ziyu Yao · Daking Rai
Abstract:
Mechanistic interpretability (MI) is an emerging subfield of interpretability that seeks to understand a neural network model by reverse-engineering its internal computations. Recently, MI has garnered significant attention for interpreting transformer-based language models (LMs), resulting in many novel insights yet introducing new challenges. Given how fast this topic is now attracting the ML/AI community's attention, the goal of this tutorial is to provide a comprehensive overview of MI for LMs, including its historical contexts, the various techniques to implement and evaluate MI, findings and applications based on MI, and future challenges. The tutorial will particularly be presented following an innovative Beginner's Roadmap that the presenters carefully curated, aiming to enable researchers new to MI to quickly pick up this field and leverage MI techniques in their LM applications.
Paperid:7
Authors:Qing Qu · Yuxin Chen · Liyue Shen
Abstract:
Diffusion models have recently gained attention as a powerful class of deep generative models, achieving state-of-the-art results in data generation tasks. In a nutshell, they are designed to learn an unknown data distribution starting from Gaussian noise, mimicking the process of non-equilibrium thermodynamic diffusion. Despite their outstanding empirical successes, the mathematical and algorithmic foundations of diffusion models remain far from mature. For instance: (i) Generalization: it remains unclear how diffusion models, trained on finite samples, can generate new and meaningful data that differ from the training set; (ii) Efficiency: due to the enormous model capacity and the requirement of many sampling steps, they often suffer from slow training and sampling speeds; (iii) Controllability: it remains computationally challenging and unclear how to guide and control the content generated by diffusion models, raising challenges over controllability and safety, as well as solving inverse problems across many scientific imaging applications.
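The noising process at the heart of these models can be sampled in closed form. The sketch below assumes a variance-preserving forward process with an illustrative constant noise schedule (the `betas` values are not from any specific paper): as the step count grows, the signal coefficient shrinks toward zero and the sample approaches pure Gaussian noise.

```python
import math
import random

def forward_diffuse(x0, t, betas):
    """Sample x_t ~ q(x_t | x_0) for a variance-preserving diffusion.

    alpha_bar_t = prod_{s <= t} (1 - beta_s). In closed form,
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,
    so as t grows the signal term vanishes and x_t is pure noise.
    """
    alpha_bar = 1.0
    for s in range(t):
        alpha_bar *= 1.0 - betas[s]
    eps = random.gauss(0.0, 1.0)
    return math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps, alpha_bar

betas = [0.02] * 1000  # constant schedule, assumed for illustration
_, ab_small = forward_diffuse(3.0, 10, betas)
_, ab_large = forward_diffuse(3.0, 1000, betas)
```

The gap between `ab_small` (signal mostly intact) and `ab_large` (signal essentially gone) is what the learned reverse process must bridge, one denoising step at a time, which is also why sampling efficiency is a core open problem.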
Paperid:8
Authors:Natalia Ponomareva · Sergei Vassilvitskii · Peter Kairouz · Alex Bie
Abstract:
This tutorial focuses on the increasingly important area of differentially private (DP) synthetic data generation, addressing the need for robust anonymization in machine learning. Creating DP synthetic data allows for data sharing and analysis without compromising individuals' privacy, opening up possibilities for collaborative research and model training. The tutorial aims to bridge the gap between various related fields, such as DP training, DP inference, and empirical privacy testing, providing a comprehensive guide for generating DP synthetic data across different data types.
Paperid:9
Authors:Hamed Hassani · Amin Karbasi · Alexander Robey · Yaron Singer
Abstract:
Since the inception of large language models (LLMs), considerable attention has been directed toward the field of AI safety. These efforts aim to identify a range of best practices—including evaluation protocols, defense algorithms, and content filters—that facilitate the ethical, trustworthy, and reliable deployment of LLMs and related technologies. A key component of AI safety is model alignment, a broad concept referring to algorithms that optimize the outputs of LLMs to align with human values. And yet, despite these efforts, recent research has identified several failure modes—referred to as jailbreaks—that circumvent LLM alignment by eliciting unsafe content from a targeted model. And while initial jailbreaks targeted the generation of harmful information (e.g., copyrighted or illegal material), modern attacks seek to elicit domain-specific harms, such as digital agents violating user privacy and LLM-controlled robots performing harmful actions in the physical world. In the worst case, future attacks may target self-replication or power-seeking behaviors. The insidious nature of jailbreaking attacks represents a substantial obstacle to the broad adoption of LLMs. Therefore, it is critical for the machine learning community to study these failure modes and develop effective defense strategies that counteract them.
Paperid:10
Authors:qiang liu
Abstract:
Continuous-time generative models—particularly diffusion- and flow-based models—have emerged as a dominant paradigm in generative AI, with applications in image, video, molecular, and audio synthesis, as well as scientific modeling. Despite their success, the field’s rich mathematical structure, varied terminology, and subtle theoretical foundations often lead to confusion and fragmented understanding.
Paperid:11
Authors:Leena Chennuru Vankadara
Abstract:
At the heart of deep learning’s transformative impact lies the concept of scale, encompassing both data and computational resources, as well as their interaction with neural network architectures.
Paperid:12
Authors:Jiaoda Li
Abstract:
The formal basis of the theory of computation lies in the study of languages, subsets of Σ*, the set of all strings over an alphabet Σ. Models of computation can be taxonomized by the languages they can decide, i.e., the languages for which a model can determine membership. For instance, finite-state automata can decide membership in the regular languages. Language models are probabilistic generalizations of languages in which the notion of a set is relaxed into one of a probability distribution over Σ*. Recently, language models parameterized using recurrent neural networks, transformers, and state-space models have achieved enormous success in natural language processing. Similarly to how theorists have taxonomized models of deterministic computation, researchers have sought to taxonomize the expressivity of language models based on various architectures in terms of the distributions over strings they can represent. This tutorial presents a self-contained overview of the formal methods used to taxonomize the expressivity of language models, which encompass formal language and automata theory, various forms of formal logic, circuit complexity, and programming languages such as RASP.
For example, we illustrate how transformers, under varying assumptions, can be characterized by different fragments of formal logic.
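As a minimal concrete instance of the automata-theoretic side of this taxonomy, the sketch below builds a deterministic finite-state automaton that decides a regular language; the `make_dfa` helper is an illustrative construction, not from the tutorial.

```python
def make_dfa(start, accept, delta):
    """Return a membership decider for the language of a DFA.

    start: initial state; accept: set of accepting states;
    delta: dict mapping (state, symbol) -> next state.
    """
    def decides(s):
        q = start
        for ch in s:
            q = delta[(q, ch)]
        return q in accept
    return decides

# DFA deciding the regular language of strings over {a, b}
# containing an even number of a's. State 0 = "even so far".
even_as = make_dfa(
    start=0, accept={0},
    delta={(0, 'a'): 1, (1, 'a'): 0, (0, 'b'): 0, (1, 'b'): 1},
)
```

Expressivity results for neural language models play the same game one level up: instead of asking which sets of strings an architecture can decide, they ask which distributions over strings it can represent.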
Paperid:13
Authors:Sarah Meiklejohn · Hayden Blauzvern · Mihai Maruseac · Spencer Schrock · Laurent Simon · Ilia Shumailov
Abstract:
Powerful machine learning (ML) models are now readily available online, which creates exciting possibilities for users who lack the deep technical expertise or substantial computing resources needed to develop them. On the other hand, this type of open ecosystem comes with many risks. In this paper, we argue that the current ecosystem for open ML models contains significant supply-chain risks, some of which have been exploited already in real attacks. These include an attacker replacing a model with something malicious (e.g., malware), or a model being trained using a vulnerable version of a framework or on restricted or poisoned data. We then explore how Sigstore, a solution designed to bring transparency to open-source software supply chains, can be used to bring transparency to open ML models, in terms of enabling model publishers to sign their models during different stages of development and prove properties about the datasets they use.
Paperid:14
Authors:Aviv Ovadya · Kyle Redman · Oliver Smith · Luke Thorburn · Flynn Devine · Andrew Konya · Smitha Milli · Manon Revel · Quan Ze Chen · Kevin Feng · Amy Zhang · Bilva Chandra · Michiel Bakker · Atoosa Kasirzadeh
Abstract:
This position paper argues that effectively “democratizing AI” requires democratic governance and alignment of AI, and that this is particularly valuable for decisions with systemic societal impacts. Initial steps — such as Meta’s Community Forums and Anthropic’s Collective Constitutional AI — have illustrated a promising direction, where democratic processes might be used to meaningfully improve public involvement and trust in critical decisions. To more concretely explore what increasingly democratic AI might look like, we provide a “Democracy Levels” framework and associated tools that: (i) define milestones toward meaningfully democratic, pluralistic, human-centered, and public-interest AI, (ii) can help guide organizations seeking to increase the legitimacy of their decisions on difficult AI governance and alignment questions, and (iii) support evaluation of such efforts.
Paperid:15
Authors:Yedi Zhang · Yufan Cai · Xinyue Zuo · Xiaokun Luan · Kailong Wang · Zhe Hou · Yifan Zhang · Zhiyuan Wei · Meng Sun · Jun Sun · Jing Sun · Jin Song Dong
Abstract:
Large Language Models (LLMs) have emerged as a transformative AI paradigm, profoundly influencing daily life through their exceptional language understanding and contextual generation capabilities. Despite their remarkable performance, LLMs face a critical challenge: the propensity to produce unreliable outputs due to the inherent limitations of their learning-based nature. Formal methods (FMs), on the other hand, are a well-established computation paradigm that provides mathematically rigorous techniques for modeling, specifying, and verifying the correctness of systems. FMs have been extensively applied in mission-critical software engineering, embedded systems, and cybersecurity. However, the primary challenge impeding the deployment of FMs in real-world settings lies in their steep learning curves, the absence of user-friendly interfaces, and issues with efficiency and adaptability. This position paper outlines a roadmap for advancing the next generation of trustworthy AI systems by leveraging the mutual enhancement of LLMs and FMs. First, we illustrate how FMs, including reasoning and certification techniques, can help LLMs generate more reliable and formally certified outputs. Subsequently, we highlight how the advanced learning capabilities and adaptability of LLMs can significantly enhance the usability, efficiency, and scalability of existing FM tools. Finally, we show that unifying these two computation paradigms, integrating the flexibility and intelligence of LLMs with the rigorous reasoning abilities of FMs, has transformative potential for the development of trustworthy AI software systems. We acknowledge that this integration has the potential to enhance both the trustworthiness and efficiency of software engineering practices while fostering the development of intelligent FM tools capable of addressing complex yet real-world challenges.
Paperid:16
Authors:Kirsten Morehouse · Siddharth Swaroop · Weiwei Pan
Abstract:
We argue that the new frontier of LLM bias probing can (and should) benefit from decades of social science research. While many LLM probes are direct applications of human ones, there remains a disconnect between psychological theory and LLM bias research. This introduces significant challenges: (1) we lack principled criteria for choosing appropriate probes, (2) we lack a system for reconciling conflicting results across probes, and (3) we lack formal frameworks for reasoning about when (and why) probe results will generalize to real user behavior. Existing taxonomies do not resolve these challenges. In this position paper, we describe actionable directions for systematizing LLM bias studies by connecting probes and the constructs they are intended to measure. We introduce EcoLevels – a framework grounded in social science for classifying and comparing bias probes. EcoLevels organizes probes along two features: the (abstraction) level at which bias is assessed and ecological validity (i.e., probe-task generalization). We argue that EcoLevels can prevent end-users from drawing incorrect insights from LLM bias probing.
Paperid:17
Authors:Sarah Hartman · Cheng Soon Ong · Julia Powles · Petra Kuhnert
Abstract:
This position paper argues that achieving meaningful scientific and societal advances with artificial intelligence (AI) requires a responsible, application-driven approach (RAD) to AI research. As AI is increasingly integrated into society, AI researchers must engage with the specific contexts where AI is being applied. This includes being responsive to ethical and legal considerations, technical and societal constraints, and public discourse. We present the case for RAD-AI to drive research through a three-staged approach: (1) building transdisciplinary teams and people-centred studies; (2) addressing context-specific methods, ethical commitments, assumptions, and metrics; and (3) testing and sustaining efficacy through staged testbeds and a community of practice. We present a vision for the future of application-driven AI research to unlock new value through technically feasible methods that are adaptive to the contextual needs and values of the communities they ultimately serve.
Paperid:18
Authors:Yoonsoo Nam · Seok Hyeong Lee · Clémentine Dominé · Yeachan Park · Charles London · Wonyl Choi · Niclas Göring · Seungjai Lee
Abstract:
In physics, complex systems are often simplified into minimal, solvable models that retain only the core principles. In machine learning, layerwise linear models (e.g., linear neural networks) act as simplified representations of neural network dynamics. These models follow the dynamical feedback principle, which describes how layers mutually govern and amplify each other's evolution. This principle extends beyond the simplified models, successfully explaining a wide range of dynamical phenomena in deep neural networks, including neural collapse, emergence, lazy and rich regimes, and grokking. In this position paper, we call for the use of layerwise linear models retaining the core principles of neural dynamical phenomena to accelerate the science of deep learning.
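The dynamical feedback principle described above can be seen in the smallest possible layerwise linear model. The sketch below (an illustrative toy, not from the paper) trains a two-layer scalar linear network by gradient descent; each layer's growth rate is gated by the other, producing the slow-then-fast learning curves these models are known for.

```python
def train_deep_linear(target=2.0, lr=0.01, steps=5000, init=0.01):
    """Two-layer scalar linear net f(x) = w2 * w1 * x fit to y = target * x.

    Gradient descent on the squared error gives the coupled updates
    w1 += lr * w2 * err and w2 += lr * w1 * err: each layer's update is
    scaled by the other layer's weight (the dynamical feedback
    principle), so learning is slow while both weights are small and
    accelerates as they grow together.
    """
    w1 = w2 = init
    for _ in range(steps):
        err = target - w2 * w1
        w1, w2 = w1 + lr * w2 * err, w2 + lr * w1 * err
    return w1 * w2
```

Starting from a small symmetric init, the product `w1 * w2` follows a sigmoidal trajectory toward the target, the same qualitative shape invoked to explain emergence and grokking-like transitions in the full nonlinear setting.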
Paperid:19
Authors:Maya BechlerSpeicher · Ben Finkelshtein · Fabrizio Frasca · Luis Müller · Jan M Tönshoff · Antoine Siraudin · Viktor Zaverkin · Michael Bronstein · Mathias Niepert · Bryan Perozzi · Mikhail Galkin · Christopher Morris
Abstract:
While machine learning on graphs has demonstrated promise in drug design and molecular property prediction, significant benchmarking challenges hinder its further progress and relevance. Current benchmarking practices often lack focus on transformative, real-world applications, favoring narrow domains like two-dimensional molecular graphs over broader, impactful areas such as combinatorial optimization, databases, or chip design. Additionally, many benchmark datasets poorly represent the underlying data, leading to inadequate abstractions and misaligned use cases. Fragmented evaluations and an excessive focus on accuracy further exacerbate these issues, incentivizing overfitting rather than fostering generalizable insights. These limitations have prevented the development of truly useful graph foundation models. This position paper calls for a paradigm shift toward more meaningful benchmarks, rigorous evaluation protocols, and stronger collaboration with domain experts to drive impactful and reliable advances in graph learning research, unlocking the potential of graph learning.
Paperid:20
Authors:Giovanni Luca Marchetti · Vahid Shahverdi · Stefano Mereta · Matthew Trager · Kathlén Kohn
Abstract:
In this position paper, we promote the study of function spaces parameterized by machine learning models through the lens of algebraic geometry. To this end, we focus on algebraic models, e.g. neural networks with polynomial activations, whose associated function spaces are semialgebraic varieties. We outline a dictionary between algebro-geometric invariants of these varieties, such as dimension, degree, and singularities, and fundamental aspects of machine learning, such as sample complexity, expressivity, training dynamics, and implicit bias. Along the way, we review the literature and discuss ideas beyond the algebraic domain. This work lays the foundations of a research direction bridging algebraic geometry and deep learning, that we refer to as neuroalgebraic geometry.
Paperid:21
Authors:Jan Kulveit · Raymond Douglas · Nora Ammann · Deger Turan · David Krueger · David Duvenaud
Abstract:
This paper examines the systemic risks posed by incremental advancements in artificial intelligence, developing the concept of `gradual disempowerment', in contrast to the abrupt takeover scenarios commonly discussed in AI safety. We analyze how even incremental improvements in AI capabilities can undermine human influence over large-scale systems that society depends on, including the economy, culture, and nation-states. As AI increasingly replaces human labor and cognition in these domains, it can weaken both explicit human control mechanisms (like voting and consumer choice) and the implicit alignments with human preferences that often arise from societal systems' reliance on human participation to function. Furthermore, AI systems may amplify existing misalignments with human preferences by optimizing these systems more powerfully. These distortions across domains may be mutually reinforcing: economic power shapes cultural narratives and political decisions, while cultural shifts alter economic and political behavior. We argue that this dynamic could lead to an effectively irreversible loss of human influence over crucial societal systems, precipitating an existential catastrophe through the permanent disempowerment of humanity. This analysis suggests the need for both technical research and governance approaches that specifically address the risk of incremental erosion of human influence across interconnected societal systems.
Paperid:22
Authors:Jaeho Kim · Yunseok Lee · Seulki Lee
Abstract:
The peer review process in major artificial intelligence (AI) conferences faces unprecedented challenges with the surge of paper submissions (exceeding 10,000 submissions per venue), accompanied by growing concerns over review quality and reviewer responsibility. This position paper argues for the need to transform the traditional one-way review system into a bi-directional feedback loop where authors evaluate review quality and reviewers earn formal accreditation, creating an accountability framework that promotes a sustainable, high-quality peer review system. The current review system can be viewed as an interaction between three parties: the authors, reviewers, and system (i.e., conference), where we posit that all three parties share responsibility for the current problems. However, issues with authors can only be addressed through policy enforcement and detection tools, and ethical concerns can only be corrected through self-reflection. As such, this paper focuses on reforming reviewer accountability with systematic rewards through two key mechanisms: (1) a two-stage bi-directional review system that allows authors to evaluate reviews while minimizing retaliatory behavior, (2) a systematic reviewer reward system that incentivizes quality reviewing. We invite the community's strong interest in these problems and the reforms needed to enhance the peer review process.
Paperid:24
Authors:Marta An Kimmel · Mueed Rehman · Yonatan Bisk · Gary Fedder
Abstract:
In this paper, we examine the manufacturability gap in state-of-the-art generative models for 3D object representations. Many models for generating 3D assets focus on rendering virtual content and do not consider the constraints of real-world manufacturing, such as milling, casting, or injection molding. We demonstrate that existing generative models for computer-aided design representations do not generalize to datasets other than those used for training or to unmodified real, human-created objects. We identify limitations of current approaches, including missing manufacturing-readable semantics and ontology, the inability to decompose complex shapes into parameterized segments appropriate for computer-aided manufacturing, and a lack of appropriate scoring metrics to assess the generated output versus the true desired output from the dataset. The academic community could greatly impact real-world manufacturing by rallying around pathways to solve these challenges. We offer revised, more realistic datasets and baseline benchmarks as a step in targeting the challenge. In evaluating these datasets, we find that existing models are severely overfit to simpler data.
Paperid:25
Authors:Sayash Kapoor · Noam Kolt · Seth Lazar
Abstract:
Language model agents could reshape how users navigate and act in digital environments. If controlled by platform companies, which already dominate online search, communication, and commerce, platform agents would intensify surveillance, exacerbate user lock-in, and further entrench the incumbent digital giants. This position paper argues that to resist the undesirable effects of platform agents, we should champion agent advocates: agents that are controlled by users, serve the interests of users, and preserve user autonomy and choice. We identify key interventions to enable agent advocates: ensuring access to compute, developing interoperability protocols and safety standards, and implementing appropriate market regulations.
Paperid:26
Authors:Serena Booth
Abstract:
Consumer protection laws are designed to protect consumers from unethical business practices. In this position paper, I argue that these laws serve an emergent dual purpose: if appropriately enforced and strengthened, consumer protection laws can serve as an inalienable defense for AI safety. These laws are well established and can be enforced and strengthened to incentivize businesses to design and deploy safer AI systems. This position runs counter to two prevailing trends in AI policy. The first alternative position is that AI safety requires an entirely new set of focused laws to protect humanity's prosperity. Though I find these efforts valuable, I argue that such focused laws are both hard to write and easy to skirt. The second alternative position is that consumer protection is nothing more than red tape; I argue that existing laws dating back many decades have already reined in some nefarious business practices related to the development and deployment of AI, and that the litigious society of the United States is well-positioned to use consumer protection laws to encourage new AI safety guardrails. This paper takes a tour of some existing consumer protection laws in the United States and their effects on the development and use of AI systems.
Paperid:27
Authors:Abeer Badawi · Md Tahmid Rahman Laskar · Jimmy Huang · Shaina Raza · Elham Dolatabadi
Abstract:
This position paper argues for a fundamental shift in how Large Language Models (LLMs) are integrated into the mental health care domain. We advocate for their role as co-creators rather than mere assistive tools. While LLMs have the potential to enhance accessibility, personalization, and crisis intervention, their adoption remains limited due to concerns about bias, evaluation, over-reliance, dehumanization, and regulatory uncertainties. To address these challenges, we propose two structured pathways: SAFE-I Implementation Guidelines for ethical and responsible deployment, and HAAS-E Evaluation Framework for multidimensional, human-centered assessment. SAFE-I provides a blueprint for data governance, adaptive model engineering, and real-world integration, ensuring LLMs align with clinical and ethical standards. HAAS-E introduces evaluation metrics that go beyond technical accuracy to measure trustworthiness, empathy, cultural sensitivity, and actionability. We call for the adoption of these structured approaches to establish a responsible and scalable model for LLM-driven mental health support, ensuring that AI complements—rather than replaces—human expertise.
Paperid:28
Authors:Yanzhou Pan · Jiayi Chen · Jiamin Chen · Zhaozhuo Xu · Denghui Zhang
Abstract:
The infringement risks of LLMs have raised significant copyright concerns across different stages of the model lifecycle. While current methods often address these issues separately, this position paper argues that the LLM copyright challenges are inherently connected, and independent optimization of these solutions leads to theoretical bottlenecks. Building on this insight, we further argue that managing LLM copyright risks requires a systemic approach rather than fragmented solutions. In this paper, we analyze the limitations of existing methods in detail and introduce an iterative online-offline joint optimization framework to effectively manage complex LLM copyright risks. We demonstrate that this framework offers a scalable and practical solution to mitigate LLM infringement risks, and also outline new research directions that emerge from this perspective.
Paperid:29
Authors:Kevin Wei · Patricia Paskov · Sunishchal Dev · Michael Byun · Anka Reuel · Xavier Roberts-Gaal · Rachel Calcott · Evie Coxon · Chinmay Deshpande
Abstract:
This position paper argues that human baselines in foundation model evaluations must be more rigorous and more transparent to enable meaningful comparisons of human vs. AI performance. Human performance baselines are vital for the machine learning community, downstream users, and policymakers to interpret AI evaluations. Models are often claimed to achieve "superhuman" performance, but existing baselining methods are neither sufficiently rigorous nor sufficiently well-documented to robustly measure and assess performance differences. Based on a meta-review of the measurement theory and AI evaluation literatures, we derive a framework for assessing human baselining methods. We then use our framework to systematically review 113 human baselines in foundation model evaluations, identifying shortcomings in existing baselining methods. We hope our work can advance more rigorous AI evaluation practices that can better serve both the research community and policymakers. Data is available at: [GITHUB LINK].
Paperid:30
Authors:Jing Yang
Abstract:
The rapid growth of submissions to top-tier Artificial Intelligence (AI) and Machine Learning (ML) conferences has prompted many venues to transition from closed to open review platforms. Some have fully embraced open peer reviews, allowing public visibility throughout the process, while others adopt hybrid approaches, such as releasing reviews only after final decisions or keeping reviews private despite using open peer review systems. In this work, we analyze the strengths and limitations of these models, highlighting the growing community interest in transparent peer review. To support this discussion, we examine insights from Paper Copilot, a website launched two years ago to aggregate and analyze AI / ML conference data while engaging a global audience. The site has attracted over 200,000 early-career researchers, particularly those aged 18–34 from 177 countries, many of whom are actively engaged in the peer review process. Drawing on our findings, this position paper advocates for a more transparent, open, and well-regulated peer review process, aiming to foster greater community involvement and propel advancements in the field.
Paperid:31
Authors:Sunayana Rane · Cyrus Kirkman · Graham Todd · Amanda Royka · Ryan Law · Erica Cartmill · Jacob Foster
Abstract:
It has become increasingly challenging to understand and evaluate LLM capabilities as these models exhibit a broader range of behaviors. In this position paper, we argue that LLM researchers should draw on the lessons from another field which has developed a rich set of experimental paradigms and design practices for probing the behavior of complex intelligent systems: animal cognition. We present five core principles of evaluation drawn from animal cognition research, and explain how they provide invaluable guidance for understanding LLM capabilities and behavior. We ground these principles in an empirical case study, and show how they can already provide a richer picture of one particular reasoning capability: transitive inference.
Paperid:33
Authors:Ruth Elisabeth Appel
Abstract:
There is strong agreement that generative AI should be regulated, but strong disagreement on how to approach regulation. While some argue that AI regulation should mostly rely on extensions of existing laws, others argue that entirely new laws and regulations are needed to ensure that generative AI benefits society. In this position paper, we argue that the debates on generative AI regulation can be informed by the debates and evidence on social media regulation. For example, AI companies have faced allegations of political bias regarding the images and text their models produce, similar to the allegations social media companies have faced regarding content ranking on their platforms. First, we compare and contrast the affordances of generative AI and social media to highlight their similarities and differences. Then, we discuss specific policy recommendations based on the evolution of social media and their regulation. These recommendations include investments in (1) efforts to counter bias and perceptions thereof (e.g., via transparency, researcher access, oversight boards, democratic input, research studies), (2) specific areas of regulatory concern (e.g., youth wellbeing, election integrity) and trust and safety, (3) computational social science research, and (4) a more global perspective. Applying lessons learnt from social media regulation to generative AI regulation can save effort and time, and prevent avoidable mistakes.
Paperid:35
Authors:Oliver Eberle · Thomas McGee · Hamza Giaffar · Taylor Webb · Ida Momennejad
Abstract:
What algorithms do LLMs actually learn and use to solve problems? Studies addressing this question are sparse, as research priorities are focused on improving performance through scale, leaving a theoretical and empirical gap in understanding emergent algorithms in LLMs. This position paper proposes AlgEval: a framework for systematic research into the algorithms that LLMs learn and use. AlgEval aims to uncover both algorithmic primitives (implemented via latent representations, attention weights, and inference-time compute) and algorithmic composition (how these primitives are selected and combined to solve task-specific problems). We highlight potential methodological paths and a case study toward this goal, focusing on emergent search algorithms. Our case study illustrates both the formation of top-down hypotheses about candidate algorithms, and bottom-up tests of these hypotheses via circuit-level analysis of attention patterns and hidden states. The rigorous, systematic evaluation of how LLMs actually solve tasks provides an alternative to resource-intensive scaling, reorienting the field toward a principled understanding of underlying computations. This can in turn lead to more sample-efficient methods for training and improving performance, as well as novel architectures for end-to-end and multi-agent systems.
Paperid:36
Authors:Miriam Doh · Benedikt Höltgen · Piera Riccio · Nuria Oliver
Abstract:
This position paper critiques the reliance on rigid racial taxonomies in machine learning, exposing their U.S.-centric nature and lack of global applicability—particularly in Europe, where race categories are not commonly used. These classifications oversimplify racial identity, erasing the experiences of mixed-race individuals and reinforcing outdated essentialist views that contradict the social construction of race. We suggest research agendas in machine learning that move beyond categorical variables to better address discrimination and social inequality.
Paperid:37
Authors:Alan Jeffares · M van der Schaar
Abstract:
Developing a better understanding of surprising or counterintuitive phenomena has constituted a significant portion of deep learning research in recent years. These include double descent, grokking, and the lottery ticket hypothesis among many others. A natural motivation for this body of work appears to arise from the need to "resolve" or "understand" these phenomena on a narrow case-by-case basis. This position paper asserts that, in many prominent cases, there is little evidence to suggest that these phenomena appear in real-world applications and these efforts may be inefficient in driving progress in the broader field. In response, we advocate for a more intentionally pragmatic methodology, prioritizing a focus on practical utility. We argue that they should be studied in a manner similar to natural phenomena (as in e.g. physics) where they can act as valuable challenges to our broader intuitions by taking a scientific approach. This is in contrast to viewing them as individual challenges that require bespoke resolutions or explanations. This position is reinforced by analyzing the research outcomes of several prominent examples of these phenomena from the recent literature. Finally, we revisit the current norms in the research community in approaching these problems and propose practical recommendations for future research aiming to ensure that progress on deep learning phenomena is well aligned with the ultimate goal of progress in the broader field of deep learning.
Paperid:39
Authors:Jacy Anthis · Ryan Liu · Sean Richardson · Austin Kozlowski · Bernard Koch · Erik Brynjolfsson · James Evans · Michael Bernstein
Abstract:
Accurate and verifiable large language model (LLM) simulations of human research subjects promise an accessible data source for understanding human behavior and training new AI systems. However, results to date have been limited, and few social scientists have adopted these methods. In this position paper, we argue that the promise of LLM social simulations can be achieved by addressing five tractable challenges. We ground our argument in a literature survey of empirical comparisons between LLMs and human research subjects, commentaries on the topic, and related work. We identify promising directions with prompting, finetuning, and complementary methods. We believe that LLM social simulations can already be used for exploratory research, such as pilot experiments for psychology, economics, sociology, and marketing. More widespread use may soon be possible with rapidly advancing LLM capabilities, and researchers should prioritize developing conceptual models and evaluations that can be iteratively deployed and refined at pace with ongoing AI advances.
Paperid:40
Authors:Min-Hsuan Yeh · Jeffrey Wang · Xuefeng Du · Seongheon Park · Leitian Tao · Shawn Im · Sharon Li
Abstract:
As AI systems become increasingly capable and influential, ensuring their alignment with human values, preferences, and goals has become a critical research focus. Current alignment methods primarily focus on designing algorithms and loss functions but often underestimate the crucial role of data. This paper advocates for a shift towards data-centric AI alignment, emphasizing the need to enhance the quality and representativeness of data used in aligning AI systems. In this position paper, we highlight key challenges associated with both human-based and AI-based feedback within the data-centric alignment framework. Through qualitative analysis, we identify multiple sources of unreliability in human feedback, as well as problems related to temporal drift, context dependence, and AI-based feedback failing to capture human values due to inherent model limitations. We propose future research directions, including improved feedback collection practices, robust data-cleaning methodologies, and rigorous feedback verification processes. We call for future research in these critical directions, addressing gaps that persist in understanding and improving data-centric alignment practices.
Paperid:41
Authors:Sebastian Bordt · Eric Raidl · Ulrike Luxburg
Abstract:
In the rapidly growing literature on explanation algorithms, it often remains unclear what precisely these algorithms are for and how they should be used. In this position paper, we argue for a novel and pragmatic perspective: Explainable machine learning needs to recognize its parallels with applied statistics. Concretely, explanations are statistics of high-dimensional functions, and we should think about them analogously to traditional statistical quantities. Among others, this implies that we must think carefully about the matter of interpretation, or how the explanations relate to intuitive questions that humans have about the world. The fact that this is scarcely being discussed in research papers is one of the main drawbacks of the current literature. Luckily, the analogy between explainable machine learning and applied statistics provides a fruitful path for improving research practices.
Paperid:42
Authors:John Hewitt · Robert Geirhos · Been Kim
Abstract:
This position paper argues that, in order to understand AI, we cannot rely on our existing vocabulary of human words. Instead, we should strive to develop neologisms: new words that represent precise human concepts that we want to teach machines, or machine concepts that we need to learn. We start from the premise that humans and machines have differing concepts. This means interpretability can be framed as a communication problem: humans must be able to reference and control machine concepts, and communicate human concepts to machines. Creating a shared human-machine language through developing neologisms, we believe, could solve this communication problem. Successful neologisms achieve a useful amount of abstraction: not too detailed, so they’re reusable in many contexts, and not too high-level, so they convey precise information. As a proof of concept, we demonstrate how a “length neologism” enables controlling LLM response length, while a “diversity neologism” allows sampling more variable responses. Taken together, we argue that we cannot understand AI using our existing vocabulary, and expanding it through neologisms creates opportunities for both controlling and understanding machines better.
Paperid:43
Authors:Ahmed Alaa · Thomas Hartvigsen · Niloufar Golchini · Shiladitya Dutta · Frances Dean · Inioluwa Raji · Travis Zack
Abstract:
Medical large language model (LLM) research often makes bold claims, from encoding clinical knowledge to reasoning like a physician. These claims are usually backed by evaluation on competitive benchmarks, a tradition inherited from mainstream machine learning. But how do we separate real progress from a leaderboard flex? Medical LLM benchmarks, much like those in other fields, are arbitrarily constructed using medical licensing exam questions. For these benchmarks to truly measure progress, they must accurately capture the real-world tasks they aim to represent. In this position paper, we argue that medical LLM benchmarks should—and indeed can—be empirically evaluated for their construct validity. In psychology, "construct validity" refers to how well a test measures the concept it is meant to assess. By drawing an analogy between LLM benchmarks and psychological tests, we explain how frameworks from this field can provide empirical foundations for validating benchmarks. To put this into practice, we use real-world clinical data in proof-of-concept experiments to evaluate popular medical LLM benchmarks and report gaps in their construct validity. Finally, we outline a vision for a new ecosystem of medical LLM evaluation centered around the creation of valid benchmarks.
Paperid:45
Authors:Yan Shvartzshnaider · Vasisht Duddu
Abstract:
The machine learning community is discovering Contextual Integrity (CI) as a useful framework to assess the privacy implications of large language models (LLMs). This is an encouraging development. CI theory emphasizes sharing information in accordance with privacy norms and can bridge the social, legal, political, and technical aspects essential for evaluating privacy in LLMs. However, this is also a good point to reflect on the use of CI for LLMs. This position paper argues that existing literature adopts CI for LLMs without embracing the theory’s fundamental tenets, essentially amounting to a form of "CI-washing." CI-washing could lead to incorrect conclusions and flawed privacy-preserving designs. We clarify the four fundamental tenets of CI theory, systematize prior work according to whether it deviates from these tenets, and highlight overlooked issues in experimental hygiene for LLMs (e.g., prompt sensitivity, positional bias).
Paperid:46
Authors:Sam Bowyer · Laurence Aitchison · Desi Ivanova
Abstract:
Rigorous statistical evaluations of large language models (LLMs), including valid error bars and significance testing, are essential for meaningful and reliable performance assessment. Currently, when such statistical measures are reported, they typically rely on the Central Limit Theorem (CLT). In this position paper, we argue that while CLT-based methods for uncertainty quantification are appropriate when benchmarks consist of thousands of examples, they fail to provide adequate uncertainty estimates for LLM evaluations that rely on smaller, highly specialized benchmarks. In these small-data settings, we demonstrate that CLT-based methods perform very poorly, usually dramatically underestimating uncertainty (i.e. producing error bars that are too small). We give recommendations for alternative frequentist and Bayesian methods that are both easy to implement and more appropriate in these increasingly common scenarios.
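The small-benchmark failure mode this abstract describes is easy to reproduce. The sketch below (our own illustration, not taken from the paper; the benchmark numbers are hypothetical) contrasts a CLT normal-approximation error bar with the Wilson score interval, one standard small-sample alternative, for an accuracy of 19/20:

```python
import math

def clt_interval(k, n, z=1.96):
    """Normal-approximation (CLT) confidence interval for accuracy k/n."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_interval(k, n, z=1.96):
    """Wilson score interval: better calibrated for small n, stays in [0, 1]."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical small benchmark: 19 of 20 questions answered correctly.
clt_lo, clt_hi = clt_interval(19, 20)
wil_lo, wil_hi = wilson_interval(19, 20)
print(f"CLT:    [{clt_lo:.3f}, {clt_hi:.3f}]")   # upper bound exceeds 1.0
print(f"Wilson: [{wil_lo:.3f}, {wil_hi:.3f}]")
```

With n = 20 the CLT bar is both too narrow and extends above 1.0, while the Wilson interval remains a valid probability range, consistent with the underestimation the authors report.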
Paperid:47
Authors:Ming Jin · Hyunin Lee
Abstract:
This position paper contends that modern AI research must adopt an antifragile perspective on safety—one in which the system's capacity to handle rare or out-of-distribution (OOD) events adapts and expands over repeated exposures. Conventional static benchmarks and single-shot robustness tests overlook the reality that environments evolve and that models, if left unchallenged, can drift into maladaptation (e.g., reward hacking, over-optimization, or atrophy of broader capabilities). We argue that an antifragile approach, one that, rather than striving to rapidly reduce current uncertainties, leverages them to better prepare for potentially greater, more unpredictable uncertainties in the future, is pivotal for the long-term reliability of open-ended ML systems. In this position paper, we first identify key limitations of static testing, including scenario diversity, reward hacking, and over-alignment. We then explore the potential of dynamic, antifragile solutions to manage rare events. Crucially, we advocate for a fundamental recalibration of the methods used to measure, benchmark, and continually improve AI safety over the long term by providing ethical and practical guidelines to the AI safety community.
Paperid:48
Authors:Bruno Mlodozeniec · David Krueger · Richard E Turner
Abstract:
Causal inference is a key research area in machine learning, yet confusion reigns over the tools needed to tackle it. There are prevalent claims in the machine learning literature that you need a bespoke causal framework or notation to answer causal questions. In this paper, we make it clear that you can answer any causal inference question within the realm of probabilistic modelling and inference, without causal-specific tools or notation. Through concrete examples, we demonstrate how causal questions can be tackled by writing down the probability of everything. We argue for the advantages of the generality of the probabilistic modelling lens, when compared to bespoke causal frameworks. Lastly, we reinterpret causal tools as emerging from standard probabilistic modelling and inference, elucidating their necessity and utility.
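The "probability of everything" thesis can be illustrated with a toy discrete model (our own sketch, not drawn from the paper; all probability tables are made up). An intervention is modelled as just another probabilistic model, one in which the treatment no longer depends on the confounder, and both quantities fall out of ordinary marginalization and conditioning:

```python
# Confounded binary model: Z -> X, Z -> Y, X -> Y.
# All probability tables are hypothetical numbers chosen for illustration.
p_z = {1: 0.5, 0: 0.5}                 # p(z)
p_x1_given_z = {1: 0.8, 0: 0.2}        # p(x=1 | z)
def p_y1(x, z):                        # p(y=1 | x, z)
    return {(1, 1): 0.9, (1, 0): 0.6, (0, 1): 0.7, (0, 0): 0.3}[(x, z)]

# Observational p(y=1 | x=1): condition in the full joint.
num = sum(p_z[z] * p_x1_given_z[z] * p_y1(1, z) for z in (0, 1))
den = sum(p_z[z] * p_x1_given_z[z] for z in (0, 1))
obs = num / den

# Interventional p(y=1 | x set to 1): the same machinery applied to a
# second model in which x is fixed exogenously, so p(x=1 | z) becomes 1.
intv = sum(p_z[z] * 1.0 * p_y1(1, z) for z in (0, 1))

print(f"observational  p(y=1 | x=1) = {obs:.2f}")   # 0.84
print(f"interventional              = {intv:.2f}")  # 0.75
```

The two numbers differ because Z confounds X and Y; no causal-specific notation was needed, only two explicit probabilistic models.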
Paperid:50
Authors:Yucen Li · Daohan Lu · Polina Kirichenko · Shikai Qiu · Tim G. J. Rudner · C. Bayan Bruss · Andrew Wilson
Abstract:
To detect distribution shifts and improve model safety, many out-of-distribution (OOD) detection methods rely on the predictive uncertainty or features of supervised models trained on in-distribution data. In this position paper, we critically re-examine this popular family of OOD detection procedures, and we argue that these methods are fundamentally answering the wrong questions for OOD detection. There is no simple fix to this misalignment, since a classifier trained only on in-distribution classes cannot be expected to identify OOD points; for instance, a cat-dog classifier may confidently misclassify an airplane if it contains features that distinguish cats from dogs, despite looking nothing like either class. We find that uncertainty-based methods incorrectly conflate high uncertainty with being OOD, while feature-based methods incorrectly conflate far feature-space distance with being OOD. We show how these pathologies manifest as irreducible errors in OOD detection and identify common settings where these methods are ineffective. Additionally, interventions to improve OOD detection such as feature-logit hybrid methods, scaling of model and data size, epistemic uncertainty representation, and outlier exposure also fail to address this fundamental misalignment in objectives.
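The cat-dog pathology generalizes: a discriminative model scores only the direction that separates its training classes, so a far-away point that lands deep on one side of the boundary receives near-certain confidence. A minimal sketch with a fixed, hypothetical logistic classifier (the weights and feature values are invented for illustration):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical 2-feature logistic "cat vs. dog" classifier whose decision
# rule happens to depend only on feature 0 (say, ear shape).
w = [4.0, 0.0]

def p_dog(x):
    """Predicted probability of 'dog' for feature vector x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

cat = [-1.0, 0.0]        # typical in-distribution cat
airplane = [5.0, 40.0]   # far from both classes in feature space

# Max-softmax confidence is *higher* for the OOD input than for the cat:
print(p_dog(cat))        # low  -> confidently "cat"
print(p_dog(airplane))   # ~1.0 -> confidently "dog", despite being OOD
```

Because confidence measures distance from the decision boundary, not distance from the training distribution, thresholding uncertainty cannot flag the airplane, matching the conflation the authors describe.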
Paperid:51
Authors:Yunke Wang · Yanxi Li · Chang Xu
Abstract:
AI Scaling has traditionally been synonymous with Scaling Up, which builds larger and more powerful models. However, the growing demand for efficiency, adaptability, and collaboration across diverse applications necessitates a broader perspective. This position paper presents a holistic framework for AI scaling, encompassing Scaling Up, Scaling Down, and Scaling Out. It argues that while Scaling Up of models faces inherent bottlenecks, the future trajectory of AI scaling lies in Scaling Down and Scaling Out. These paradigms address critical technical and societal challenges, such as reducing carbon footprint, ensuring equitable access, and enhancing cross-domain collaboration. We explore transformative applications in healthcare, smart manufacturing, and content creation, demonstrating how AI Scaling can enable breakthroughs in efficiency, personalization, and global connectivity. Additionally, we highlight key challenges, including balancing model complexity with interpretability, managing resource constraints, and fostering ethical development. By synthesizing these approaches, we propose a unified roadmap that redefines the future of AI research and application, paving the way for advancements toward Artificial General Intelligence (AGI).
Paperid:52
Authors:Sunayana Rane
Abstract:
AI systems are now ubiquitous in real-world decision-making. However, their use is often invisible and almost always difficult to understand for the ordinary people who now come into contact with AI regularly. As these AI-driven decision-making systems increasingly replace human counterparts, our ability to understand the reasons behind a decision, and to contest that decision fairly, is quickly being eroded. In the United States legal system, due process includes the right to understand the reasons for certain major decisions and the right to openly contest those decisions. Everyone is entitled to due process under the law, and human decision-makers have been required to adhere to due process when making many important decisions that are now slowly being relegated to AI systems. Using two recent court decisions as a foundation, this paper takes the position that AI in its current form cannot guarantee due process, and therefore cannot (and should not) be used to make decisions that should be subject to due process. The supporting legal analysis investigates how the current lack of technical answers about the interpretability and causality of AI decisions, coupled with extreme trade secret protections severely limiting any exercise of the small amount of technical knowledge we do have, serve as a fatal anti-due-process combination. Finally, we discuss why technical researchers are vital to informing the legal process and to protecting due process.
Paperid:53
Authors:D. Sculley · William Cukierski · Phil Culliton · Sohier Dane · Maggie Demkin · Ryan Holbrook · Addison Howard · Paul Mooney · Walter Reade · Meg Risdal · Nate Keating
Abstract:
In this position paper, we observe that empirical evaluation in Generative AI is at a crisis point since traditional ML evaluation and benchmarking strategies are insufficient to meet the needs of evaluating modern GenAI models and systems. There are many reasons for this, including the fact that these models typically have nearly unbounded input and output spaces, typically do not have a well defined ground truth target, and typically exhibit strong feedback loops and prediction dependence based on context of previous model outputs. On top of these critical issues, we argue that the problems of leakage and contamination are in fact the most important and difficult issues to address for GenAI evaluations. Interestingly, the field of AI Competitions has developed effective measures and practices to combat leakage for the purpose of counteracting cheating by bad actors within a competition setting. This makes AI Competitions an especially valuable (but underutilized) resource. Now is the time for the field to view AI Competitions as the gold standard for empirical rigor in GenAI evaluation, and to harness and harvest their results with the value they deserve.
Paperid:55
Authors:Alex Gu · Naman Jain · WenDing Li · Manish Shetty Molahalli · Kevin Ellis · Koushik Sen · Armando Solar-Lezama
Abstract:
AI for software engineering has made remarkable progress, becoming a notable success within generative AI. Despite this, achieving fully automated software engineering is still a significant challenge, requiring research efforts across both academia and industry. In this position paper, our goal is threefold. First, we provide a taxonomy of measures and tasks to categorize work towards AI software engineering. Second, we outline the key bottlenecks permeating today's approaches. Finally, we highlight promising paths towards making progress on these bottlenecks to guide future research in this rapidly maturing field.
Paperid:56
Authors:Hanqi Yan · Linhai Zhang · Jiazheng Li · Zhenyi Shen · Yulan He
Abstract:
Large language models (LLMs) excel in many reasoning tasks but continue to face significant challenges, such as a lack of robustness in reasoning, struggles with cross-task generalization, and inefficiencies in scaling up reasoning capabilities. Current training paradigms, including next-token prediction and reinforcement learning from human feedback, often fall short in adaptability to diverse reasoning tasks. Existing approaches, such as prompt optimization and iterative output refinement, offer performance improvement, but can be inefficient and lack effective generalization. To overcome these limitations, this position paper argues for a transformative shift in how LLMs approach reasoning. Drawing inspiration from cognitive science, particularly meta-reasoning theories such as Dual-Process Theory and Metacognitive Reasoning, we propose a Bayesian meta-reasoning framework for LLMs. Our approach integrates self-awareness, monitoring, evaluation, regulation, and meta-reflection, to enhance LLMs' ability to refine reasoning strategies and generalize across tasks. We revisit existing LLM reasoning methods, identify key challenges, and suggest directions for future research.
Paperid:57
Authors:Seungwook Han · Jyothish Pari · Samuel Gershman · Pulkit Agrawal
Abstract:
Large Language Models (LLMs) have demonstrated impressive real-world utility, exemplifying artificial useful intelligence (AUI). However, their ability to reason adaptively and robustly -- hallmarks of artificial general intelligence (AGI) -- remains limited. While LLMs seemingly succeed in commonsense reasoning, programming, and mathematics, they struggle to generalize algorithmic understanding across novel syntaxes and contexts. Our experiments with algorithmic tasks in esoteric programming languages reveal that LLMs' reasoning is overfitted to the training data, highlighting limitations in adaptability and transferability. In this position paper, we argue that the core issue in current LLMs lies in the coupling of reasoning and knowledge. To transition from AUI to AGI, we propose disentangling knowledge and reasoning through three key directions: (1) integrating RL during pretraining, (2) using artificial environments to learn reasoning priors and to make search in language more efficient, and (3) developing hybrid architectures that decouple memory from reasoning while maintaining dynamic interaction.
Paperid:58
Authors:Paul Youssef · Zhixue Zhao · Daniel Braun · Jörg Schlötterer · Christin Seifert
Abstract:
Large Language Models (LLMs) contain large amounts of facts about the world. These facts can become outdated over time, which has led to the development of knowledge editing methods (KEs) that can change specific facts in LLMs with limited side effects. This position paper argues that editing LLMs poses serious safety risks that have been largely overlooked. First, we note that KEs' wide availability, low computational cost, high performance, and stealth make them an attractive tool for malicious actors. Second, we discuss malicious use cases of KEs, showing how KEs can be easily adapted for a variety of malicious purposes. Third, we highlight vulnerabilities in the AI ecosystem that allow unrestricted uploading and downloading of updated models without verification. Fourth, we argue that a lack of social and institutional awareness exacerbates this risk, and discuss the implications for different stakeholders. We call on the community to (i) research tamper-resistant models and countermeasures against malicious model editing, and (ii) actively engage in securing the AI ecosystem.
Paperid:59
Authors:Audrey Poinsot · Panayiotis Panayiotou · Alessandro Leite · Nicolas CHESNEAU · Özgür Şimşek · Marc Schoenauer
Abstract:
Causal machine learning has the potential to revolutionize decision-making by combining the predictive power of machine learning algorithms with the theory of causal inference. However, these methods remain underutilized by the broader machine learning community, as current empirical evaluations of these methods do not permit assessment of their reliability and robustness, undermining their practical utility. Specifically, one of the principal criticisms made by the community is the extensive use of synthetic experiments. On the contrary, we argue that synthetic experiments are essential and necessary to precisely assess and understand the capabilities of causal machine learning methods. To substantiate our position, we critically review the current evaluation practices, spotlight their shortcomings, and propose a set of principles for conducting rigorous empirical analyses with synthetic data. Adopting our proposed principles will enable comprehensive evaluations that build trust in causal machine learning methods, driving their broader adoption and impactful real-world use.
Paperid:60
Authors:Yan Zhuang · Qi Liu · Zachary Pardos · Patrick Kyllonen · Jiyun Zu · Zhenya Huang · Shijin Wang · Enhong Chen
Abstract:
As AI systems continue to evolve, their rigorous evaluation becomes crucial for their development and deployment. Researchers have constructed various large-scale benchmarks to determine their capabilities, typically testing against a gold-standard test set and reporting metrics averaged across all items. However, this static evaluation paradigm increasingly shows its limitations, including high evaluation costs, data contamination, and the impact of low-quality or erroneous items on evaluation reliability and efficiency. In this position paper, drawing from human psychometrics, we discuss a paradigm shift from static evaluation methods to adaptive testing. This involves estimating the characteristics or value of each test item in the benchmark and tailoring each model's evaluation instead of relying on a fixed test set. This paradigm provides robust ability estimation, uncovering the latent traits underlying a model's observed scores. We analyze the current possibilities, prospects, and reasons for adopting psychometrics in AI evaluation, and argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
Paperid:61
Authors:Michael Kirchhof · Gjergji Kasneci · Enkelejda Kasneci
Abstract:
Large language models (LLMs) and chatbot agents are known to provide wrong outputs at times, and it was recently found that this can never be fully prevented. Hence, uncertainty quantification plays a crucial role, aiming to quantify the level of ambiguity in either one overall number or two numbers for aleatoric and epistemic uncertainty. This position paper argues that this traditional dichotomy of uncertainties is too limited for the open and interactive setup that LLM agents operate in when communicating with a user, and that we need to research avenues that enrich uncertainties in this novel scenario. We review the literature and find that popular definitions of aleatoric and epistemic uncertainties directly contradict each other and lose their meaning in interactive LLM agent settings. Hence, we propose three novel research directions that focus on uncertainties in such human-computer interactions: underspecification uncertainties, for when users do not provide all information or define the exact task at the first go; interactive learning, to ask follow-up questions and reduce the uncertainty about the current context; and output uncertainties, to utilize the rich language and speech space to express uncertainties as more than mere numbers. We expect that these new ways of dealing with and communicating uncertainties will lead to LLM agent interactions that are more transparent, trustworthy, and intuitive.
Paperid:62
Authors:Andreas Haupt · Erik Brynjolfsson
Abstract:
Benchmarks and evaluations are central to machine learning methodology and direct research in the field. Current evaluations test systems in the absence of humans. This position paper argues that the machine learning community should use centaur evaluations, in which humans and AI jointly solve tasks. Centaur evaluations refocus machine learning development toward human augmentation instead of human replacement. They allow for direct evaluation of human-centered desiderata, such as interpretability and helpfulness, and can be more challenging and realistic than existing evaluations. By shifting the focus from automation toward collaboration between humans and AI, centaur evaluations can drive progress toward more effective and human-augmenting AI systems.
Paperid:63
Authors:Lionel Wong · Ayman Ali · Raymond M Xiong · Shannon Shen · Yoon Kim · Monica Agrawal
Abstract:
Patients have long sought health information online, and increasingly, they are turning to generative AI to answer their health-related queries. Given the high stakes of the medical domain, techniques like retrieval-augmented generation and citation grounding have been widely promoted as methods to reduce hallucinations and improve the accuracy of AI-generated responses, and have been widely adopted into search engines. However, we argue that even when these methods produce literally accurate content drawn from source documents sans hallucinations, they can still be highly misleading. Patients may derive significantly different interpretations from AI-generated outputs than they would from reading the original source material, let alone consulting a knowledgeable clinician. Through a large-scale query analysis on topics including disputed diagnoses and procedure safety, we support our argument with quantitative and qualitative evidence of the suboptimal answers resulting from current systems. In particular, we highlight how these models tend to decontextualize facts, omit critical relevant sources, and reinforce patient misconceptions or biases. We propose a series of recommendations---such as the incorporation of communication pragmatics and enhanced comprehension of source documents---that could help mitigate these issues and extend beyond the medical domain.
Paperid:64
Authors:Elliot Meyerson · Xin Qiu
Abstract:
Decomposing hard problems into subproblems often makes them easier and more efficient to solve. With the high cost of running LLMs at scale, there is an increasing effort to decompose systems into sets of LLM-based agents, each of which can be delegated sub-tasks. However, this decomposition (even when automated) is often intuitive, e.g., based on how a human might assign roles to members of a human team. How close are these role decompositions to optimal? This position paper argues that asymptotic analysis with LLM primitives is needed to reason about the efficiency of such problem decompositions, and that insights from such analysis will unlock opportunities for scaling such systems. By treating the LLM forward pass as the atomic unit of computational cost, one can separate out the (often opaque) inner workings of a particular LLM from the inherent efficiency of how a set of LLMs are orchestrated to solve hard problems. In other words, if we want to scale the deployment of LLMs to the limit, instead of anthropomorphizing LLMs, asymptotic analysis with LLM primitives should be used to reason about and develop more powerful decompositions of large problems into LLM agents.
Paperid:65
Authors:Nikhil Kandpal · Colin Raffel
Abstract:
Large Language Model (LLM) training is an increasingly expensive endeavor due to growing computational requirements, hardware demands, energy costs, and engineering labor. Apart from training costs, an oft-overlooked (and seldom paid) cost is the human labor required to write the trillions of words of text used to train state-of-the-art LLMs. In this position paper, we aim to assign a monetary value to this labor and make the case that the most expensive part of producing an LLM should be the compensation provided to training data producers for their work. To support this position we study 64 LLMs released between 2016 and 2024, breaking down both the cost of training the models as well as the hypothetical cost of creating the training data. Our analysis indicates that even with an extremely conservative estimate of how much compensation should be provided for the human labor that went into creating the training data, the costs of these models' training datasets are 1-3 orders of magnitude larger than the costs to train the models themselves. In the face of the massive gap between the value of training data and the current lack of compensation for its creation, we highlight and discuss research directions that could enable fairer practices in the future.
Paperid:66
Authors:Yuhe Guo · Huayi Tang · Jiahong Ma · Hongteng Xu · Zhewei Wei
Abstract:
Spectral graph learning builds upon two foundations: the graph Fourier basis as its theoretical cornerstone, with polynomial approximation to enable practical implementation. While this framework has led to numerous successful designs, we argue that its effectiveness might stem from mechanisms different from its theoretical foundations. In this paper, we identify two fundamental issues that challenge our current understanding: (1) The graph Fourier basis $\mathbf{U}$ (eigenvectors of the normalized graph Laplacian) faces too many questions to truly serve its intended role, particularly in preserving the semantic properties of Fourier analysis; (2) The limitations preventing expressive filters are not merely practical constraints, but fundamental barriers that naturally protect stability and generalization. Importantly, the two issues entangle with each other. The second has obscured the first: the natural avoidance of complex filters has prevented us from fully confronting the questions about $\mathbf{U}$'s role as a Fourier basis. This observation leads to our position: the effectiveness of spectral GNNs relies less on the graph Fourier basis than originally conceived, or, in other words, **spectral GNNs might not be so spectral**. The position leads us to at least two potential research interests: to incorporate a more semantically meaningful graph dictionary other than $\mathbf{U}$, and to re-examine the theoretical role of the introduced polynomial techniques.
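As a minimal, self-contained illustration (not from the paper) of the object under discussion, the sketch below builds the graph Fourier basis $\mathbf{U}$ for a toy 4-node path graph, as the eigenvectors of the symmetric normalized Laplacian. The graph and signal are made up for demonstration; NumPy is assumed.

```python
import numpy as np

# Adjacency matrix of a 4-node path graph (illustrative example).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}.
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt

# eigh returns eigenvalues in ascending order; the columns of U form the
# orthonormal "graph Fourier" basis questioned in the abstract.
eigvals, U = np.linalg.eigh(L)

x = np.array([1.0, 2.0, 3.0, 4.0])  # a graph signal
x_hat = U.T @ x                      # graph Fourier transform
x_rec = U @ x_hat                    # inverse transform recovers x
```

A polynomial spectral filter then acts as `U @ np.diag(p(eigvals)) @ U.T` applied to a signal; practical spectral GNNs approximate this with polynomials of `L` precisely so that `U` never has to be formed explicitly.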
Paperid:67
Authors:Golnaz Mesbahi · Parham Mohammad Panahi · Olya Mastikhina · Steven Tang · Martha White · Adam White
Abstract:
In continual RL we want agents capable of never-ending learning, and yet our evaluation methodologies do not reflect this. The standard practice in RL is to assume unfettered access to the deployment environment for the full lifetime of the agent. For example, agent designers select the best performing hyperparameters in Atari by testing each for 200 million frames and then reporting results on 200 million frames. In this position paper, we argue and demonstrate the pitfalls of this inappropriate empirical methodology: lifetime tuning. We provide empirical evidence to support our position by testing DQN and SAC across several continuing and non-stationary environments with two main findings: (1) lifetime tuning does not allow us to identify algorithms that work well for continual learning---all algorithms equally succeed; (2) recently developed continual RL algorithms outperform standard non-continual algorithms when tuning is limited to a fraction of the agent's lifetime. The goal of this paper is to provide an explanation for why recent progress in continual RL has been mixed and motivate the development of empirical practices that better match the goals of continual RL.
Paperid:68
Authors:Andy Zhang · Kevin Klyman · Yifan Mai · Yoav Levine · Yian Zhang · Rishi Bommasani · Percy Liang
Abstract:
Language models are extensively evaluated, but correctly interpreting evaluation results requires knowledge of train-test overlap, which refers to the extent to which the language model is trained on the very data it is being tested on. The public currently lacks adequate information about train-test overlap: most models have no public train-test overlap statistics, and third parties cannot directly measure train-test overlap since they do not have access to the training data. To make this clear, we document the practices of 30 models, finding that just 9 models report train-test overlap: 4 models release training data under open-source licenses, enabling the community to directly measure train-test overlap, and 5 models publish their train-test overlap methodology and statistics. By engaging with language model developers, we provide novel information about train-test overlap for three additional models. Overall, this position paper argues that language model developers should publish train-test overlap statistics and/or training data whenever they report evaluation results on public test sets. We hope our work increases transparency into train-test overlap to increase the community-wide trust in model evaluations.
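For concreteness, one common family of train-test overlap checks is word n-gram matching between test examples and the training corpus. The sketch below is a simplified, hypothetical version of such a check; the function names, the choice of n=3, and the any-match flagging rule are illustrative assumptions, not any developer's published methodology.

```python
def ngrams(text: str, n: int = 3) -> set:
    """Set of word n-grams in a whitespace-tokenized, lowercased text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_fraction(test_examples, train_corpus, n: int = 3) -> float:
    """Fraction of test examples sharing at least one n-gram with training text."""
    train_grams = ngrams(train_corpus, n)
    flagged = sum(1 for ex in test_examples if ngrams(ex, n) & train_grams)
    return flagged / len(test_examples)

train = "the quick brown fox jumps over the lazy dog"
tests = ["the quick brown fox ran", "a totally unrelated sentence here"]
print(overlap_fraction(tests, train))  # 0.5: only the first example overlaps
```

Real reporting pipelines typically use larger n, normalization, and sharded indices over trillions of tokens, but the quantity reported is of this general shape.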
Paperid:69
Authors:Rashid Mushkani · Hugo Berard · Allison Cohen · Shin Koseki
Abstract:
This position paper proposes a “Right to AI,” which asserts that individuals and communities should meaningfully participate in the development and governance of the AI systems that shape their lives. Motivated by the increasing deployment of AI in critical domains and inspired by Henri Lefebvre's concept of the “Right to the City,” we reconceptualize AI as a societal infrastructure, rather than merely a product of expert design. In this paper, we critically evaluate how generative agents, large-scale data extraction, and diverse cultural values bring new complexities to AI oversight. The paper proposes that grassroots participatory methodologies can mitigate biased outcomes and enhance social responsiveness. It asserts that data is socially produced and should be managed and owned collectively. Drawing on Sherry Arnstein’s Ladder of Citizen Participation and analyzing nine case studies, the paper develops a four-tier model for the Right to AI that situates the current paradigm and envisions an aspirational future. It proposes recommendations for inclusive data ownership, transparent design processes, and stakeholder-driven oversight. We also discuss market-led and state-centric alternatives and argue that participatory approaches offer a better balance between technical efficiency and democratic legitimacy.
Paperid:70
Authors:Kaiyu Yang · Gabriel Poesia · Jingxuan He · Wenda Li · Kristin Lauter · Swarat Chaudhuri · Dawn Song
Abstract:
AI for Mathematics (AI4Math) is intellectually intriguing and is crucial for AI-driven system design and verification. Extensive efforts on AI4Math have mirrored techniques in NLP, in particular, training large language models on carefully curated math datasets in text form. As a complementary yet less explored avenue, formal mathematical reasoning is grounded in formal systems such as proof assistants, which can verify the correctness of reasoning and provide automatic feedback. This position paper advocates formal mathematical reasoning as an indispensable component in future AI for math, formal verification, and verifiable generation. We summarize existing progress, discuss open challenges, and envision critical milestones to measure future success.
Paperid:71
Authors:Jillian Fisher · Ruth Elisabeth Appel · Chan Young Park · Yujin Potter · Liwei Jiang · Taylor Sorensen · Shangbin Feng · Yulia Tsvetkov · Margaret Roberts · Jennifer Pan · Dawn Song · Yejin Choi
Abstract:
AI systems often exhibit political bias, influencing users' opinions and decision-making. While political neutrality—defined as the absence of bias—is seen as an ideal solution for fairness and safety, this position paper argues that true political neutrality is neither feasible nor universally desirable due to its subjective nature and the biases inherent in AI training data, algorithms, and user interactions. However, inspired by Joseph Raz's philosophical insight that "neutrality [...] can be a matter of degree" (Raz, 1986), we argue that striving for some neutrality remains essential for promoting balanced AI interactions and mitigating user manipulation. Therefore, we use the term "approximation" of political neutrality to shift the focus from unattainable absolutes to achievable, practical proxies. We propose eight techniques for approximating neutrality across three levels of conceptualizing AI, examining their trade-offs and implementation strategies. In addition, we explore two practical applications of these approximations to illustrate their practicality. Finally, we assess our framework on current large language models (LLMs) at the output level, providing a demonstration of how it can be evaluated. This work seeks to advance nuanced discussions of political neutrality in AI and promote the development of responsible, aligned language models.
Paperid:73
Authors:Andrew C. Cullen · Paul MONTAGUE · Sarah Erfani · Benjamin Rubinstein
Abstract:
While certified robustness is widely promoted as a solution to adversarial examples in machine learning systems, significant challenges remain before these techniques can be meaningfully deployed in real-world applications. We identify critical gaps in current research, including the paradox of detection without distinction, the lack of clear criteria for practitioners to evaluate certification schemes, and the potential security risks arising from users' expectations surrounding "guaranteed" robustness claims. This position paper is a call to arms for the certification research community, proposing concrete steps to address these fundamental challenges and advance the field toward practical applicability.
Paperid:75
Authors:Indradyumna Roy · Saswat Meher · Eeshaan Jain · Soumen Chakrabarti · Abir De
Abstract:
Data sets used in recent work on graph similarity scoring and matching tasks suffer from significant limitations. Using Graph Edit Distance (GED) as a showcase, we highlight pervasive issues such as train-test leakage and poor generalization, which have misguided the community's understanding and assessment of the capabilities of a method or model. These limitations arise, in part, because preparing labeled data is computationally expensive for combinatorial graph problems. We establish some key properties of GED that enable scalable data augmentation for training, and adversarial test set generation. Together, our analysis, experiments and insights establish new, sound guidelines for designing and evaluating future neural networks, and suggest open challenges for future research.
Paperid:76
Authors:Judy Hanwen Shen
Abstract:
Foundation models that are capable of automating cognitive tasks represent a pivotal technological shift, yet their societal implications remain unclear. These systems promise exciting advances, yet they also risk flooding our information ecosystem with formulaic, homogeneous, and potentially misleading synthetic content. Developing benchmarks grounded in real use cases where these risks are most significant is therefore critical. Through a thematic analysis using 2 million language model user prompts, we identify creative composition tasks as a prevalent usage category where users seek help with personal tasks that require everyday creativity. Our fine-grained analysis identifies mismatches between current benchmarks and usage patterns among these tasks. Crucially, we argue that the same use cases that currently lack thorough evaluations can lead to negative downstream impacts. This position paper argues that benchmarks focused on creative composition tasks are a necessary step towards understanding the societal harms of AI-generated content. We call for greater transparency in usage patterns to inform the development of new benchmarks that can effectively measure both the progress and the impacts of models with creative capabilities.
Paperid:77
Authors:Patrik Reizinger · Randall Balestriero · David Klindt · Wieland Brendel
Abstract:
Self-Supervised Learning (SSL) powers many current AI systems. As research interest and investment grow, the SSL design space continues to expand. The Platonic view of SSL, following the Platonic Representation Hypothesis (PRH), suggests that despite different methods and engineering approaches, all representations converge to the same Platonic ideal. However, this phenomenon lacks precise theoretical explanation. By synthesizing evidence from Identifiability Theory (IT), we show that the PRH can emerge in SSL. However, current IT cannot explain SSL's empirical success. To bridge the gap between theory and practice, we propose expanding IT into what we term Singular Identifiability Theory (SITh), a broader theoretical framework encompassing the entire SSL pipeline. SITh would allow deeper insights into the implicit data assumptions in SSL and advance the field towards learning more interpretable and generalizable representations. We highlight three critical directions for future research: 1) training dynamics and convergence properties of SSL; 2) the impact of finite samples, batch size, and data diversity; and 3) the role of inductive biases in architecture, augmentations, initialization schemes, and optimizers.
Paperid:78
Authors:Ossi Räisä · Boris van Breugel · M van der Schaar
Abstract:
Any method's development and practical application are limited by our ability to measure its reliability. The popularity of generative modeling emphasizes the importance of good synthetic data metrics. Unfortunately, previous works have found many failure cases in current metrics, for example lack of outlier robustness and unclear lower and upper bounds. We propose a list of desiderata for synthetic data metrics, and a suite of sanity checks: carefully chosen simple experiments that aim to detect specific and known generative modeling failure modes. Based on these desiderata and the results of our checks, we arrive at our position: all current generative fidelity and diversity metrics are flawed. This significantly hinders practical use of synthetic data. Our aim is to convince the research community to spend more effort in developing metrics, instead of models. Additionally, through analyzing how current metrics fail, we provide practitioners with guidelines on how these metrics should (not) be used.
Paperid:79
Authors:Yash Goel · Ayan Sengupta · Tanmoy Chakraborty
Abstract:
We challenge the dominant focus on neural scaling laws and advocate for a paradigm shift toward downscaling in the development of large language models (LLMs). While scaling laws have provided critical insights into performance improvements through increasing model and dataset size, we emphasize the significant limitations of this approach, particularly in terms of computational inefficiency, environmental impact, and deployment constraints. To address these challenges, we propose a holistic framework for downscaling LLMs that seeks to maintain performance while drastically reducing resource demands. This paper outlines practical strategies for transitioning away from traditional scaling paradigms, advocating for a more sustainable, efficient, and accessible approach to LLM development.
Paperid:80
Authors:Sanchaita Hazra · Bodhisattwa Prasad Majumder · Tuhin Chakrabarty
Abstract:
Current efforts in AI safety prioritize filtering harmful content, preventing manipulation of human behavior, and eliminating existential risks in cybersecurity or biosecurity. While pressing, this narrow focus overlooks critical human-centric considerations that shape the long-term trajectory of a society. In this position paper, we identify the risks of overlooking the impact of AI on the future of work and recommend comprehensive transition support towards the evolution of meaningful labor with human agency. Through the lens of economic theories, we highlight the intertemporal impacts of AI on human livelihood and the structural changes in labor markets that exacerbate income inequality. Additionally, the closed-source approach of major stakeholders in AI development resembles rent-seeking behavior through exploiting resources, breeding mediocrity in creative labor, and monopolizing innovation. To address this, we argue in favor of a robust international copyright anatomy supported by implementing collective licensing that ensures fair compensation mechanisms for using data to train AI models. We strongly recommend a pro-worker framework of global AI governance to enhance shared prosperity and economic justice while reducing technical debt.
Paperid:82
Authors:Matthew Riemer · Zahra Ashktorab · Djallel Bouneffouf · Payel Das · Miao Liu · Justin Weisz · Murray Campbell
Abstract:
This position paper argues that the majority of theory of mind benchmarks are broken because of their inability to directly test how large language models (LLMs) adapt to new partners. This problem stems from the fact that theory of mind benchmarks for LLMs are overwhelmingly inspired by the methods used to test theory of mind in humans and fall victim to a fallacy of attributing human-like qualities to AI agents. We expect that humans will engage in a consistent reasoning process across various questions about a situation, but this is known to not be the case for current LLMs. Most theory of mind benchmarks only measure what we call literal theory of mind: the ability to predict the behavior of others. Measuring this kind of reasoning is very informative in testing the ability of agents with self-consistent reasoning. However, it is important to note the distinction between this and what we actually care about when this self-consistency cannot be taken for granted. We call this functional theory of mind: the ability to adapt to agents in-context following a rational response to predictions about their behavior. We find that top performing open source LLMs may display strong capabilities in literal theory of mind, depending on how they are prompted, but seem to struggle with functional theory of mind -- even when partner policies are exceedingly simple. Simply put, strong literal theory of mind performance does not necessarily imply strong functional theory of mind performance. Achieving functional theory of mind, particularly over long interaction horizons with a partner, is a significant challenge deserving a prominent role in any meaningful LLM theory of mind evaluation.
Paperid:83
Authors:Matej Pičulin · Bernarda Petek · Irena Ograjenšek · Erik Štrumbelj
Abstract:
In this position paper, we argue that user studies are key to evaluating explainable AI and that the current state of user studies is detrimental to the advancement of the field. We support this argument with a review of general and explainable-AI-specific challenges, as well as an analysis of 607 explainable AI papers featuring user studies. To address this issue, we call for better-designed user studies and greater appreciation of high-quality user studies in the AI community.
Paperid:84
Authors:Shayne Longpre · Kevin Klyman · Ruth Elisabeth Appel · Sayash Kapoor · Rishi Bommasani · Michelle Sahar · Sean McGregor · Avijit Ghosh · Borhane Blili-Hamelin · Nathan Butters · Alondra Nelson · Amit Elazari · Andrew Sellars · Casey Ellis · Dane Sherrets · Dawn Song · Harley Geiger · Ilona Cohen · Lauren McIlvenny · Madhulika Srikumar · Mark Jaycox · Markus Anderljung · Nadine Johnson · Nicholas Carlini · Nicolas Miailhe · Nik Marda · Peter Henderson · Rebecca Portnoff · Rebecca Weiss · Victoria Westerhoff · Yacine Jernite · Rumman Chowdhury · Percy Liang · Arvind Narayanan
Abstract:
The widespread deployment of general-purpose AI (GPAI) systems introduces significant new risks. Yet the infrastructure, practices, and norms for reporting flaws in GPAI systems lag far behind more mature industries, like software security. Based on a collaboration between experts from software security, machine learning, law, and policy, we identify key gaps in the evaluation and reporting of GPAI system flaws. Our position paper argues for three interventions to ultimately improve system safety: (1) the use of standardized AI system flaw reporting, (2) the adoption of flaw disclosure programs with researcher protections, and (3) infrastructure to coordinate flaw disclosure across stakeholders. First, we propose standardized AI flaw reports, to ease the process of submitting, reproducing, and triaging identified system flaws. Second, we propose GPAI system providers should adopt broadly-scoped flaw disclosure programs, borrowing from bug bounty programs, with legal safe harbors to protect both researchers and themselves. Third, we advocate for new infrastructure to coordinate flaw reports across the many stakeholders who may be impacted by flaws that commonly affect many independent GPAI systems (e.g., certain jailbreaks). By promoting robust safety reporting and coordination in the GPAI ecosystem, these changes, if adopted, could ultimately improve AI system safety, security, and accountability.
Paperid:85
Authors:David Danhofer · Davide D'Ascenzo · Rafael Dubach · Tomaso A Poggio
Abstract:
Overparametrized Deep Neural Networks (DNNs) have demonstrated remarkable success in a wide variety of domains too high-dimensional for classical learners and shallow networks subject to the curse of dimensionality. However, open questions remain about the fundamental principles that govern the learning dynamics of DNNs. In this position paper, we argue that it is the ability of DNNs to exploit the compositionally sparse structure of the target function that drives their success. As such, DNNs can leverage the property that most functions can be composed from a small set of constituent functions, each of which relies only on a low-dimensional subset of all inputs. We show that this property is shared by all efficiently Turing-computable functions and is therefore highly likely to be present in all current learning problems. While some highly promising theoretical insights on questions concerned with approximation and generalization exist in the setting of compositionally sparse functions, several important questions on the learnability and optimization of DNNs remain open. Completing the picture on the role of compositional sparsity in deep learning is essential to a comprehensive theory of artificial - and even general - intelligence.
Paperid:86
Authors:Tobin South · Samuele Marro · Thomas Hardjono · Robert Mahari · Cedric Whitney · Alan Chan · Alex Pentland
Abstract:
The rapid deployment of autonomous AI agents creates urgent challenges in the areas of authorization, accountability, and access control in task delegation. This position paper argues that authenticated and auditable delegation of authority to AI agents is a critical component of mitigating practical risks and unlocking the value of agents. To support this argument, we examine how existing web authentication and authorization protocols, as well as natural language interfaces to common access control mechanisms, can be extended to enable secure authenticated delegation of authority to AI agents. By extending OAuth 2.0 and OpenID Connect with agent-specific credentials and using transparent translation of natural language permissions into robust scoping rules across diverse interaction modalities, we outline how authenticated delegation can be achieved to enable clear chains of accountability while maintaining compatibility with established authentication and web infrastructure, enabling immediate adoption. This work contributes to ensuring that agentic AI systems perform only appropriate actions. It argues for prioritizing delegation infrastructure as a key component of AI agent governance and provides a roadmap for achieving this.
Paperid:88
Authors:Frithjof Gressmann · Ashley Chen · Lily Xie · Nancy Amato · Lawrence Rauchwerger
Abstract:
Recent advances in bioengineering have enabled the creation of biological neural networks in vitro, significantly reducing the cost, ethical hurdles, and complexity of experimentation with genuine biological neural computation. In this position paper, we argue that this trend offers a unique and timely opportunity to put our understanding of neural computation to the test. By designing artificial neural networks that can interact with and control living neural systems, it is becoming possible to validate computational models beyond simulation and gain empirical insights to help unlock more robust and energy-efficient next-generation AI systems. We provide an overview of key technologies, challenges, and principles behind this development and describe strategies and opportunities for novel machine learning research in this emerging field. We also discuss implications and fundamental questions that could be answered as this technology advances, exemplifying the longer-term impact of increasingly sophisticated in vitro neural networks.
Paperid:89
Authors:Samuel Gabriel Müller · Arik Reuter · Noah Hollmann · David Rügamer · Frank Hutter
Abstract:
Training neural networks on randomly generated artificial datasets leads to Bayesian models that capture the prior defined by the dataset-generating distribution. Prior-data Fitted Networks (PFNs) are a class of methods designed to leverage this insight. In an era of rapidly increasing pre-training computational resources and a near stagnation in the generation of new real-world data in many applications, PFNs are poised to play a more important role across a wide range of applications. They enable the efficient allocation of pre-training compute to low-data scenarios. Originally applied to small Bayesian modeling tasks, the field of PFNs has significantly expanded to address more complex domains and larger datasets. This position paper argues that PFNs and other amortized inference approaches represent the future of Bayesian inference, leveraging amortized learning to tackle data-scarce problems. We thus believe they are a fruitful area of research. In this position paper, we explore their potential and directions to address their current limitations.
Paperid:90
Authors:Sebastin Santy · Prasanta Bhattacharya · Manoel Ribeiro · Kelsey Allen · Sewoong Oh
Abstract:
Progress in AI has relied on human-generated data, from annotator marketplaces to the wider Internet. However, the widespread use of large language models now threatens the quality and integrity of human-generated data on these very platforms. We argue that this issue goes beyond the immediate challenge of filtering AI-generated content -- it reveals deeper flaws in how data collection systems are designed. Existing systems often prioritize speed, scale, and efficiency at the cost of intrinsic human motivation, leading to declining engagement and data quality. We propose that rethinking data collection systems to align with contributors' intrinsic motivations--rather than relying solely on external incentives--can help sustain high-quality data sourcing at scale while maintaining contributor trust and long-term participation.
Paperid:91
Authors:Tennison Liu · M van der Schaar
Abstract:
Self-improving agents aim to continuously acquire new capabilities with minimal supervision. However, current approaches face two key limitations: rigid self-improvement processes fail to generalize across domain/task distributions, and they fail to scale with increasing agent capabilities. We argue that self-improvement requires metacognitive learning---an agent's intrinsic ability to actively evaluate, reflect on, and adapt its learning processes dynamically. Drawing parallels to human metacognition, we introduce a formal framework with three components: metacognitive knowledge (self-assessment of capabilities, tasks, and strategies), metacognitive planning (deciding what and how to learn), and metacognitive evaluation (reflecting on past experiences to refine learning). Analyzing current self-improving agents, we find they rely heavily on extrinsic metacognitive mechanisms, where fixed, human-designed self-improvement loops limit scalability and adaptability. Through analyzing each component, we contend that many of the ingredients needed to realize intrinsic metacognition are already present. Finally, we explore methods to optimally balance metacognitive responsibilities between humans and agents, and identify the key technical hurdles that will enable self-improving agents to sustain efficient, generalizable, and aligned learning over time.
Paperid:92
Authors:Andrew Wilson
Abstract:
Deep learning is often seen as different from other model classes by defying conventional notions of generalization. Popular examples of anomalous generalization behavior include benign overfitting, double descent, and the success of overparametrization. This position paper argues that these phenomena are not distinct to neural networks, or particularly mysterious. Moreover, this generalization behavior can be intuitively understood, and rigorously characterized using longstanding generalization frameworks such as PAC-Bayes and finite hypothesis bounds. We present soft inductive biases as a key unifying principle in explaining these phenomena: rather than restricting the hypothesis space to avoid overfitting, embrace a flexible hypothesis space, with a soft preference for simpler solutions that are consistent with the data. This principle can be encoded in many model classes, and thus deep learning is not as mysterious or different from other model classes as it might seem.
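As a concrete instance of the longstanding frameworks this abstract invokes (a standard result for illustration, not this paper's derivation), the finite hypothesis bound states that with probability at least $1-\delta$ over $n$ i.i.d. samples, simultaneously for every $h$ in a finite class $\mathcal{H}$,

```latex
R(h) \;\le\; \hat{R}(h) \;+\; \sqrt{\frac{\ln|\mathcal{H}| + \ln(1/\delta)}{2n}},
```

where $R$ is the true risk and $\hat{R}$ the empirical risk. On the soft-inductive-bias reading, a large (flexible) $\mathcal{H}$ inflates the $\ln|\mathcal{H}|$ term, but replacing the uniform count with a prior that concentrates on simple hypotheses (as in PAC-Bayes) recovers tight bounds without restricting the hypothesis space.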
Paperid:93
Authors:Simone Drago · Marco Mussi · Alberto Maria Metelli
Abstract:
Mainstream research in theoretical RL is currently focused on designing online learning algorithms with regret bounds that match the corresponding regret lower bound up to multiplicative constants (and, sometimes, logarithmic terms). In this position paper, we constructively question this trend, arguing that algorithms should be designed to at least minimize the amount of unnecessary exploration, and we highlight the significant role constants play in algorithms' actual performance. This trend also exacerbates the misalignment between theoretical researchers and practitioners. As an emblematic example, we consider the case of regret minimization in finite-horizon tabular MDPs. Starting from the well-known UCBVI algorithm, we improve the bonus terms and the corresponding regret analysis. Additionally, we compare our version of UCBVI with both its original version and the state-of-the-art MVP algorithm. Our empirical validation successfully demonstrates how improving the multiplicative constants has significant positive effects on the actual empirical performance of the algorithm under analysis. This raises the question of whether ignoring constants when assessing whether algorithms match the lower bound is the proper approach.
Paperid:94
Authors:Moming Duan · Mingzhe Du · Rui Zhao · Mengying Wang · Yinghui Wu · Nigel Shadbolt · Bingsheng He
Abstract:
The Machine Learning (ML) community has witnessed explosive growth, with millions of ML models being published on the Web. Reusing ML model components is now prevalent. Developers are often required to choose a license to publish and govern the use of their models. Popular options include Apache-2.0, OpenRAIL (Responsible AI Licenses), Creative Commons Licenses (CCs), Llama 2, and GPL-3.0. Currently, no standard or widely accepted best practices exist for model licensing. But does this lack of standardization lead to undesired consequences? Our answer is Yes. After reviewing the clauses of the most widely adopted licenses, we take the position that current model licensing practices are dragging us into a quagmire of legal noncompliance. To support this view, we explore the current practices in model licensing and highlight the differences between various model licenses. We then identify potential legal risks associated with these licenses and demonstrate these risks using examples from real-world repositories on Hugging Face. To foster a more standardized future for model licensing, we also propose a new draft of model licenses, MGLs, to address these challenges and promote better compliance.
Paperid:96
Authors:Hanna Wallach · Meera Desai · A. Feder Cooper · Angelina Wang · Chad Atalla · Solon Barocas · Su Lin Blodgett · Alexandra Chouldechova · Emily Corvi · P. Alex Dow · Jean Garcia-Gathright · Alexandra Olteanu · Nicholas Pangakis · Stefanie Reed · Emily Sheng · Dan Vann · Jennifer Wortman Vaughan · Matthew Vogel · Hannah Washington · Abigail Z. Jacobs
Abstract:
The measurement tasks involved in evaluating generative AI (GenAI) systems are especially difficult, leading to what has been described as "a tangle of sloppy tests [and] apples-to-oranges comparisons" (Roose, 2024). In this position paper, we argue that the ML community would benefit from learning from and drawing on the social sciences when developing and using measurement instruments for evaluating GenAI systems. Specifically, our position is that evaluating GenAI systems is a social science measurement challenge. We present a four-level framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, behaviors, and impacts of GenAI. This framework has two important implications for designing and evaluating evaluations: First, it can broaden the expertise involved in evaluating GenAI systems by enabling stakeholders with different perspectives to participate in conceptual debates. Second, it brings rigor to both conceptual and operational debates by offering a set of lenses for interrogating the validity of measurement instruments and their resulting measurements.
Paperid:97
Authors:Borhane Blili-Hamelin · Christopher Graziul · Leif Hancox-Li · Hananel Hazan · El-Mahdi El-Mhamdi · Avijit Ghosh · Katherine Heller · Jacob Metcalf · Fabricio Murai · Eryk Salvaggio · ANDREW SMART · Todd Snider · Mariame Tighanimine · Talia Ringer · Margaret Mitchell · Shiri Dori-Hacohen
Abstract:
The AI research community plays a vital role in shaping the scientific, engineering, and societal goals of AI research. In this position paper, we argue that focusing on the highly contested topic of artificial general intelligence ('AGI') undermines our ability to choose effective goals. We identify six key traps---obstacles to productive goal setting---that are aggravated by AGI discourse: Illusion of Consensus, Supercharging Bad Science, Presuming Value-Neutrality, Goal Lottery, Generality Debt, and Normalized Exclusion. To avoid these traps, we argue that the AI research community needs to (1) prioritize specificity in engineering and societal goals, (2) center pluralism about multiple worthwhile approaches to multiple valuable goals, and (3) foster innovation through greater inclusion of disciplines and communities. Therefore, the AI research community needs to stop treating "AGI" as the north-star goal of AI research.
Paperid:98
Authors:Anish Dhir · Ruby Sedgwick · Avinash Kori · Ben Glocker · Mark van der Wilk
Abstract:
Current causal discovery approaches require restrictive model assumptions in the absence of interventional data to ensure structure identifiability. These assumptions often do not hold in real-world applications, leading to a loss of guarantees and poor performance in practice. Recent work has shown that, in the bivariate case, Bayesian model selection can greatly improve performance by exchanging restrictive modelling for more flexible assumptions, at the cost of a small probability of making an error. Our work shows that this approach is useful in the important multivariate case as well. We propose a scalable algorithm leveraging a continuous relaxation of the discrete model selection problem. Specifically, we employ the Causal Gaussian Process Conditional Density Estimator (CGP-CDE) as a Bayesian non-parametric model, using its hyperparameters to construct an adjacency matrix. This matrix is then optimised using the marginal likelihood and an acyclicity regulariser, giving the maximum a posteriori causal graph. We demonstrate the competitiveness of our approach, showing it is advantageous to perform multivariate causal discovery without infeasible assumptions using Bayesian model selection.
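The CGP-CDE specifics are not reproduced in the abstract, but acyclicity regularisers of the kind mentioned above are typically NOTEARS-style differentiable penalties that vanish exactly when the weighted adjacency matrix encodes a DAG. A minimal NumPy sketch of that standard penalty (an illustration of the generic technique, not this paper's implementation):

```python
import numpy as np

def acyclicity_penalty(A: np.ndarray) -> float:
    """Polynomial NOTEARS-style penalty: tr((I + (A∘A)/d)^d) - d.
    Equals zero iff the weighted adjacency matrix A encodes a DAG."""
    d = A.shape[0]
    M = np.eye(d) + (A * A) / d  # elementwise square removes signs
    return float(np.trace(np.linalg.matrix_power(M, d)) - d)

dag = np.array([[0.0, 1.0],
                [0.0, 0.0]])     # edge 0 -> 1 only: acyclic
cyc = np.array([[0.0, 1.0],
                [1.0, 0.0]])     # edges 0 <-> 1: a cycle
```

During optimisation such a penalty is added to the (negative) marginal likelihood and tightened towards zero, driving the learned adjacency matrix towards a DAG.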
Paperid:99
Authors:Jingfeng Zhang · Prashanth Krishnamurthy · Naman Patel · Anthony Tzes · Farshad Khorrami
Abstract:
Despite the success of current multimodal learning at scale, its susceptibility to data poisoning attacks poses security concerns in critical applications. Attackers can manipulate model behavior by injecting maliciously crafted yet minute instances into the training set, stealthily mismatching distinct concepts. Recent studies have manifested this vulnerability by poisoning multimodal tasks such as Text-Image Retrieval (TIR) and Visual Question Answering (VQA). However, current attack methods rely only on random choices of concepts for misassociation and random instance selection for injecting the poisoning noise, which often achieves suboptimal effects and even risks failure due to the dilution of poisons by the large number of benign instances. This study introduces MP-Nav (Multimodal Poison Navigator), a plug-and-play module designed to evaluate and even enhance data poisoning attacks against multimodal models. MP-Nav operates at both the concept and instance levels, identifying semantically similar concept pairs and selecting robust instances to maximize the attack efficacy. The experiments corroborate that MP-Nav can significantly improve the efficacy of state-of-the-art data poisoning attacks such as AtoB and ShadowCast in multimodal tasks, and maintain model utility across diverse datasets. Notably, this study underscores the vulnerabilities of multimodal models and calls for corresponding defenses.
Paperid:100
Authors:Wenju Sun · Qingyong Li · Yangliao Geng · Boyang Li
Abstract:
Multi-task model merging offers a promising paradigm for integrating multiple expert models into a unified system without additional training. Existing state-of-the-art techniques, such as Task Arithmetic and its variants, merge models by accumulating task vectors—defined as the parameter differences between pre-trained and fine-tuned models. However, task vector accumulation is often hindered by knowledge conflicts, where conflicting components across different task vectors can lead to performance degradation during the merging process. To address this challenge, we propose Conflict-Aware Task Merging (CAT Merging), a novel training-free framework that selectively trims conflict-prone components from the task vectors. CAT Merging introduces several parameter-specific strategies, including projection for linear weights and masking for scaling and shifting parameters in normalization layers. Extensive experiments on vision and vision-language tasks demonstrate that CAT Merging effectively suppresses knowledge conflicts, achieving average accuracy improvements of up to 4.7% (ViT-B/32) and 2.0% (ViT-L/14) over state-of-the-art methods.
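For context, the task-vector accumulation that CAT Merging refines can be sketched in a few lines. This shows the plain Task Arithmetic baseline on illustrative toy weights, not CAT Merging's projection/masking steps:

```python
import numpy as np

def task_vector(pretrained: dict, finetuned: dict) -> dict:
    """Task vector = fine-tuned minus pre-trained parameters (Task Arithmetic)."""
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def merge(pretrained: dict, task_vectors: list, lam: float = 1.0) -> dict:
    """Baseline merge: add the scaled sum of task vectors to the pre-trained weights."""
    merged = {k: v.copy() for k, v in pretrained.items()}
    for tv in task_vectors:
        for k, delta in tv.items():
            merged[k] += lam * delta
    return merged

# Toy example: two experts fine-tuned from the same pre-trained weights.
pre = {"w": np.array([1.0, 1.0])}
ft_a = {"w": np.array([2.0, 1.0])}   # expert for task A
ft_b = {"w": np.array([1.0, 3.0])}   # expert for task B
out = merge(pre, [task_vector(pre, ft_a), task_vector(pre, ft_b)])
# out["w"] == [2.0, 3.0]
```

The knowledge conflicts described in the abstract arise when the per-task deltas point in opposing directions for the same parameter; CAT Merging's contribution is deciding which components of each delta to trim before this accumulation.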
Paperid:101
Authors:Mohit Pandey · Gopeshh Subbaraj · Artem Cherkasov · Martin Ester · Emmanuel Bengio
Abstract:
Generative Flow Networks (GFlowNets) have recently emerged as a suitable framework for generating diverse and high-quality molecular structures by learning from rewards treated as unnormalized distributions. Previous works in this framework often restrict exploration by using predefined molecular fragments as building blocks, limiting the chemical space that can be accessed. In this work, we introduce Atomic GFlowNets (A-GFNs), a foundational generative model leveraging individual atoms as building blocks to explore drug-like chemical space more comprehensively. We propose an unsupervised pre-training approach using drug-like molecule datasets, which teaches A-GFNs about inexpensive yet informative molecular descriptors such as drug-likeness, topological polar surface area, and synthetic accessibility scores. These properties serve as proxy rewards, guiding A-GFNs towards regions of chemical space that exhibit desirable pharmacological properties. We further implement a goal-conditioned finetuning process, which adapts A-GFNs to optimize for specific target properties. In this work, we pretrain A-GFN on a subset of the ZINC dataset, and by employing robust evaluation metrics we show the effectiveness of our approach when compared to other relevant baseline methods for a wide range of drug design tasks.
Paperid:102
Authors:Shihao Zou · Qingfeng Li · Wei Ji · Jingjing Li · Yongkui Yang · Guoqi Li · Chao Dong
Abstract:
Spiking Neural Networks (SNNs) have shown competitive performance to Artificial Neural Networks (ANNs) in various vision tasks, while offering superior energy efficiency. However, existing SNN-based Transformers primarily focus on single-image tasks, emphasizing spatial features while not effectively leveraging SNNs' efficiency in video-based vision tasks. In this paper, we introduce SpikeVideoFormer, an efficient spike-driven video Transformer, featuring linear temporal complexity $\mathcal{O}(T)$. Specifically, we design a spike-driven Hamming attention (SDHA) which provides a theoretically guided adaptation from traditional real-valued attention to spike-driven attention. Building on SDHA, we further analyze various spike-driven space-time attention designs and identify an optimal scheme that delivers appealing performance for video tasks, while maintaining only linear temporal complexity. The generalization ability and efficiency of our model are demonstrated across diverse downstream video tasks, including classification, human pose tracking, and semantic segmentation. Empirical results show our method achieves state-of-the-art (SOTA) performance compared to existing SNN approaches, with over 15\% improvement on the latter two tasks. Additionally, it matches the performance of recent ANN-based methods while offering significant efficiency gains, achieving $\times 16$, $\times 10$ and $\times 5$ improvements on the three tasks, respectively. _Source code will be available._
Paperid:103
Authors:Wei Chen · Shigui Li · Jiacheng Li · Junmei Yang · John Paisley · Delu Zeng
Abstract:
Density ratio estimation is critical in applications involving f-divergences, yet existing methods struggle with multi-modal distributions or significant distributional discrepancies, leading to the \textit{density-chasm} problem. To address this, we propose Dequantified Diffusion-Bridge Interpolants (DDBI), which leverage diffusion processes to model smooth transitions between distributions. By integrating optimal transport theory, we extend DDBI to solve the Schrödinger-Bridge problem, resulting in Dequantified Schrödinger-Bridge Interpolants (DSBI). Combining these, we present the Dequantified Diffusion-bridge Density-Ratio Estimation ($\text{D}^3\text{RE}$) framework, which theoretically reduces estimation error in asymptotic density ratio estimation. Experiments demonstrate $\text{D}^3\text{RE}$'s effectiveness in a variety of tasks, including mutual information estimation and density estimation.
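The diffusion-bridge constructions are beyond an abstract-level sketch, but the classical density-ratio trick underlying this line of work (standard background, not the paper's method) is easy to state: with equal priors, the Bayes-optimal classifier $c(x)$ between samples of $p$ and $q$ satisfies $p(x)/q(x) = c(x)/(1-c(x))$. A toy check on two Gaussians with a large mean gap, the kind of discrepancy behind the density-chasm problem:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Two 1-D densities with well-separated modes.
p = lambda x: gauss_pdf(x, mu=0.0, sigma=1.0)
q = lambda x: gauss_pdf(x, mu=4.0, sigma=1.0)

def posterior(x):
    """Bayes-optimal classifier c(x) = P(sample came from p | x), equal priors."""
    return p(x) / (p(x) + q(x))

x = 1.0
ratio_direct = p(x) / q(x)
ratio_via_classifier = posterior(x) / (1.0 - posterior(x))  # density-ratio trick
```

In practice $c$ is a learned classifier rather than the exact posterior, and when the two distributions barely overlap the classifier saturates, which is precisely the failure mode the bridge-based interpolants are designed to smooth over.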
Paperid:104
Authors:Jinda Lu · Junkang Wu · Jinghan Li · Xiaojun Jia · Shuo Wang · Yi-Fan Zhang · Junfeng Fang · Xiang Wang · Xiangnan He
Abstract:
Direct Preference Optimization (DPO) has shown effectiveness in aligning multimodal large language models (MLLMs) with human preferences. However, existing methods exhibit an imbalanced responsiveness to data of varying hardness, tending to overfit on the easy-to-distinguish data while underfitting on the hard-to-distinguish data. In this paper, we propose Data- and Model-aware DPO (DAMO) to dynamically adjust the optimization process from two key aspects: (1) a data-aware strategy that incorporates data hardness, and (2) a model-aware strategy that integrates real-time model responses. By combining the two strategies, DAMO enables the model to effectively adapt to data with varying levels of hardness. Extensive experiments on five benchmarks demonstrate that DAMO not only significantly enhances the trustworthiness, but also improves the effectiveness over general tasks. For instance, on the Object HalBench, our DAMO-7B reduces response-level and mentioned-level hallucination by 90.0% and 95.3%, respectively, surpassing the performance of GPT-4V.
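For reference, the standard DPO objective that such data- and model-aware strategies modulate (the textbook loss of Rafailov et al., not DAMO's modified objective) is, for prompt $x$ with preferred response $y_w$ and dispreferred response $y_l$:

```latex
\mathcal{L}_{\mathrm{DPO}}
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)
\right]
```

where $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ a frozen reference model, $\sigma$ the sigmoid, and $\beta$ a temperature. The imbalance described in the abstract arises because this loss applies the same pressure to every preference pair regardless of how hard the pair is to distinguish.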
Paperid:105
Authors:Mingi Jung · Saehyung Lee · Eunji Kim · Sungroh Yoon
Abstract:
Detailed image captioning is essential for tasks like data generation and aiding visually impaired individuals. High-quality captions require a balance between precision and recall, which remains challenging for current multimodal large language models (MLLMs). In this work, we hypothesize that this limitation stems from weakening and increasingly noisy visual attention as responses lengthen. To address this issue, we propose SPARC (Selective Progressive Attention ReCalibration), a training-free method that enhances the contribution of visual tokens during decoding. SPARC is founded on three key observations: (1) increasing the influence of all visual tokens reduces recall; thus, SPARC selectively amplifies visual tokens; (2) as captions lengthen, visual attention becomes noisier, so SPARC identifies critical visual tokens by leveraging attention differences across time steps; (3) as visual attention gradually weakens, SPARC reinforces it to preserve its influence. Our experiments, incorporating both automated and human evaluations, demonstrate that existing methods improve the precision of MLLMs at the cost of recall. In contrast, our proposed method enhances both precision and recall with minimal computational overhead.
Paperid:106
Authors:L. Elisa Celis · Lingxiao Huang · Nisheeth K. Vishnoi
Abstract:
The rapid rise of Generative AI (GenAI) tools has sparked debates about their role in complementing or replacing human workers across job contexts. This paper approaches this question by presenting a mathematical framework for modeling jobs, workers, and metrics to evaluate worker-job fit. A key feature of our framework is its decomposition of skills into decision-level and action-level subskills, reflecting the distinct strengths of humans and AI. We analyze how changes in subskill abilities affect job success, identifying conditions for dramatic shifts in success probability. Furthermore, we establish sufficient conditions under which combining workers with complementary subskills significantly outperforms relying on a single worker. We demonstrate the framework's practicality using data from O*NET and Big-Bench Lite, aligning real-world data with our framework through subskill-division methods. Our results highlight the role of GenAI tools in complementing human workers' skills rather than replacing them.
Paperid:107
Authors:Yixin Liu · Lie Lu · Jihui Jin · Lichao Sun · Andrea Fanelli
Abstract:
The rapid proliferation of generative audio synthesis and editing technologies has raised significant concerns about copyright infringement, data provenance, and the spread of misinformation through deepfake audio. Watermarking offers a proactive solution by embedding imperceptible, identifiable, and traceable marks into audio content. While recent neural-network-based watermarking methods like WavMark and AudioSeal have improved robustness and quality, they struggle to achieve both robust detection and accurate attribution simultaneously. This paper introduces the Cross-Attention Robust Audio Watermark (XAttnMark), which bridges this gap by leveraging partial parameter sharing between the generator and the detector, a cross-attention mechanism for efficient message retrieval, and a temporal conditioning module for improved message distribution. Additionally, we propose a psychoacoustic-aligned temporal-frequency masking loss that captures fine-grained auditory masking effects, enhancing watermark imperceptibility. Our approach achieves state-of-the-art performance in both detection and attribution, demonstrating superior robustness against a wide range of audio transformations, including challenging generative editing with strong editing strength. This work represents a significant step forward in protecting intellectual property and ensuring the authenticity of audio content in the era of generative AI.
Paperid:108
Authors:Sifan Yang · Yuanyu Wan · Peijia Li · Yibo Wang · Xiao Zhang · Zhewei Wei · Lijun Zhang
Abstract:
In this paper, we investigate the acceleration of adaptive subgradient methods through Frequent Directions (FD), a widely-used matrix sketching technique. The state-of-the-art regret bound exhibits a linear dependence on the dimensionality $d$, leading to unsatisfactory guarantees for high-dimensional problems. Additionally, it suffers from an $O(\tau^2 d)$ time complexity per round, which scales quadratically with the sketching size $\tau$. To overcome these issues, we first introduce an algorithm named FTSL, achieving a tighter regret bound that is independent of the dimensionality. The key idea is to integrate FD with adaptive subgradient methods under the primal-dual framework, and add the cumulative discarded information of FD back. To reduce its time complexity, we further utilize fast FD to expedite FTSL, yielding a better complexity of $O(\tau d)$ while maintaining the same regret bound. Moreover, to mitigate the computational cost for optimization problems involving matrix variables (e.g., training neural networks), we adapt FD to Shampoo, a popular optimization algorithm that accounts for the structure of decision, and give a novel analysis under the primal-dual framework. Our proposed method obtains an improved dimension-free regret bound alongside enhanced computational complexity. Experimental results have verified the efficiency and effectiveness of our approaches.
Paperid:109
Authors:Chendi Ge · Xin Wang · Zeyang Zhang · Hong Chen · Jiapei Fan · Longtao Huang · Hui Xue' · Wenwu Zhu
Abstract:
Continual multimodal instruction tuning (CMIT) is crucial for adapting Multimodal Large Language Models (MLLMs) to evolving tasks in real-world scenarios. However, most existing methods adopt a fixed architecture, struggling with adapting to new tasks due to static model capacity. In this paper, we propose evolving the architecture under parameter budgets for dynamic task adaptation in CMIT, a problem which remains unexplored in the literature. The problem presents two main challenges: 1) task architecture conflict, where different tasks require varying layer-wise adaptations, and 2) modality imbalance, where different tasks rely unevenly on modalities, leading to unbalanced updates. To address these challenges, we propose a novel Dynamic Mixture of Curriculum LoRA Experts (D-MoLE) method for CMIT, which automatically evolves the MLLM's architecture with controlled parameter budgets to continually adapt to new tasks while retaining previously learned knowledge. Specifically, we propose a dynamic layer-wise expert allocator, which automatically allocates LoRA experts across layers to resolve architecture conflicts, and routes instructions layer-wise to facilitate knowledge sharing among experts. Then, we propose a gradient-based inter-modal continual curriculum, which adjusts the update ratio of each module in the MLLM based on the difficulty of each modality within the task to alleviate the modality imbalance problem. Extensive experiments show that D-MoLE significantly outperforms state-of-the-art baselines, achieving a 15% average improvement over the best baseline. To the best of our knowledge, this is the first study of continual learning for MLLMs from an architectural perspective.
Paperid:110
Authors:Baijiong Lin · Weisen Jiang · Yuancheng Xu · Hao Chen · YINGCONG CHEN
Abstract:
Multi-objective test-time alignment aims to adapt large language models (LLMs) to diverse multi-dimensional user preferences during inference without retraining. Recently, GenARM (Xu et al., 2025), a test-time alignment method, introduced the Autoregressive Reward Model (ARM) to effectively guide the generation of frozen LLMs during inference. When applied to multi-objective scenarios, GenARM needs multiple separately trained ARMs, each tailored to a specific preference dimension. GenARM faces two key limitations: the need for multiple ARMs increases the inference cost, and the separate training of ARMs leads to misalignment between the guided generation and the user preferences. To address these issues, we propose Preference-aware ARM (PARM), a unified ARM conditioned on user preferences. PARM is implemented by our proposed Preference-Aware Bilinear Low-Rank Adaptation (PBLoRA), which employs a bilinear form to condition the ARM on user preferences, enabling it to dynamically manage trade-offs across multiple preference dimensions during inference. Experiments demonstrate that PARM reduces inference costs and improves alignment performance compared to GenARM. Additionally, PARM enables weak-to-strong guidance, allowing a smaller PARM to guide a larger frozen LLM without expensive training, offering a flexible solution for test-time preference adaptation.
Paperid:111
Authors:Antonio Almudévar · Jose Miguel Hernandez-Lobato · Sameer Khurana · Ricard Marxer · Alfonso Ortega
Abstract:
Contrastive losses have been extensively used as a tool for multimodal representation learning. However, it has been empirically observed that their use is not effective for learning an aligned representation space. In this paper, we argue that this phenomenon is caused by the presence of modality-specific information in the representation space. Although some of the most widely used contrastive losses maximize the mutual information between representations of both modalities, they are not designed to remove the modality-specific information. We give a theoretical description of this problem through the lens of the Information Bottleneck Principle. We also empirically analyze how different hyperparameters affect the emergence of this phenomenon in a controlled experimental setup. Finally, we propose a regularization term in the loss function, derived by means of a variational approximation, that aims to increase the representational alignment. We analyze the advantages of including this regularization term in a set of controlled experiments and real-world applications.
Paperid:112
Authors:Hongyi Liu · Rajarshi Saha · Zhen Jia · Youngsuk Park · Jiaji Huang · Shoham Sabach · Yu-Xiang Wang · George Karypis
Abstract:
Large Language Models (LLMs) have demonstrated exceptional performance in natural language processing tasks, yet their massive size makes serving them inefficient and costly. Semi-structured pruning has emerged as an effective method for model acceleration, but existing approaches are suboptimal because they focus on local, layer-wise optimizations using heuristic rules, failing to leverage global feedback. We present ProxSparse, a learning-based framework for mask selection enabled by regularized optimization. ProxSparse transforms the rigid, non-differentiable mask selection process into a smoother optimization procedure, allowing gradual mask exploration with flexibility. ProxSparse does not involve additional weight updates once the mask is determined. Our extensive evaluations on 7 widely used models show that ProxSparse consistently outperforms previously proposed semi-structured mask selection methods with significant improvement, demonstrating the effectiveness of our learned approach towards semi-structured pruning.
Paperid:113
Authors:Vincent P Grande · Michael Schaub
Abstract:
Topological Data Analysis (TDA) allows us to extract powerful topological and higher-order information on the global shape of a data set or point cloud. Tools like Persistent Homology give a single complex description of the global structure of the point cloud. However, common machine learning applications like classification require point-level information and features. In this paper, we bridge this gap and propose a novel method to extract node-level topological features from complex point clouds using discrete variants of concepts from algebraic topology and differential geometry. We verify the effectiveness of these topological point features (TOPF) on both synthetic and real-world data and study their robustness under noise and heterogeneous sampling.
Paperid:114
Authors:Tejas Jayashankar · Jongha (Jon) Ryu · Gregory Wornell
Abstract:
We propose *ScoreMix*, a novel framework for training one-step generative models by minimizing a class of divergences called the $\alpha$-skew Jensen–Shannon divergence. At its core, ScoreMix estimates the score of mixture distributions between real and fake samples across multiple noise levels. Similar to consistency models, our approach supports both training from scratch and distillation using a pretrained diffusion model. It is simple to implement, requires minimal hyperparameter tuning, and ensures stable training. Experiments on CIFAR-10 and ImageNet 64×64 show that ScoreMix is competitive with and can even outperform existing methods.
Paperid:115
Authors:Liangchen Liu · Nannan Wang · Xi Yang · Xinbo Gao · Tongliang Liu
Abstract:
Prompt learning is a cutting-edge parameter-efficient fine-tuning technique for pre-trained vision-language models (VLMs). Instead of learning a single text prompt, recent works have revealed that learning diverse text prompts can effectively boost the performances on downstream tasks, as the diverse prompted text features can comprehensively depict the visual concepts from different perspectives. However, diverse prompt learning demands enormous computational resources. This efficiency issue still remains unexplored. To achieve efficient and diverse prompt learning, this paper proposes a novel \textbf{Surrogate Prompt Learning (SurPL)} framework. Instead of learning diverse text prompts, SurPL directly generates the desired prompted text features via a lightweight \textbf{Surrogate Feature Generator (SFG)}, thereby avoiding the complex gradient computation procedure of conventional diverse prompt learning. Concretely, based on a basic prompted text feature, SFG can directly and efficiently generate diverse prompted features according to different pre-defined conditional signals. Extensive experiments indicate the effectiveness of the surrogate prompted text features, and show compelling performances and efficiency of SurPL on various benchmarks.
Paperid:116
Authors:Hao Gu · Wei Li · Lujun Li · Qiyuan Zhu · Mark Lee · Shengjie Sun · Wei Xue · Yike Guo
Abstract:
Mixture-of-Experts (MoE) architectures in large language models (LLMs) achieve exceptional performance, but face prohibitive storage and memory requirements. To address these challenges, we present $D^2$-MoE, a new delta decompression compressor for reducing the parameters of MoE LLMs. Based on observations of expert diversity, we decompose their weights into a shared base weight and unique delta weights. Specifically, our method first merges each expert's weight into the base weight using the Fisher information matrix to capture shared components. Then, we compress delta weights through Singular Value Decomposition (SVD) by exploiting their low-rank properties. Finally, we introduce a semi-dynamical structured pruning strategy for the base weights, combining static and dynamic redundancy analysis to achieve further parameter reduction while maintaining input adaptivity. In this way, our $D^2$-MoE successfully compacts MoE LLMs to high compression ratios without additional training. Extensive experiments highlight the superiority of our approach, with over 13\% performance gains over other compressors on Mixtral|Phi-3.5|DeepSeek|Qwen2 MoE LLMs at 40$\sim$60\% compression rates. Codes are available in supplementary materials.
Paperid:117
Authors:Ziru Wang · Mengmeng Wang · Jade Dai · Teli Ma · Guo-Jun Qi · Yong Liu · Guang Dai · Jingdong Wang
Abstract:
Integrating natural language instructions and visual perception with decision-making is a critical challenge for embodied agents. Existing methods often struggle to balance the conciseness of language commands with the richness of video content. To bridge the gap between modalities, we propose extracting key spatiotemporal patterns from video that capture visual saliency and temporal evolution, referred to as dynamic representation. Building on this, we introduce DynaMind, a framework that enhances decision-making through dynamic reasoning. Specifically, we design an adaptive FrameScorer to evaluate video frames based on semantic consistency and visual saliency, assigning each frame an importance score. These scores are used to filter redundant video content and synthesize compact dynamic representations. Leveraging these representations, we predict critical future dynamics and apply a dynamic-guided policy to generate coherent and context-aware actions. Extensive results demonstrate that DynaMind significantly outperforms the baselines across several simulation benchmarks and real-world scenarios.
Paperid:118
Authors:Ofri Eisen · Ron Dorfman · Kfir Levy
Abstract:
Decentralized learning has emerged as a powerful approach for handling large datasets across multiple machines in a communication-efficient manner. However, such methods often face scalability limitations, as increasing the number of machines beyond a certain point negatively impacts convergence rates. In this work, we propose Decentralized Anytime SGD, a novel decentralized learning algorithm that significantly extends the critical parallelism threshold, enabling the effective use of more machines without compromising performance. Within the stochastic convex optimization (SCO) framework, we establish a theoretical upper bound on parallelism that surpasses the current state-of-the-art, allowing larger networks to achieve favorable statistical guarantees and closing the gap with centralized learning in highly connected topologies.
Paperid:119
Authors:Yangsibo Huang · Milad Nasr · Anastasios Angelopoulos · Nicholas Carlini · Wei-Lin Chiang · Christopher A. Choquette Choo · Daphne Ippolito · Matthew Jagielski · Katherine Lee · Ken Ziyu Liu · Ion Stoica · Florian Tramer · Chiyuan Zhang
Abstract:
It is now common to evaluate Large Language Models (LLMs) by having humans manually vote to evaluate model outputs, in contrast to typical benchmarks that evaluate knowledge or skill at some particular task. Chatbot Arena, the most popular benchmark of this type, ranks models by asking users to select the better response between two randomly selected models (without revealing which model was responsible for the generations). These platforms are widely trusted as a fair and accurate measure of LLM capabilities. In this paper, we show that if bot protection and other defenses are not implemented, these voting-based benchmarks are potentially vulnerable to adversarial manipulation. Specifically, we show that an attacker can alter the leaderboard (to promote their favorite model or demote competitors) at the cost of roughly a thousand votes (verified in a simulated, offline version of Chatbot Arena). Our attack consists of two steps: first, we show how an attacker can determine which model was used to generate a given reply with more than $95\%$ accuracy; and then, the attacker can use this information to consistently vote for (or against) a target model. Working with the Chatbot Arena developers, we identify, propose, and implement mitigations to improve the robustness of Chatbot Arena against adversarial manipulation, which, based on our analysis, substantially increases the cost of such attacks. Some of these defenses were present before our collaboration, such as bot protection with Cloudflare, malicious user detection, and rate limiting. Others, including reCAPTCHA and login, are being integrated to strengthen the security in Chatbot Arena.
Paperid:120
Authors:Etowah Adams · Liam Bai · Minji Lee · Yiyang Yu · Mohammed AlQuraishi
Abstract:
Protein language models (pLMs) are powerful predictors of protein structure and function, learning through unsupervised training on millions of protein sequences. pLMs are thought to capture common motifs in protein sequences, but the specifics of pLM features are not well understood. Identifying these features would not only shed light on how pLMs work, but potentially uncover novel protein biology: studying the model to study the biology. Motivated by this, we train sparse autoencoders (SAEs) on the residual stream of a pLM, ESM2. By characterizing SAE features, we determine that pLMs use a combination of generic features and family-specific features to represent a protein. In addition, we demonstrate how known sequence determinants of properties such as thermostability and subcellular localization can be identified by linear probing of SAE features. For predictive features without known functional associations, we hypothesize their role in unknown mechanisms and provide visualization tools to aid their interpretation. Our study gives a better understanding of the limitations of pLMs, and demonstrates how SAE features can be used to help generate hypotheses for biological mechanisms.
Paperid:121
Authors:Zhuoling Li · Xiaogang Xu · Zhenhua Xu · Ser-Nam Lim · Hengshuang Zhao
Abstract:
Recent embodied agents are primarily built based on reinforcement learning (RL) or large language models (LLMs). Among them, RL agents are efficient for deployment but only perform very few tasks. By contrast, giant LLM agents (often more than 1000B parameters) present strong generalization while demanding enormous computing resources. In this work, we combine their advantages while avoiding the drawbacks by conducting the proposed referee RL on our developed large autoregressive model (LARM). Specifically, LARM is built upon a lightweight LLM (fewer than 5B parameters) and directly outputs the next action to execute rather than text. We mathematically reveal that classic RL feedbacks vanish in long-horizon embodied exploration and introduce a giant LLM based referee to handle this reward vanishment during training LARM. In this way, LARM learns to complete diverse open-world tasks without human intervention. Especially, LARM successfully harvests enchanted diamond equipment in Minecraft, which demands significantly longer decision-making chains than the highest achievements of prior best methods.
Paperid:122
Authors:Yuxiao Wen
Abstract:
In combinatorial semi-bandits, a learner repeatedly selects from a combinatorial decision set of arms, receives the realized sum of rewards, and observes the rewards of the individual selected arms as feedback. In this paper, we extend this framework to include \emph{graph feedback}, where the learner observes the rewards of all neighboring arms of the selected arms in a feedback graph $G$. We establish that the optimal regret over a time horizon $T$ scales as $\widetilde{\Theta}(S\sqrt{T}+\sqrt{\alpha ST})$, where $S$ is the size of the combinatorial decisions and $\alpha$ is the independence number of $G$. This result interpolates between the known regrets $\widetilde\Theta(S\sqrt{T})$ under full information (i.e., $G$ is complete) and $\widetilde\Theta(\sqrt{KST})$ under the semi-bandit feedback (i.e., $G$ has only self-loops), where $K$ is the number of actions. A technical ingredient is to realize a convexified action using a random decision vector with negative correlations, a result that may be of independent interest.
Paperid:123
Authors:Fan Wang · Feiyu Jiang · Zifeng Zhao · Yi Yu
Abstract:
Dynamic pricing strategies are crucial for firms to maximize revenue by adjusting prices based on market conditions and customer characteristics. However, designing optimal pricing strategies becomes challenging when historical data are limited, as is often the case when launching new products or entering new markets. One promising approach to overcome this limitation is to leverage information from related products or markets to inform the focal pricing decisions. In this paper, we explore transfer learning for nonparametric contextual dynamic pricing under a covariate shift model, where the marginal distributions of covariates differ between source and target domains while the reward functions remain the same. We propose a novel Transfer Learning for Dynamic Pricing (TLDP) algorithm that can effectively leverage pre-collected data from a source domain to enhance pricing decisions in the target domain. The regret upper bound of TLDP is established under a simple Lipschitz condition on the reward function. To establish the optimality of TLDP, we further derive a matching minimax lower bound, which includes the target-only scenario as a special case and is presented for the first time in the literature. Extensive numerical experiments validate our approach, demonstrating its superiority over existing methods and highlighting its practical utility in real-world applications.
Paperid:124
Authors:Chongyu Fan · jinghan jia · Yihua Zhang · Anil Ramakrishna · Mingyi Hong · Sijia Liu
Abstract:
The LLM unlearning technique has recently been introduced to comply with data regulations and address the safety and ethical concerns of LLMs by removing the undesired data-model influence. However, state-of-the-art unlearning methods face a critical vulnerability: they are susceptible to "relearning" the removed information from a small number of forget data points, known as relearning attacks. In this paper, we systematically investigate how to make unlearned models robust against such attacks. For the first time, we establish a connection between robust unlearning and sharpness-aware minimization (SAM) through a unified robust optimization framework, in an analogy to adversarial training designed to defend against adversarial attacks. Our analysis for SAM reveals that smoothness optimization plays a pivotal role in mitigating relearning attacks. Thus, we further explore diverse smoothing strategies to enhance unlearning robustness. Extensive experiments on benchmark datasets, including WMDP and MUSE, demonstrate that SAM and other smoothness optimization approaches consistently improve the resistance of LLM unlearning to relearning attacks. Notably, smoothness-enhanced unlearning also helps defend against (input-level) jailbreaking attacks, broadening our proposal's impact in robustifying LLM unlearning.
Paperid:125
Authors:Tianyi Zhang · Junda Su · Aditya Desai · Oscar Wu · Zhaozhuo Xu · Anshumali Shrivastava
Abstract:
Adapting pretrained large language models (LLMs) is crucial but challenging due to their enormous size. Parameter-efficient fine-tuning (PEFT) techniques typically employ additive adapters applied to frozen model weights. To further reduce memory usage, model weights can be compressed through quantization. However, existing PEFT methods often yield suboptimal model quality due to restrictive assumptions, such as imposing low-rank constraints on adapters to reduce trainable parameters. We find that sketching, a popular data compression technique, can serve as an efficient adaptation strategy for LLMs while avoiding low-rank assumptions. We introduce SketchTune, a compressive adaptation strategy that compresses LLM weights into compact fine-tunable sketches, integrating compression and adaptation into a unified framework. This integration eliminates the need for complex two-path computation common in existing PEFT techniques, enabling faster and more memory-efficient training and inference. SketchTune is supported by mathematical insights into matrix classes that are better approximated using sketching rather than low-rank methods. Our rigorous evaluations with Llama-1/2/3 models demonstrate that SketchTune outperforms leading PEFT methods across diverse tasks including math problem-solving, common sense reasoning, and instruction following, while using substantially smaller base models and comparable trainable parameters. As a highlight, SketchTune outperforms LoRA, DoRA, and S2FT on commonsense and math benchmarks using 2.6-3.5$\times$ smaller base models and exceeds LoftQ in accuracy by 14.48\% on GSM8K with 7.3$\times$ fewer trainable parameters.
Paperid:126
Authors:Rujikorn Charakorn · Edoardo Cetin · Yujin Tang · Robert Lange
Abstract:
While Foundation Models provide a general tool for rapid content creation, they regularly require task-specific adaptation. Traditionally, this exercise involves careful curation of datasets and repeated fine-tuning of the underlying model. Fine-tuning techniques enable practitioners to adapt foundation models for many new applications but require expensive and lengthy training while being notably sensitive to hyper-parameter choices. To overcome these limitations, we introduce Text-to-LoRA (T2L), a model capable of adapting Large Language Models on the fly solely based on a natural language description of the target task. T2L is a hypernetwork trained to construct LoRAs in a single inexpensive forward pass. After training T2L on a suite of 9 pre-trained LoRA adapters (GSM8K, Arc, etc.), we show that the ad-hoc reconstructed LoRA instances match the performance of task-specific adapters across the corresponding test sets. Furthermore, T2L can compress hundreds of LoRA instances and zero-shot generalize to entirely unseen tasks. This approach provides a significant step towards democratizing the specialization of foundation models and enables language-based adaptation with minimal compute requirements. Our code and pre-trained checkpoints will be available upon publication.
Paperid:127
Authors:Kefan Dong · Tengyu Ma
Abstract:
A fundamental challenge in formal theorem proving by LLMs is the lack of high-quality training data. Although reinforcement learning or expert iteration partially mitigates this issue by alternating between LLM generating proofs and fine-tuning them on correctly generated ones, performance quickly plateaus due to the scarcity of correct proofs (sparse rewards). To keep improving the models with limited data, we draw inspiration from mathematicians, who continuously develop new results, partly by proposing novel conjectures or exercises (which are often variants of known results) and attempting to solve them. We design the Self-play Theorem Prover (STP) that simultaneously takes on two roles, conjecturer and prover, each providing training signals to the other. The conjecturer is trained iteratively on previously generated conjectures that are barely provable by the current prover, which incentivizes it to generate increasingly challenging conjectures over time. The prover attempts to prove the conjectures with standard expert iteration. We evaluate STP with both Lean and Isabelle formal verifiers. With 19.8 billion tokens generated during the training in Lean, STP proves 26.3% of the statements in the LeanWorkbook dataset, doubling the previous best result of 13.2% achieved through expert iteration. The final model achieves state-of-the-art performance among whole-proof generation methods on miniF2F-test (61.1%, pass@3200), Proofnet-test (23.1%, pass@3200) and PutnamBench (8/644, pass@64).
Paperid:128
Authors:Zhehan Zhao · Lu Bai · Lixin Cui · Ming Li · Ziyu Lyu · Lixiang Xu · Yue Wang · Edwin Hancock
Abstract:
Graph Neural Networks (GNNs) are powerful tools for graph learning, and one key operation is downsampling or pooling, which helps learn effective graph representations from node embeddings. In this paper, we propose a novel graph pooling operation, namely the Edge-Node Attention-based Hierarchical Pooling (ENAHPool), for GNNs to learn graph representations. Unlike previous cluster-based hierarchical pooling methods that rely on ambiguous node assignments and uniform aggregation of edge-node structural information, ENAHPool assigns each node entirely to a cluster. It then incorporates attention mechanisms to perform weighted aggregation, combining both node features within clusters and edge connectivity strengths between clusters at each pooling step. Additionally, since pooling operations are defined in conjunction with Message-Passing Neural Networks (MPNNs), we propose a Multi-Distance MPNN (MD-MPNN) architecture that enables nodes to directly receive feature information from neighbors at various distances, effectively mitigating the notorious over-squashing issue found in conventional MPNNs. Furthermore, it harnesses the valuable information implicit in edge connectivity strengths, further enhancing the performance of ENAHPool. Experiments demonstrate the effectiveness of the proposed ENAHPool operation associated with the MD-MPNN architecture.
Paperid:129
Authors:Ziyue Qiao · Qianyi Cai · Hao Dong · Jiawei Gu · Pengyang Wang · Meng Xiao · Xiao Luo · Hui Xiong
Abstract:
This paper addresses the challenge of graph domain adaptation on evolving, multiple out-of-distribution (OOD) graphs. Conventional graph domain adaptation methods are confined to single-step adaptation, making them ineffective in handling continuous domain shifts and prone to catastrophic forgetting. This paper introduces the Graph Continual Adaptive Learning (GCAL) method, designed to enhance model sustainability and adaptability across various graph domains. GCAL employs a bilevel optimization strategy. The "adapt" phase uses an information maximization approach to fine-tune the model with new graph domains while re-adapting past memories to mitigate forgetting. Concurrently, the "generate memory" phase, guided by a theoretical lower bound derived from information bottleneck theory, involves a variational memory graph generation module to condense original graphs into memories. Extensive experimental evaluations demonstrate that GCAL substantially outperforms existing methods in terms of adaptability and knowledge retention.
Paperid:130
Authors:Hongyao Chen · Tianyang Xu · Xiaojun Wu · Josef Kittler
Abstract:
Batch Normalisation (BN) is widely used in conventional deep neural network training to harmonise the input-output distributions for each batch of data. However, federated learning, a distributed learning paradigm, faces the challenge of dealing with non-independent and identically distributed data among the client nodes. Due to the lack of a coherent methodology for updating BN statistical parameters, standard BN degrades the federated learning performance. To this end, it is urgent to explore an alternative normalisation solution for federated learning. In this work, we resolve the dilemma of the BN layer in federated learning by developing a customised normalisation approach, Hybrid Batch Normalisation (HBN). HBN separates the update of statistical parameters (i.e., means and variances used for evaluation) from that of learnable parameters (i.e., parameters that require gradient updates), obtaining unbiased estimates of global statistical parameters in distributed scenarios. In contrast with the existing solutions, we emphasise the supportive power of global statistics for federated learning. The HBN layer introduces a learnable hybrid distribution factor, allowing each computing node to adaptively mix the statistical parameters of the current batch with the global statistics. Our HBN can serve as a powerful plugin to advance federated learning performance. It reflects promising merits across a wide range of federated learning settings, especially for small batch sizes and heterogeneous data.
Paperid:131
Authors:Jaehyun Kwak · Izaaz Inhar · Se-Young Yun · Sung-Ju Lee
Abstract:
Composed Image Retrieval (CIR) retrieves relevant images based on a reference image and accompanying text describing desired modifications. However, existing CIR methods only focus on retrieving the target image and disregard the relevance of other images. This limitation arises because most methods employ contrastive learning, which treats the target image as positive and all other images in the batch as negatives, and can inadvertently include false negatives. This may result in retrieving irrelevant images, reducing user satisfaction even when the target image is retrieved. To address this issue, we propose Query-Relevant Retrieval through Hard Negative Sampling (QuRe), which optimizes a reward model objective to reduce false negatives. Additionally, we introduce a hard negative sampling strategy that selects images positioned between two steep drops in relevance scores following the target image, to effectively filter false negatives. In order to evaluate CIR models on their alignment with human satisfaction, we create Human-Preference FashionIQ (HP-FashionIQ), a new dataset that explicitly captures user preferences beyond target retrieval. Extensive experiments demonstrate that QuRe achieves state-of-the-art performance on FashionIQ and CIRR datasets while exhibiting the strongest alignment with human preferences on the HP-FashionIQ dataset.
Paperid:132
Authors:Yi-Rui Yang · Chang-Wei Shi · Wu-Jun Li
Abstract:
Byzantine-robust distributed learning (BRDL), which refers to distributed learning with potential faulty workers (also known as Byzantine workers), has recently attracted much research attention. Robust aggregators are widely used in existing BRDL methods to obtain robustness against Byzantine workers. However, Byzantine workers do not always exist in applications. As far as we know, there is almost no existing work theoretically investigating the effect of using robust aggregators when there is no Byzantine worker. In view of this challenge, we theoretically analyze the aggregation error for robust aggregators when there is no Byzantine worker. Specifically, we show that the worst-case aggregation error without attack increases with the increase of the number of Byzantine workers that a robust aggregator can tolerate. The theoretical result reveals the tension between Byzantine robustness and no-attack accuracy. Furthermore, we provide lower bounds for the convergence rate of gradient descent with robust aggregators for non-convex objective functions and objective functions that satisfy the Polyak-Lojasiewicz (PL) condition, respectively. We also prove the tightness of the lower bounds. The lower bounds for convergence rate reveal similar tension between Byzantine robustness and no-attack accuracy. Empirical results further support our theoretical findings.
Paperid:133
Authors:TianChi Xing · Bonan Li · Congying Han · XINMIN QIU · Zicheng Zhang · Tiande Guo
Abstract:
Multi-view Diffusion has greatly advanced the development of 3D content creation by generating multiple images from distinct views, achieving remarkable photorealistic results. However, existing works are still vulnerable to inconsistent 3D geometric structures (commonly known as the Janus Problem) and severe artifacts. In this paper, we introduce MIRROR, a versatile plug-and-play method that rectifies such inconsistencies in a training-free manner, enabling the acquisition of high-fidelity, realistic structures without compromising diversity. Our key idea focuses on tracing the motion trajectory of physical points across adjacent viewpoints, enabling rectifications based on neighboring observations of the same region. Technically, MIRROR comprises two core modules: a Trajectory Tracking Module (TTM) for pixel-wise trajectory tracking that labels identical points across views, and a Feature Rectification Module (FRM) for explicit adjustment of each pixel embedding on noisy synthesized images by minimizing the distance to corresponding block features in neighboring views, thereby achieving consistent outputs. Extensive evaluations demonstrate that MIRROR can seamlessly integrate with a diverse range of off-the-shelf multi-view diffusion models, significantly enhancing both consistency and fidelity in an efficient way. Code will be released upon acceptance.
Paperid:134
Authors:Maxime Burchi · Radu Timofte
Abstract:
The Dreamer algorithm has recently obtained remarkable performance across diverse environment domains by training powerful agents with simulated trajectories. However, the compressed nature of its world model's latent space can result in the loss of crucial information, negatively affecting the agent's performance. Recent approaches, such as $\Delta$IRIS, address this limitation by training more accurate world models. However, these methods require training agents directly from pixels, which reduces training efficiency and prevents the agent from benefiting from the inner representations learned by the world model. In this work, we propose an alternative approach to world modeling that is both accurate and efficient. We introduce EMERALD (Efficient MaskEd latent tRAnsformer worLD model), a world model using a spatial latent state with MaskGIT predictions to generate accurate trajectories in latent space and improve agent performance. On the Crafter benchmark, EMERALD achieves new state-of-the-art performance, becoming the first method to surpass human expert performance within 10M environment steps. Our method also succeeds in unlocking all 22 Crafter achievements at least once during evaluation.
Paperid:135
Authors:Chenghua Liu · Minbo Gao · Zhengfeng Ji · Ying
Abstract:
Graph sparsification serves as a foundation for many algorithms, such as approximation algorithms for graph cuts and Laplacian system solvers. As its natural generalization, hypergraph sparsification has recently gained increasing attention, with broad applications in graph machine learning and other areas. In this work, we propose the first quantum algorithm for hypergraph sparsification, addressing an open problem posed by Apers and de Wolf (FOCS'20). For a weighted hypergraph with $n$ vertices, $m$ hyperedges, and rank $r$, our algorithm outputs a near-linear size $\varepsilon$-spectral sparsifier in time $\widetilde O(r\sqrt{mn}/\varepsilon)$. This matches the quantum lower bound for constant $r$ and demonstrates quantum speedup when compared with the state-of-the-art $\widetilde O(mr)$-time classical algorithm. As applications, our algorithm implies quantum speedups for computing hypergraph cut sparsifiers and for approximating hypergraph mincuts and hypergraph $s$-$t$ mincuts.
Paperid:136
Authors:Pratinav Seth · Michelle Lin · BREFO YAW · Jade Boutot · Mary Kang · David Rolnick
Abstract:
Millions of abandoned oil and gas wells are scattered across the world, leaching methane into the atmosphere and toxic compounds into the groundwater. Many of these locations are unknown, preventing the wells from being plugged and their polluting effects averted. Remote sensing is a relatively unexplored tool for pinpointing abandoned wells at scale. We introduce the first large-scale benchmark dataset for this problem, leveraging high-resolution multi-spectral satellite imagery from Planet Labs. Our curated dataset comprises over 213,000 wells (abandoned, suspended, and active) from Alberta, a region with especially high well density, sourced from the Alberta Energy Regulator and verified by domain experts. We evaluate baseline algorithms for well detection and segmentation, showing the promise of computer vision approaches but also significant room for improvement.
Paperid:137
Authors:Mengxiao Zhang · Yingfei Wang · Haipeng Luo
Abstract:
A recent work by Schlisselberg et al. (2024) studies a delay-as-payoff model for stochastic multi-armed bandits, where the payoff (either loss or reward) is delayed for a period that is proportional to the payoff. While this captures many real-world applications, the simple multi-armed bandit setting limits the practicality of their results. In this paper, we address this limitation by studying the delay-as-payoff model for contextual linear bandits. Specifically, we start from the case with a fixed action set and propose an efficient algorithm whose regret overhead compared to the standard no-delay case is only of order $D\Delta_{\max}\log T$, where $T$ is the total horizon, $D$ is the maximum delay, and $\Delta_{\max}$ is the maximum suboptimality gap. When the payoff is a loss, we show a further improvement of the bound, demonstrating a separation between reward and loss similar to Schlisselberg et al. (2024). Contrary to standard linear bandit algorithms that construct a least-squares estimator and confidence ellipsoid, the main novelty of our algorithm is to apply a phased arm elimination procedure by only picking the **volumetric spanners** of the action set, which addresses challenges arising from both payoff-dependent delays and large action sets. We further extend our results to the case with varying action sets by adopting the reduction from Hanna et al. (2023). Finally, we implement our algorithm and showcase its effectiveness and superior performance in experiments.
Paperid:138
Authors:Fabrizio Boninsegna · Jacob Imola · Anders Aamand · Abigail Gentle · Rasmus Pagh
Abstract:
Distributed data analysis is a large and growing field, driven by a massive proliferation of user devices and by privacy concerns surrounding the centralised storage of data. We consider two \emph{adaptive} algorithms for estimating one quantile (e.g.~the median) when each user holds a single data point lying in a domain $[B]$ that can be queried once through a private mechanism: one under local differential privacy (LDP) and another under shuffle differential privacy (shuffle DP). In the adaptive setting we present an $\varepsilon$-LDP algorithm which can estimate any quantile within error $\alpha$ using only $O(\frac{\log B}{\varepsilon^2\alpha^2})$ users, and an $(\varepsilon,\delta)$-shuffle DP algorithm requiring only $\widetilde{O}((\frac{1}{\varepsilon^2}+\frac{1}{\alpha^2})\log B)$ users. Prior (non-adaptive) algorithms require more users by several logarithmic factors in $B$. We further provide a matching lower bound for adaptive protocols, showing that our LDP algorithm is optimal in the low-$\varepsilon$ regime. Additionally, we establish lower bounds against non-adaptive protocols which, paired with our understanding of the adaptive case, prove a fundamental separation between these models.
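For intuition only, here is a minimal sketch (our own construction, not the paper's protocol or its user-complexity guarantees) of the adaptive idea in the local model: binary-search the domain, spending a fresh batch of users per round, with each user answering a single threshold query through randomised response:

```python
import math
import random

def rr_bit(bit, epsilon):
    """epsilon-LDP randomised response: keep the bit w.p. e^eps / (e^eps + 1)."""
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if random.random() < p_keep else 1 - bit

def ldp_median(data, B, epsilon, users_per_round=500):
    """Adaptive binary search for the median over the domain [0, B).

    Each user is consumed once and answers one threshold query
    ("is your value <= mid?") through randomised response.
    """
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    users = iter(data)
    lo, hi = 0, B
    while hi - lo > 1:
        mid = (lo + hi) // 2
        noisy = [rr_bit(1 if next(users) <= mid else 0, epsilon)
                 for _ in range(users_per_round)]
        # debias the randomised responses to estimate Pr[value <= mid]
        est = (sum(noisy) / users_per_round - (1.0 - p)) / (2.0 * p - 1.0)
        if est >= 0.5:
            hi = mid
        else:
            lo = mid
    return hi
```

Adaptivity is what keeps the user cost down here: each round only needs enough users to resolve one noisy comparison, rather than estimating the whole histogram at once.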
Paperid:139
Authors:Daoyuan Chen · Haibin Wang · Yilun Huang · Ce Ge · Yaliang Li · Bolin Ding · Jingren Zhou
Abstract:
The emergence of multimodal large models has advanced artificial intelligence, introducing unprecedented levels of performance and functionality. However, optimizing these models remains challenging due to historically isolated paths of model-centric and data-centric developments, leading to suboptimal outcomes and inefficient resource utilization. In response, we present a new sandbox suite tailored for integrated data-model co-development. This sandbox provides a feedback-driven experimental platform, enabling cost-effective iteration and guided refinement of both data and models. Our proposed ``Probe-Analyze-Refine'' workflow, validated through practical use cases on multimodal tasks such as image-text pre-training with CLIP, image-to-text generation with LLaVA-like models, and text-to-video generation with DiT-based models, yields transferable and notable performance boosts, such as topping the VBench leaderboard. Extensive experiments also uncover fruitful insights into the interplay between data quality, diversity, model behavior, and computational costs. All codes, datasets, and models are open-sourced to foster future research and applications that would otherwise be infeasible due to the lack of a dedicated co-development infrastructure.
Paperid:140
Authors:Mingzhe Yang · Sihao Lin · Changlin Li · Xiaojun Chang
Abstract:
Large language models (LLMs) have revolutionized various AI applications. However, their billions of parameters pose significant challenges for practical deployment. Structured pruning is a hardware-friendly compression technique that has received widespread attention. Nonetheless, existing literature typically targets a single structure of LLMs. We observe that different structure units within the LLM contribute differently to inference cost and functionality. Therefore, pruning a single structure unit in isolation often results in an imbalance between the LLM's performance and efficiency after pruning. In addition, previous works mainly employ a prescribed pruning ratio. Since the significance of LLM modules may vary, it is ideal to distribute the pruning load to each structure unit according to its role within the LLM. To address these two issues, we propose a pruning method that targets multiple LLM modules with dynamic pruning ratios. Specifically, we find that the intrinsic structure of the LLM can guide us in determining the importance of each module and thus distributing the pruning load on demand, i.e., what to prune and how much to prune. This is achieved by quantifying the complex interactions within the LLM. Through extensive experimentation on different language benchmarks and LLMs, we demonstrate that our method effectively resolves the trade-off between efficiency and performance.
Paperid:141
Authors:Angelica Chen · Samuel Stanton · Frances Ding · Robert Alberstein · Andrew Watkins · Richard Bonneau · Vladimir Gligorijevic · Kyunghyun Cho · Nathan Frey
Abstract:
Although large language models (LLMs) have shown promise in biomolecule optimization problems, they incur heavy computational costs and struggle to satisfy precise constraints. On the other hand, specialized solvers like LaMBO-2 offer efficiency and fine-grained control but require more domain expertise. Comparing these approaches is challenging due to expensive laboratory validation and inadequate synthetic benchmarks. We address this by introducing Ehrlich functions, a synthetic test suite that captures the geometric structure of biophysical sequence optimization problems. With prompting alone, off-the-shelf LLMs struggle to optimize Ehrlich functions. In response, we propose \textsc{LLOME} (Language Model Optimization with Margin Expectation), a bilevel optimization routine for online black-box optimization. When combined with a novel preference learning loss, we find \textsc{LLOME} can not only learn to solve some Ehrlich functions, but can even outperform \textsc{LaMBO-2} on moderately difficult Ehrlich variants. However, \textsc{LLOME} is only comparable to \textsc{LaMBO-2} on very easy or difficult variants, exhibits some likelihood-reward miscalibration, and struggles without explicit rewards. Our results indicate LLMs can provide significant benefits in some cases, but specialized solvers are still competitive and incur less overhead.
Paperid:142
Authors:Xing Li · Zeyu Xing · Yiming Li · Linping Qu · Huiling Zhen · Yiwu Yao · Wulong Liu · Sinno Jialin Pan · Mingxuan Yuan
Abstract:
KV cache quantization can improve Large Language Model (LLM) inference throughput and latency in long-context and large batch-size scenarios while preserving LLM effectiveness. However, current methods have three unsolved issues: overlooking layer-wise sensitivity to KV cache quantization, high overhead of online fine-grained decision-making, and low flexibility to different LLMs and constraints. Therefore, we thoroughly analyze the inherent correlation of layer-wise transformer attention patterns to KV cache quantization errors and study why the key cache is more important than the value cache for quantization error reduction. We further propose a simple yet effective framework, KVTuner, to adaptively search for the optimal hardware-friendly layer-wise KV quantization precision pairs for coarse-grained KV cache with multi-objective optimization, and directly utilize the offline-searched configurations during online inference. To reduce the computational cost of offline calibration, we utilize intra-layer KV precision pair pruning and inter-layer clustering to reduce the search space. Experimental results show that we can achieve nearly lossless 3.25-bit mixed precision KV cache quantization for LLMs like Llama-3.1-8B-Instruct and 5.0-bit for sensitive models like Qwen2.5-7B-Instruct on mathematical reasoning tasks. The maximum inference throughput can be improved by 38.3% compared with KV8 quantization over various context lengths.
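The bit-width versus error trade-off that a layer-wise precision search exploits can be seen with plain uniform quantisation (a generic sketch; KVTuner's actual quantiser and search procedure are more involved):

```python
import numpy as np

def quantize(x, bits):
    """Uniform per-tensor quantisation to the given bit width, then dequantise."""
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    return np.round((x - lo) / scale) * scale + lo

rng = np.random.default_rng(0)
k = rng.normal(size=(8, 64))     # stand-in for one layer's key cache
# a coarse layer-wise "precision pair" might, e.g., give keys more bits than values
err4 = np.abs(quantize(k, 4) - k).mean()
err2 = np.abs(quantize(k, 2) - k).mean()
```

A precision search then amounts to choosing, per layer, the smallest bit widths whose resulting error the attention output can tolerate.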
Paperid:143
Authors:Hailong Jiang · Jianfeng Zhu · Yao Wan · Bo Fang · Hongyu Zhang · Ruoming Jin · Qiang Guan
Abstract:
Intermediate Representations (IRs) are essential in compiler design and program analysis, yet their comprehension by Large Language Models (LLMs) remains underexplored. This paper presents a pioneering empirical study investigating the capabilities of LLMs, including GPT-4, GPT-3, Gemma 2, LLaMA 3.1, and Code Llama, in understanding IRs. We analyze their performance across four tasks: Control Flow Graph (CFG) reconstruction, decompilation, code summarization, and execution reasoning. Our results indicate that while LLMs demonstrate competence in parsing IR syntax and recognizing high-level structures, they struggle with control flow reasoning, execution semantics, and loop handling. Specifically, they often misinterpret branching instructions, omit critical IR operations, and rely on heuristic-based reasoning, leading to errors in CFG reconstruction, IR decompilation, and execution reasoning. The study underscores the necessity for IR-specific enhancements in LLMs, recommending fine-tuning on structured IR datasets and integration of explicit control flow models to augment their comprehension and handling of IR-related tasks.
Paperid:144
Authors:Haoyang Luo · Linwei Tao · Minjing Dong · Chang Xu
Abstract:
Model calibration seeks to ensure that models produce confidence scores that accurately reflect the true likelihood of their predictions being correct. However, existing calibration approaches are fundamentally tied to datasets of one-hot labels, implicitly assuming full certainty in all the annotations. Such datasets are effective for classification but provide insufficient knowledge of uncertainty for model calibration, necessitating the curation of datasets with numerically rich ground-truth confidence values. However, due to the scarcity of uncertain visual examples, such samples are not easily available in real datasets. In this paper, we introduce calibration-aware data augmentation to create synthetic datasets of diverse samples and their ground-truth uncertainty. Specifically, we present Calibration-aware Semantic Mixing (CSM), a novel framework that generates training samples with mixed class characteristics and annotates them with distinct confidence scores via diffusion models. Based on this framework, we propose calibrated reannotation to tackle the misalignment between the annotated confidence score and the mixing ratio during the diffusion reverse process. Besides, we explore loss functions that better fit the new data representation paradigm. Experimental results demonstrate that CSM achieves superior calibration compared to state-of-the-art calibration approaches.
Paperid:145
Authors:Kunal Jha · Wilka Carvalho · Yancheng Liang · Simon Du · Max Kleiman-Weiner · Natasha Jaques
Abstract:
Zero-shot coordination (ZSC), the ability to adapt to a new partner in a cooperative task, is a critical component of human-compatible AI. While prior work has focused on training agents to cooperate on a single task, these specialized models do not generalize to new tasks, even if they are highly similar. Here, we study how reinforcement learning on a distribution of environments with a single partner enables learning general cooperative skills that support ZSC with many new partners on many new problems. We introduce two Jax-based, procedural generators that create billions of solvable coordination challenges. We develop a new paradigm called Cross-Environment Cooperation (CEC), and show that it outperforms competitive baselines quantitatively and qualitatively when collaborating with real people. Our findings suggest that learning to collaborate across many unique scenarios encourages agents to develop general norms, which prove effective for collaboration with different partners. Together, our results suggest a new route toward designing generalist cooperative agents capable of interacting with humans without requiring human data.
Paperid:146
Authors:Hiroki Yanagisawa · Shunta Akiyama
Abstract:
This paper introduces a novel framework for survival analysis by reinterpreting it as a form of density estimation. Our algorithm postprocesses density estimation outputs to derive survival functions, enabling the application of any density estimation model to effectively estimate survival functions. This approach broadens the toolkit for survival analysis and enhances the flexibility and applicability of existing techniques. Our framework is versatile enough to handle various survival analysis scenarios, including competing risk models for multiple event types. It can also address dependent censoring when prior knowledge of the dependency between event time and censoring time is available in the form of a copula. In the absence of such information, our framework can estimate the upper and lower bounds of survival functions, accounting for the associated uncertainty.
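The reduction is direct to state: given an estimated density $f$ of event times, the survival function is $S(t) = 1 - \int_0^t f(u)\,du$. A minimal numerical sketch of this post-processing step (our illustration; any density estimator can stand in for the exponential density used here):

```python
import numpy as np

def survival_from_density(density, t_grid):
    """Post-process a density estimate over event times into a survival function:
    S(t) = 1 - CDF(t), with the CDF obtained by numerical integration."""
    f = density(t_grid)
    dt = np.diff(t_grid, prepend=t_grid[0])
    cdf = np.cumsum(f * dt)
    return 1.0 - np.clip(cdf, 0.0, 1.0)

# exponential event times with rate 1: S(t) should be close to exp(-t)
t = np.linspace(0.0, 10.0, 2001)
S = survival_from_density(lambda u: np.exp(-u), t)
```

Because only the density estimate is consumed, the same wrapper applies to any fitted density model, which is what lets the framework reuse the full density estimation toolkit.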
Paperid:147
Authors:Benjamin Ruben · William Tong · Hamza Chaudhry · Cengiz Pehlevan
Abstract:
Given a fixed budget for total model size, one must choose between training a single large model or combining the predictions of multiple smaller models. We investigate this trade-off for ensembles of random-feature ridge regression models in both the overparameterized and underparameterized regimes. Using asymptotically sharp risk estimates, we prove that when a fixed number of parameters is distributed among $K$ independently trained models, the ridge-optimized test risk increases with $K$. Consequently, a single large model achieves optimal performance. We then ask when *near*-optimal performance can be achieved by ensembles. In the overparameterized regime, we show that, to leading order, the test error depends on ensemble size and model size only through the total feature count, so that overparameterized ensembles consistently achieve near-optimal performance. To understand underparameterized ensembles, we derive scaling laws for the test risk as a function of total parameter count when the ensemble size and parameters per ensemble member are jointly scaled according to a ``growth exponent'' $\ell$. While optimal error scaling is always achieved by increasing model size with a fixed ensemble size, our analysis identifies conditions on the kernel and task eigenstructure under which near-optimal scaling laws can be obtained by joint scaling of ensemble size and model size.
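A toy version of the budget split makes the comparison concrete (our own construction, with arbitrary choices of feature map, task, and regulariser): one model with $N$ random features versus $K$ models with $N/K$ features each, under the same total parameter count:

```python
import numpy as np

def rf_ridge_fit(X, y, n_features, lam, rng):
    """Fit one random-feature ridge regression model."""
    W = rng.normal(size=(X.shape[1], n_features))   # random feature weights
    Phi = np.tanh(X @ W)                            # random feature map
    beta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_features), Phi.T @ y)
    return W, beta

def ensemble_predict(models, X):
    """Average the predictions of the ensemble members."""
    return np.mean([np.tanh(X @ W) @ beta for W, beta in models], axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0])
# same total feature budget: one 64-feature model vs K = 4 models of 16 features
big = [rf_ridge_fit(X, y, 64, 1e-2, rng)]
small = [rf_ridge_fit(X, y, 16, 1e-2, rng) for _ in range(4)]
mse_big = np.mean((ensemble_predict(big, X) - y) ** 2)
mse_small = np.mean((ensemble_predict(small, X) - y) ** 2)
```

This toy only compares training fits; the paper's result concerns the ridge-optimized test risk, analysed with asymptotically sharp estimates rather than a single draw.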
Paperid:148
Authors:Samir Khaki · Xiuyu Li · Junxian Guo · Ligeng Zhu · Konstantinos N (Kostas) Plataniotis · Amir Yazdanbakhsh · Kurt Keutzer · Song Han · Zhijian Liu
Abstract:
Fine-tuning LLMs is both computationally and memory-intensive. While parameter-efficient fine-tuning methods, such as DoRA, reduce the number of trainable parameters and lower memory usage, they do not decrease computational cost. In some cases, they even slow down fine-tuning. In this paper, we introduce *SparseLoRA*, a method that accelerates LLM fine-tuning through contextual sparsity. We propose a lightweight, training-free *SVD sparsity estimator* that dynamically selects a sparse subset of weights for loss and gradient computation. We also systematically analyze and address sensitivity across layers, tokens, and training steps. Our experimental results show that SparseLoRA reduces computational cost by up to **1.7$\times$** and delivers a measured speedup of up to **1.4$\times$** while maintaining accuracy across various downstream tasks, including commonsense and arithmetic reasoning. We will release our code to encourage further research into fine-tuning methods that are both parameter- and computation-efficient.
Paperid:149
Authors:Qingqing Cao · Mahyar Najibi · Sachin Mehta
Abstract:
Pretraining strong vision or multimodal foundation models like CLIP relies on large-scale datasets (e.g., image-text pairs) that may be noisy, potentially misaligned, and long-tailed. Previous work has shown promising results in augmenting datasets by generating synthetic samples. However, such methods only support domain-specific, ad hoc use cases (such as for images or text alone) and offer limited data diversity due to a lack of fine-grained control over the synthesis process. We design a controllable image-text synthesis pipeline, CtrlSynth, to enable data-efficient multimodal learning and improve vision and multimodal models in various use cases. The key idea is to decompose the visual semantics of an image into basic elements, apply user-specified control policies (e.g., remove, add, or replace operations), and recompose them to synthesize images or texts. The decompose-and-recompose feature in CtrlSynth allows users to control data synthesis in a fine-grained manner by defining customized control policies to manipulate the basic elements. CtrlSynth leverages the capabilities of pretrained foundation models, such as large language models (LLMs) and diffusion models (DMs), to reason about and recompose basic elements such that synthetic samples are natural and composed in diverse ways. The CtrlSynth pipeline is training-free and has a modular design, making it easy to support different pretrained models. It is also closed-loop, meaning it can synthesize text data based on an image or vice versa. Our evaluation shows that CtrlSynth samples substantially improve the zero-shot classification, image-text retrieval, and compositional reasoning performance of CLIP models. We will publicly release the code and pipeline for future research.
Paperid:150
Authors:Tianhao Wu · Janice Lan · Weizhe Yuan · Jiantao Jiao · JASON WESTON · Sainbayar Sukhbaatar
Abstract:
LLMs are typically trained to answer user questions or follow instructions in the way human experts respond. However, in the standard alignment framework they lack the basic ability to think explicitly before answering. Thinking is important for complex questions that require reasoning and planning, but can be applied to any task. We propose a training method for equipping existing LLMs with such thinking abilities for general instruction following without the use of additional human data. We achieve this through an iterative search and optimization procedure that explores the space of possible thought generations, allowing the model to learn how to think without direct supervision. For each instruction, the thought candidates are scored using a judge model that evaluates their responses only, and then optimized via preference optimization. We show that this procedure leads to superior performance on AlpacaEval and Arena-Hard, and yields gains from thinking on non-reasoning categories such as marketing, health, and general knowledge, in addition to more traditional reasoning and problem-solving tasks.
Paperid:151
Authors:Wa-Kin Lei · Jun-Cheng Chen · Shang-Tse Chen
Abstract:
With the rise of large foundation models, split inference (SI) has emerged as a popular computational paradigm for deploying models across lightweight edge devices and cloud servers, addressing data privacy and computational cost concerns. However, most existing data reconstruction attacks have focused on smaller CNN classification models, leaving the privacy risks of foundation models in SI settings largely unexplored. To address this gap, we propose a novel data reconstruction attack based on guided diffusion, which leverages the rich prior knowledge embedded in a latent diffusion model (LDM) pretrained on a large-scale dataset. Our method performs iterative reconstruction on the LDM's learned image prior, effectively generating high-fidelity images resembling the original data from their intermediate representations (IRs). Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods, both qualitatively and quantitatively, in reconstructing data from deep-layer IRs of the vision foundation model. The results highlight the urgent need for more robust privacy protection mechanisms for large models in SI scenarios.
Paperid:152
Authors:Mingyu Kang · Yong Suk Choi
Abstract:
Text-to-image diffusion models have achieved remarkable success in generating high-quality and diverse images. Building on these advancements, diffusion models have also demonstrated exceptional performance in text-guided image editing. A key strategy for effective image editing involves inverting the source image into editable noise maps associated with the target image. However, previous inversion methods face challenges in adhering closely to the target text prompt. The limitation arises because inverted noise maps, while enabling faithful reconstruction of the source image, restrict the flexibility needed for desired edits. To overcome this issue, we propose Editable Noise Map Inversion (ENM Inversion), a novel inversion technique that searches for optimal noise maps to ensure both content preservation and editability. We analyze the properties of noise maps for enhanced editability. Based on this analysis, our method introduces an editable noise refinement that aligns with the desired edits by minimizing the difference between the reconstructed and edited noise maps. Extensive experiments demonstrate that ENM Inversion outperforms existing approaches across a wide range of image editing tasks in both preservation and edit fidelity with target prompts. Our approach can also be easily applied to video editing, enabling temporal consistency and content manipulation across frames.
Paperid:153
Authors:Yiğit Narter · Alihan Hüyük · M van der Schaar · Cem Tekin
Abstract:
Current screening programs that focus on improving patient health while minimizing screening costs are tailored for individual diseases. Designing unified screening programs for multiple diseases requires carefully balancing competing disease risks, which is an open problem. In this work, we address this problem by casting unified screening as a referral problem, in which we choose to activate a subset of screening policies for individual diseases by accounting for competing risks that influence patient outcomes. We introduce a novel optimization framework that incorporates disease risks, budget constraints, and diagnostic error limits and characterize the structural properties of the optimal referral policy. For the unified screening of two diseases, we show that the optimal activation threshold for the screening of one disease depends on the risk of the other, resulting in decision boundaries with distinct risk-dependent profiles. We compare our unified model with independent screening programs that apply isolated activation thresholds for screening of each disease. Our approach optimizes screening decisions collectively, improving overall survival outcomes, particularly for patients with high disease risks.
Paperid:154
Authors:Julien Pourcel · Cédric Colas · Pierre-Yves Oudeyer
Abstract:
Many program synthesis tasks prove too challenging for even state-of-the-art language models to solve in a single attempt. Search-based evolutionary methods offer a promising alternative by exploring solution spaces iteratively, but their effectiveness remains limited by the fixed capabilities of the underlying generative model. We propose SOAR, a method that learns program synthesis by integrating language models into a self-improving evolutionary loop. SOAR alternates between (1) an evolutionary search that uses an LLM to generate and refine candidate solutions, and (2) a hindsight learning phase that converts search attempts into valid problem-solution pairs used to fine-tune the LLM's generation and refinement capabilities, enabling increasingly effective search in subsequent iterations. On the challenging ARC-AGI benchmark, SOAR achieves significant performance gains across model scales and iterations, leveraging positive transfer between the generation and refinement fine-tuning tasks. These improvements carry over to test-time adaptation, where SOAR achieves state-of-the-art results among program synthesis methods based on open-source LLMs.
Paperid:155
Authors:Wei Chen · Jun-Xiang Mao · Xiaozheng Wang · Min-Ling Zhang
Abstract:
The learnware paradigm aims to establish a learnware dock system that contains numerous learnwares, each consisting of a well-trained model and a specification, enabling users to reuse high-performing models for their tasks instead of training from scratch. The specification, as a unique characterization of the model's specialties, dominates the effectiveness of model reuse. Existing specification methods mainly employ distribution alignment to generate specifications. However, this approach overlooks the model's discriminative performance, hindering an adequate specialty characterization. In this paper, we claim that it is beneficial to incorporate such discriminative performance for high-quality specification generation. Accordingly, a novel specification approach named Dali, i.e., Learnware Specification via Dual ALIgnment, is proposed. In Dali, the characterization of the model's discriminative performance is modeled as discriminative alignment, which is considered along with distribution alignment in the specification generation process. Theoretical and empirical analyses clearly demonstrate that the proposed approach is capable of facilitating model reuse in the learnware paradigm with high-quality specification generation.
Paperid:156
Authors:Daniel Eftekhari · Vardan Papyan
Abstract:
The normal distribution plays a central role in information theory – it is at the same time the best-case signal and worst-case noise distribution, has the greatest representational capacity of any distribution, and offers an equivalence between uncorrelatedness and independence for joint distributions. Accounting for the mean and variance of activations throughout the layers of deep neural networks has had a significant effect on facilitating their effective training, but seldom has a prescription for precisely what distribution these activations should take, and how this might be achieved, been offered. Motivated by the information-theoretic properties of the normal distribution, we address this question and concurrently present normality normalization: a novel normalization layer which encourages normality in the feature representations of neural networks using the power transform and employs additive Gaussian noise during training. Our experiments comprehensively demonstrate the effectiveness of normality normalization, in regards to its generalization performance on an array of widely used model and dataset combinations, its strong performance across various common factors of variation such as model width, depth, and training minibatch size, its suitability for usage wherever existing normalization layers are conventionally used, and as a means to improving model robustness to random perturbations.
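The two ingredients named above (a power transform toward normality plus additive Gaussian noise) can be illustrated with a minimal NumPy sketch; the exponent grid, the Gaussianity score, and the noise scale below are illustrative choices, not the paper's exact layer.

```python
import numpy as np

def normality_normalize(x, noise_std=0.1, lambdas=np.linspace(0.1, 2.0, 20),
                        train=True, rng=None):
    """Sketch of a normality-encouraging normalization for one activation vector.

    Standardizes x, applies a signed power transform with the exponent that
    minimizes a skewness/kurtosis Gaussianity penalty (a stand-in for a proper
    normal log-likelihood fit), then adds Gaussian noise during training.
    """
    rng = rng or np.random.default_rng(0)
    x = (x - x.mean()) / (x.std() + 1e-8)
    best, best_score = x, -np.inf
    for lam in lambdas:
        y = np.sign(x) * np.abs(x) ** lam            # signed power transform
        y = (y - y.mean()) / (y.std() + 1e-8)
        # Gaussianity proxy: penalize squared skewness and excess kurtosis
        score = -(np.mean(y ** 3) ** 2) - (np.mean(y ** 4) - 3.0) ** 2
        if score > best_score:
            best, best_score = y, score
    if train:
        best = best + noise_std * rng.standard_normal(best.shape)
    return best
```

On a heavy-tailed input, the selected exponent pulls the activation distribution markedly closer to a standard normal.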
Paperid:157
Authors:Tiansheng Wen · Yifei Wang · Zequn Zeng · Zhong Peng · Yudi Su · Xinyang Liu · Bo Chen · Hongwei Liu · Stefanie Jegelka · Chenyu You
Abstract:
Many large-scale systems rely on high-quality deep representations (embeddings) to facilitate tasks like retrieval, search, and generative modeling. Matryoshka Representation Learning (MRL) recently emerged as a solution for adaptive embedding lengths, but it requires full model retraining and suffers from noticeable performance degradation at short lengths. In this paper, we show that sparse coding offers a compelling alternative for achieving adaptive representation with minimal overhead and higher fidelity. We propose Contrastive Sparse Representation (CSR), a method that sparsifies pre-trained embeddings into a high-dimensional but selectively activated feature space. By leveraging lightweight autoencoding and task-aware contrastive objectives, CSR preserves semantic quality while allowing flexible, cost-effective inference at different sparsity levels. Extensive experiments on image, text, and multimodal benchmarks demonstrate that CSR consistently outperforms MRL in terms of both accuracy and retrieval speed—often by large margins—while also cutting training time to a fraction of that required by MRL. Our results establish sparse coding as a powerful paradigm for adaptive representation learning in real-world applications where efficiency and fidelity are both paramount.
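The "selectively activated" idea amounts to a top-k sparse code. A minimal sketch follows; the encoder weights and the top-k rule are illustrative, and CSR's contrastive training objective is not shown:

```python
import numpy as np

def topk_sparse_code(embedding, W_enc, b_enc, k):
    """Encode a dense embedding into a high-dimensional, selectively activated code.

    Keeps only the k largest post-ReLU activations and zeros the rest, so
    downstream retrieval can vary k for different cost/fidelity trade-offs.
    """
    z = np.maximum(W_enc @ embedding + b_enc, 0.0)   # ReLU pre-code
    if k < z.size:
        thresh = np.partition(z, -k)[-k]             # k-th largest value
        z = np.where(z >= thresh, z, 0.0)            # ties may keep a few extra
    return z
```

Raising `k` trades sparsity (and retrieval cost) for fidelity, which is the adaptive-inference knob the abstract describes.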
Paperid:158
Authors:Peimeng Guan · Mark Davenport
Abstract:
Inverse problems aim to reconstruct unseen data from corrupted or perturbed measurements. While most work focuses on improving reconstruction quality, generalization accuracy and robustness are equally important, especially for safety-critical applications. Model-based architectures (MBAs), such as loop unrolling methods, are considered more interpretable and achieve better reconstructions. Empirical evidence suggests that MBAs are more robust to perturbations than black-box solvers, but the accuracy-robustness tradeoff in MBAs remains underexplored. In this work, we propose a simple yet effective training scheme for MBAs, called SGD jittering, which injects noise iteration-wise during reconstruction. We theoretically demonstrate that SGD jittering not only generalizes better than the standard mean squared error training but is also more robust to average-case attacks. We validate SGD jittering using denoising toy examples, seismic deconvolution, and single-coil MRI reconstruction. Both SGD jittering and its SPGD extension yield cleaner reconstructions for out-of-distribution data and demonstrate enhanced robustness against adversarial attacks.
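The jittering scheme itself is simple to state: inject Gaussian noise into the iterate at every unrolled reconstruction step. Here is a toy least-squares version; a real model-based architecture would interleave a learned network between steps, which is omitted in this sketch:

```python
import numpy as np

def jittered_unrolled_recon(y, A, steps=10, lr=0.1, sigma=0.01, rng=None):
    """Unrolled gradient-descent reconstruction with iteration-wise noise.

    At every unrolled step the current estimate is perturbed with Gaussian
    noise before the data-consistency gradient update, mimicking the SGD
    jittering training scheme.
    """
    rng = rng or np.random.default_rng(0)
    x = A.T @ y                                           # simple initialization
    for _ in range(steps):
        x_jit = x + sigma * rng.standard_normal(x.shape)  # inject noise
        grad = A.T @ (A @ x_jit - y)                      # data-consistency gradient
        x = x_jit - lr * grad
    return x
```

With small `sigma` the iterates still converge to a neighborhood of the clean solution, while the injected noise acts as the regularizer the paper analyzes.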
Paperid:159
Authors:Felipe Nuti · Tim Franzmeyer · Joao Henriques
Abstract:
Past work has studied the effects of fine-tuning on large language models' (LLMs) overall performance on certain tasks. However, a way to quantitatively and systematically analyze its effect on individual outputs is still lacking. In this work, we propose a new method for measuring the contribution that fine-tuning makes to individual LLM responses, assuming access to the original pre-trained model. Our method takes into account the model's intermediate hidden states, giving a more fine-grained insight into the effects of fine-tuning than a simple comparison of the final outputs of pre-trained and fine-tuned models. We introduce and theoretically analyze an exact decomposition of any fine-tuned LLM into a pre-training component and a fine-tuning component. Empirically, we find that one can steer model behavior and performance by up- or down-scaling the fine-tuning component during the forward pass. Motivated by this finding and our theoretical analysis, we define the Tuning Contribution ($\mathrm{TuCo}$) in terms of the ratio of the magnitudes of the fine-tuning component and the pre-training component. We find that three prominent adversarial attacks on LLMs circumvent safety measures in a way that reduces the Tuning Contribution, and that $\mathrm{TuCo}$ is consistently lower on prompts where the attacks succeed compared to ones where they do not. This suggests that attenuating the effect of fine-tuning on model outputs plays a role in the success of these attacks. In short, $\mathrm{TuCo}$ enables the quantitative study of how fine-tuning influences model behavior and safety, and vice versa.
Paperid:160
Authors:Jessica Dai · Paula Gradu · Inioluwa Raji · Benjamin Recht
Abstract:
When an individual reports a negative interaction with some system, how can their personal experience be contextualized within broader patterns of system behavior? We study the incident database problem, where individual reports of adverse events arrive sequentially and are aggregated over time. In this work, our goal is to identify whether there are subgroups—defined by any combination of relevant features—that are disproportionately likely to experience harmful interactions with the system. We formalize this problem as a sequential hypothesis test, and identify conditions on reporting behavior that are sufficient for making inferences about disparities in true rates of harm across subgroups. We show that algorithms for sequential hypothesis tests can be applied to this problem with a standard multiple-testing correction. We then demonstrate our method on real-world datasets, including mortgage decisions and vaccine side effects; on each, our method (re-)identifies subgroups known to experience disproportionate harm using only a fraction of the data that was initially used to discover them.
Paperid:161
Authors:Wei Yao · Zeliang Zhang · Huayi Tang · Yong Liu
Abstract:
Model ensemble adversarial attack has become a powerful method for generating transferable adversarial examples that can target even unknown models, but its theoretical foundation remains underexplored. To address this gap, we provide early theoretical insights that serve as a roadmap for advancing model ensemble adversarial attack. We first define transferability error to measure the error in adversarial transferability, alongside concepts of diversity and empirical model ensemble Rademacher complexity. We then decompose the transferability error into vulnerability, diversity, and a constant, which rigorously explains the origin of transferability error in model ensemble attack: the vulnerability of an adversarial example to ensemble components, and the diversity of ensemble components. Furthermore, we apply the latest mathematical tools in information theory to bound the transferability error using complexity and generalization terms, validating three practical guidelines for reducing transferability error: (1) incorporating more surrogate models, (2) increasing their diversity, and (3) reducing their complexity in cases of overfitting. Finally, extensive experiments with 54 models validate our theoretical framework, representing a significant step forward in understanding transferable model ensemble adversarial attacks.
Paperid:162
Authors:Haowen Ma · Zhiguo Long · Hua Meng
Abstract:
Density-based mode-seeking methods generate a \emph{density-ascending dependency} from low-density points towards higher-density neighbors. Current mode-seeking methods identify modes by breaking some dependency connections, but rely heavily on local data characteristics, requiring case-by-case threshold settings or human intervention to be effective for different datasets. To address this issue, we introduce a novel concept called \emph{typicality}, by exploring the \emph{locally defined} dependency from a \emph{global} perspective, to quantify how likely a point is to be a mode. We devise an algorithm that effectively and efficiently identifies modes with the help of the global-view typicality. To implement and validate our idea, we design a clustering method called TANGO, which not only leverages typicality to detect modes, but also utilizes graph-cut with an improved \emph{path-based similarity} to aggregate data into the final clusters. Moreover, this paper also provides some theoretical analysis on the proposed algorithm. Experimental results on several synthetic and extensive real-world datasets demonstrate the effectiveness and superiority of TANGO.
Paperid:163
Authors:Mengyang Sun · Yihao Wang · Tao Feng · Dan Zhang · Yifan Zhu · Jie Tang
Abstract:
In order to streamline the fine-tuning of foundation models, Low-Rank Adapters (LoRAs) have been substantially adopted across various fields, including instruction tuning and domain adaptation. The underlying concept of LoRA involves decomposing a full-rank matrix into the product of two lower-rank matrices, which reduces storage consumption and accelerates the training process. Furthermore, to address the limited expressive capacity of LoRA, the Mixture-of-Experts (MoE) approach has been introduced to incorporate multiple LoRA adapters. The integration of LoRA experts leads to a visible improvement across several downstream scenarios. However, the mixture of LoRAs (MoE-LoRA) still exhibits low robustness during tuning and inference. Inspired by Riemannian Preconditioners, which train LoRA as a sub-space projector, we propose a new training strategy for MoE-LoRA to stabilize and boost its feature learning procedure via multi-space projections. Examinations on SGD and AdamW optimizers demonstrate the effectiveness of our methodology. The code will be available upon publication.
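For reference, the object being mixed, a frozen weight plus gated low-rank LoRA updates, can be sketched as follows. The gating and shapes are illustrative; the paper's multi-space-projection contribution concerns how these factors are optimized, which is not shown:

```python
import numpy as np

def moe_lora_forward(x, W0, experts, gate_logits, alpha=1.0):
    """Forward pass of a frozen weight W0 plus a gated mixture of LoRA experts.

    Each expert is a (B, A) pair with B: (d_out, r) and A: (r, d_in); each
    low-rank update B @ A is mixed by softmax gate weights and added to the
    frozen base mapping.
    """
    g = np.exp(gate_logits - gate_logits.max())
    g = g / g.sum()                                   # softmax gate
    delta = sum(w * (B @ (A @ x)) for w, (B, A) in zip(g, experts))
    return W0 @ x + alpha * delta
```

With `B` initialized to zero (standard LoRA practice), the mixture starts out exactly equal to the frozen model.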
Paperid:164
Authors:Peijia Qin · Jianguo Zhang
Abstract:
Deep neural networks incorporating discrete latent variables have shown significant potential in sequence modeling. A notable approach is to leverage vector quantization (VQ) to generate discrete representations within a codebook. However, its discrete nature prevents the use of standard backpropagation, which has led to challenges in efficient codebook training. In this work, we introduce Meta-Quantization (MQ), a novel vector quantization training framework inspired by meta-learning. Our method separates the optimization of the codebook and the auto-encoder into two levels. Furthermore, we introduce a hyper-net to replace the embedding-parameterized codebook, enabling the codebook to be dynamically generated based on the feedback from the auto-encoder. Different from previous VQ objectives, our innovation results in a meta-objective that makes the codebook training task-aware. We validate the effectiveness of MQ with VQ-VAE and VQGAN architectures on image reconstruction and generation tasks. Experimental results showcase the superior generative performance of MQ, underscoring its potential as a robust alternative to existing VQ methods.
Paperid:165
Authors:Matthew Joseph · Alex Kulesza · Alexander Yu
Abstract:
We study the $\ell_2$ mechanism for computing a $d$-dimensional statistic with bounded $\ell_2$ sensitivity under approximate differential privacy. Across a range of privacy parameters, we find that the $\ell_2$ mechanism obtains error approaching that of the Laplace mechanism as $d \to 1$ and approaching that of the Gaussian mechanism as $d \to \infty$; however, it dominates both in between.
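For intuition, noise with density proportional to $\exp(-\lVert z\rVert_2/\sigma)$ can be sampled by drawing a Gamma-distributed radius and a uniform direction; whether this matches the paper's exact mechanism and its calibration to the privacy parameters is an assumption of this sketch.

```python
import numpy as np

def l2_mechanism_noise(d, scale, rng=None):
    """Sample noise with density proportional to exp(-||z||_2 / scale) in R^d.

    The radial density is r^(d-1) exp(-r/scale), i.e. Gamma(d, scale), and the
    direction is uniform on the sphere; this is the standard sampler for such
    spherically symmetric distributions.
    """
    rng = rng or np.random.default_rng(0)
    r = rng.gamma(shape=d, scale=scale)          # radius ~ Gamma(d, scale)
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                       # uniform direction
    return r * u
```

In one dimension this density is exactly the Laplace distribution, consistent with the $d \to 1$ behavior described above.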
Paperid:166
Authors:Yihao Xue · Jiping Li · Baharan Mirzasoleiman
Abstract:
Weak-to-Strong Generalization (W2SG), where a weak model supervises a stronger one, serves as an important analogy for understanding how humans might guide superhuman intelligence in the future. Promising empirical results revealed that a strong model can surpass its weak supervisor. While recent work has offered theoretical insights into this phenomenon, a clear understanding of the interactions between weak and strong models that drive W2SG remains elusive. We investigate W2SG through a theoretical lens and show that it can be characterized using kernels derived from the principal components of weak and strong models' internal representations. These kernels can be used to define a space that, at a high level, captures what the weak model is unable to learn but is learnable by the strong model. The projection of labels onto this space quantifies how much the strong model falls short of its full potential due to weak supervision. This characterization also provides insights into how certain errors in weak supervision can be corrected by the strong model, regardless of overfitting. Our theory has significant practical implications, providing a representation-based metric that predicts W2SG performance trends without requiring labels, as shown in experiments on molecular predictions with transformers and 5 NLP tasks involving 52 LLMs.
Paperid:167
Authors:Adel Javanmard · Vahab Mirrokni · Jean Pouget-Abadie
Abstract:
Estimating causal effects from randomized experiments is only possible if participants are willing to disclose their potentially sensitive responses. Differential privacy, a widely used framework for ensuring an algorithm’s privacy guarantees, can encourage participants to share their responses without the risk of de-anonymization. However, many mechanisms achieve differential privacy by adding noise to the original dataset, which reduces the precision of causal effect estimation. This introduces a fundamental trade-off between privacy and variance when performing causal analyses on differentially private data. In this work, we propose a new differentially private mechanism, Cluster-DP, which leverages a given cluster structure in the data to improve the privacy-variance trade-off. While our results apply to any clustering, we demonstrate that identifying higher-quality clusters—using a quality measure we define—can reduce the variance penalty while maintaining privacy guarantees. Finally, we evaluate the theoretical and empirical performance of our Cluster-DP algorithm on both real and simulated data, comparing it to common baselines, including two special cases of our algorithm: its unclustered version and a uniform-prior version.
Paperid:168
Authors:Haobo Jiang · Jin Xie · jian Yang · Liang Yu · Jianmin Zheng
Abstract:
In this paper, we propose a novel 3D registration paradigm, Generative Point Cloud Registration, which bridges advanced 2D generative models with 3D matching tasks to enhance registration performance. Our key idea is to generate crossview consistent image pairs that are well-aligned with the source and target point clouds, enabling geometric-color feature fusion to facilitate robust matching. To ensure high-quality matching, the generated image pair should feature both 2D-3D geometric consistency and cross-view texture consistency. To achieve this, we introduce Match-ControlNet, a matching-specific, controllable 2D generative model. Specifically, it leverages the depth-conditioned generation capability of ControlNet to produce images that are geometrically aligned with depth maps derived from point clouds, ensuring 2D-3D geometric consistency. Additionally, by incorporating a coupled conditional denoising scheme and coupled prompt guidance, Match-ControlNet further promotes cross-view feature interaction, guiding texture consistency generation. Our generative 3D registration paradigm is general and could be seamlessly integrated into various registration methods to enhance their performance. Extensive experiments on 3DMatch and ScanNet datasets verify the effectiveness of our approach.
Paperid:169
Authors:Wei Xiao · Johnson Tsun-Hsuan Wang · Chuang Gan · Daniela Rus
Abstract:
Safe learning is central to AI-enabled robots, where a single failure may lead to catastrophic results. Existing safe learning methods do not scale, are inefficient and hard to train, and tend to generate unstable signals under noisy inputs, making them challenging to deploy on robots. To address these challenges, we propose the Adaptive explicit-Barrier Net (ABNet), in which barriers explicitly show up in the closed-form model that guarantees safety. The ABNet has the potential to incrementally scale toward larger safe foundation models. Each head of ABNet can learn safe control policies from different features and focus on a specific part of the observation. In this way, we do not need to directly construct a large model for complex tasks, which significantly facilitates the training of the model while ensuring its stable output. Most importantly, we can still formally prove the safety guarantees of ABNet. We demonstrate the efficiency and strength of ABNet in 2D robot obstacle avoidance, safe robot manipulation, and vision-based end-to-end autonomous driving, with results showing much better robustness and guarantees over existing models.
Paperid:170
Authors:Gal Dalal · Assaf Hallak · Gugan Chandrashekhar Mallika Thoppe · Shie Mannor · Gal Chechik
Abstract:
Policy gradient methods are notorious for having a large variance and high sample complexity. To mitigate this, we introduce SoftTreeMax, a generalization of softmax that employs planning. In SoftTreeMax, we extend the traditional logits with the multi-step discounted cumulative reward, topped with the logits of future states. We analyze SoftTreeMax and explain how tree expansion helps to reduce its gradient variance. We prove that the variance depends on the chosen tree-expansion policy. Specifically, we show that the closer the induced transitions are to being state-independent, the stronger the variance decay. With approximate forward models, we prove that the resulting gradient bias diminishes with the approximation error while retaining the same variance reduction. Ours is the first result to bound the gradient bias for an approximate model. In a practical implementation of SoftTreeMax, we utilize a parallel GPU-based simulator for fast and efficient tree expansion. Using this implementation in Atari, we show that SoftTreeMax reduces the gradient variance by three orders of magnitude. This leads to better sample complexity and improved performance compared to distributed PPO.
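The extended logits can be made concrete with a tiny exhaustive tree expansion. In the sketch below, `step_fn` and `logits_fn` are stand-in interfaces for the simulator and the policy network, and the exact aggregation in the paper may differ:

```python
import numpy as np

def softtreemax_policy(state, step_fn, logits_fn, depth, gamma=0.99):
    """SoftTreeMax-style action distribution via exhaustive tree expansion.

    Each depth-length action sequence is scored by its discounted cumulative
    reward plus the logits at the leaf state; scores are log-sum-exp'ed back
    to the first action and softmaxed. step_fn(state, a) -> (next_state,
    reward); logits_fn(state) -> per-action logits.
    """
    n_actions = len(logits_fn(state))

    def leaf_scores(s, remaining, acc, disc):
        if remaining == 0:
            return [acc + l for l in logits_fn(s)]      # leaf logits
        scores = []
        for a in range(n_actions):
            s2, r = step_fn(s, a)
            scores.extend(leaf_scores(s2, remaining - 1, acc + disc * r, disc * gamma))
        return scores

    extended = []
    for a in range(n_actions):
        s2, r = step_fn(state, a)
        leaves = np.array(leaf_scores(s2, depth - 1, r, gamma))
        m = leaves.max()
        extended.append(m + np.log(np.exp(leaves - m).sum()))  # log-sum-exp
    z = np.array(extended)
    p = np.exp(z - z.max())
    return p / p.sum()
```

A practical implementation batches this expansion on a GPU simulator rather than recursing, but the scoring is the same idea.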
Paperid:171
Authors:Ignacio Peis · Batuhan Koyuncu · Isabel Valera · Jes Frellsen
Abstract:
We introduce a novel generative framework for functions by integrating Implicit Neural Representations (INRs) and Transformer-based hypernetworks into latent variable models. Unlike prior approaches that rely on MLP-based hypernetworks with scalability limitations, our method employs a Transformer-based decoder to generate INR parameters from latent variables, addressing both representation capacity and computational efficiency. Our framework extends latent diffusion models (LDMs) to INR generation by replacing standard decoders with a Transformer-based hypernetwork, which can be trained either from scratch or via hyper-transforming—a strategy that fine-tunes only the decoder while freezing the pre-trained latent space. This enables efficient adaptation of existing generative models to INR-based representations without requiring full retraining. We validate our approach across multiple modalities, demonstrating improved scalability, expressiveness, and generalization over existing INR-based generative models. Our findings establish a unified and flexible framework for learning structured function representations.
Paperid:172
Authors:Anne Ouyang · Simon Guo · Simran Arora · Alex Zhang · William Hu · Christopher Re · Azalia Mirhoseini
Abstract:
Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate kernel generation. We introduce **KernelBench**, an open-source framework for evaluating LMs' ability to write fast and correct kernels on a suite of 250 carefully selected PyTorch ML workloads. KernelBench represents a real-world engineering environment, and making progress on the introduced benchmark directly translates to faster practical kernels. We introduce a new evaluation metric, $\text{fast}_p$, which measures the percentage of generated kernels that are functionally correct and offer a speedup greater than an adjustable threshold $p$ over baseline. Our experiments across various state-of-the-art models and test-time methods show that frontier reasoning models perform the best out of the box but still fall short overall, matching the PyTorch baseline in less than 20\% of the cases. While we show that results can improve by leveraging execution and profiling feedback during iterative refinement, KernelBench remains a challenging benchmark, with its difficulty increasing as we raise the speedup threshold $p$.
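The $\text{fast}_p$ metric is straightforward to compute from per-task results; the tuple layout below is illustrative, not KernelBench's actual schema:

```python
def fast_p(results, p):
    """Compute a fast_p-style score: the fraction of tasks whose generated
    kernel is functionally correct AND achieves a speedup greater than p.

    `results` is a list of (correct: bool, speedup: float) pairs, where
    speedup is baseline_time / kernel_time.
    """
    if not results:
        return 0.0
    hits = sum(1 for correct, speedup in results if correct and speedup > p)
    return hits / len(results)
```

Note that $\text{fast}_0$ recovers plain correctness, while $\text{fast}_1$ is the fraction of kernels that are both correct and faster than the baseline.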
Paperid:173
Authors:Abdelhakim Benechehab · Vasilii Feofanov · Giuseppe Paolo · Albert Thomas · Maurizio Filippone · Balázs Kégl
Abstract:
Pre-trained foundation models (FMs) have shown exceptional performance in univariate time series forecasting tasks. However, several practical challenges persist, including managing intricate dependencies among features and quantifying uncertainty in predictions. This study aims to tackle these critical limitations by introducing adapters: feature-space transformations that facilitate the effective use of pre-trained univariate time series FMs for multivariate tasks. Adapters operate by projecting multivariate inputs into a suitable latent space and applying the FM independently to each dimension in a zero-shot manner. Inspired by the literature on representation learning and partially stochastic Bayesian neural networks, we present a range of adapters and optimization/inference strategies. Experiments conducted on both synthetic and real-world datasets confirm the efficacy of adapters, demonstrating substantial enhancements in forecasting accuracy and uncertainty quantification compared to baseline methods. Our framework, AdaPTS, positions adapters as a modular, scalable, and effective solution for leveraging time series FMs in multivariate contexts, thereby promoting their wider adoption in real-world applications.
Paperid:174
Authors:Shogo Iwazaki · Shion Takeno
Abstract:
We study the Gaussian process (GP) bandit problem, whose goal is to minimize regret under an unknown reward function lying in some reproducing kernel Hilbert space (RKHS). The maximum posterior variance analysis is vital in analyzing near-optimal GP bandit algorithms such as maximum variance reduction (MVR) and phased elimination (PE). Therefore, we first show a new upper bound on the maximum posterior variance, which improves the dependence on the noise variance parameter of the GP. By leveraging this result, we refine MVR and PE to obtain (i) a nearly optimal regret upper bound in the noiseless setting and (ii) regret upper bounds that are optimal with respect to the RKHS norm of the reward function. Furthermore, as another application of our proposed bound, we analyze the GP bandit under the time-varying noise variance setting, which is the kernelized extension of the linear bandit with heteroscedastic noise. For this problem, we show that MVR- and PE-based algorithms achieve noise variance-dependent regret upper bounds, which match our regret lower bound.
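The quantity at the center of this analysis, the GP posterior variance, is computed by the standard formula below; it is shown only to fix notation, since the paper's contribution is bounding its maximum:

```python
import numpy as np

def gp_posterior_variance(K, k_star, k_ss, noise_var):
    """Posterior variance of a GP at a test point x*.

    K: n x n kernel matrix on observed inputs; k_star: kernel vector between
    x* and the observations; k_ss: prior variance k(x*, x*). Implements
    sigma_n^2(x*) = k_ss - k_star^T (K + noise_var * I)^{-1} k_star.
    """
    n = K.shape[0]
    v = np.linalg.solve(K + noise_var * np.eye(n), k_star)  # (K + s^2 I)^{-1} k*
    return k_ss - k_star @ v
```

With a single observation and k_ss = k_star = K = 1, this reduces to noise_var / (1 + noise_var), making explicit how the noise variance parameter enters the bound being improved.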
Paperid:175
Authors:Hongwei Li · Yuheng Tang · Shiqi Wang · Wenbo Guo
Abstract:
Recent research builds various patching agents that combine large language models (LLMs) with non-ML tools and achieve promising results on the state-of-the-art (SOTA) software patching benchmark, SWE-Bench. Based on how they determine the patching workflow, existing patching agents can be categorized as agent-based planning methods, which rely on LLMs for planning, and human-based planning methods, which follow a pre-defined workflow. At a high level, agent-based planning methods achieve high patching performance but with a high cost and limited stability. Human-based planning methods, on the other hand, are more stable and efficient but have key workflow limitations that compromise their patching performance. In this paper, we propose PatchPilot, an agentic patcher that strikes a balance between patching efficacy, stability, and cost-efficiency. PatchPilot proposes a novel human-based planning workflow with five components: reproduction, localization, generation, validation, and refinement (where refinement is unique to PatchPilot). We introduce novel and customized designs for each component to optimize their effectiveness and efficiency. Through extensive experiments on the SWE-Bench benchmarks, PatchPilot shows superior performance compared to existing open-source methods while maintaining low cost (less than \$1 per instance) and ensuring higher stability. We also conduct a detailed ablation study to validate the key designs in each component.
Paperid:176
Authors:Bariscan Kurtkaya · Fatih Dinc · Mert Yuksekgonul · Marta Blanco-Pozo · Ege Cirakman · Mark Schnitzer · Yucel Yemez · Hidenori Tanaka · Yuan · Nina Miolane
Abstract:
Short-term memory is essential for cognitive processing, yet our understanding of its neural mechanisms remains unclear. A key focus in neuroscience has been the study of sequential activity patterns, where neurons fire one after another within large networks, to explain how information is maintained. While recurrent connections were shown to drive sequential dynamics, a mechanistic understanding of this process still remains unknown. In this work, we first introduce two unique mechanisms that can subserve short-term memory: slow-point manifolds generating direct sequences or limit cycles providing temporally localized approximations. Then, through analytical models, we identify fundamental properties that govern the selection of these mechanisms, \textit{i.e.}, we derive theoretical scaling laws for critical learning rates as a function of the delay period length, beyond which no learning is possible. We empirically verify these observations by training and evaluating more than 35,000 recurrent neural networks (RNNs) that we will publicly release upon publication. Overall, our work provides new insights into short-term memory mechanisms and proposes experimentally testable predictions for systems neuroscience.
Paperid:177
Authors:Lutfi Erdogan · Hiroki Furuta · Sehoon Kim · Nicholas Lee · Suhong Moon · Gopala Anumanchipalli · Kurt Keutzer · Amir Gholaminejad
Abstract:
Large language models (LLMs) have shown remarkable advancements in enabling language agents to tackle complex, multi-step, long-horizon tasks such as web navigation. However, existing approaches struggle with challenges like misalignment between high-level user queries and low-level executable actions, difficulty in maintaining consistency over long-horizon tasks, and a lack of grounding in specific environments. To address these issues, we propose Plan-and-Act, a novel framework that incorporates explicit planning into LLM-based agents. Plan-and-Act consists of a Planner model that generates high-level plans to achieve user goals and an Executor model that translates these plans into precise, environment-specific actions. By explicitly separating planning from execution, Plan-and-Act improves alignment between the high-level reasoning and low-level actions, which enables consistent decision-making and adaptability to changes in the environment. To train the Planner effectively, we introduce a synthetic data generation method that annotates ground-truth trajectories with feasible plans, augmented by diverse and extensive examples to enhance generalization. We evaluate Plan-and-Act in the WebArena environment, demonstrating a SOTA 54% success rate on the WebArena-Lite benchmark.
Paperid:178
Authors:Sam Gijsen · Kerstin Ritter
Abstract:
Multimodal language modeling has enabled breakthroughs for representation learning, yet remains unexplored in the realm of functional brain data for pathology detection. This paper pioneers EEG-language models (ELMs) trained on clinical reports and 15,000 EEGs. We propose to combine multimodal alignment in this novel domain with time-series cropping and text segmentation, enabling an extension based on multiple instance learning to alleviate misalignment between irrelevant EEG or text segments. Our multimodal models significantly improve pathology detection compared to EEG-only models across four evaluations and for the first time enable zero-shot classification as well as retrieval of both neural signals and reports. In sum, these results highlight the potential of ELMs, representing significant progress for clinical applications.
Paperid:179
Authors:Yinyan Bu · Jiajie Yu · Kai Zheng · Xinyu Zhang · Piya Pal
Abstract:
We address the challenge of achieving angular super-resolution in multi-antenna radar systems that are widely used for localization, navigation, and automotive perception. A multi-antenna radar achieves very high resolution by computationally creating a large virtual sensing system using very few physical antennas. However, practical constraints imposed by hardware, noise, and a limited number of antennas can impede its performance. Conventional supervised learning models that rely on extensive pre-training with large datasets often exhibit poor generalization in unseen environments. To overcome these limitations, we propose NEAR, an untrained implicit neural representation (INR) framework that predicts radar responses at unseen locations from sparse measurements, by leveraging latent harmonic structures inherent in radar wave propagation. We establish new theoretical results linking antenna array response to the expressive power of INR architectures, and develop a novel physics-informed and latent geometry-aware regularizer. Our approach integrates classical signal representation with modern implicit neural learning, enabling high-resolution radar sensing that is both interpretable and generalizable. Extensive simulations and real-world experiments using radar platforms demonstrate NEAR's effectiveness and its ability to adapt to unseen environments.
Paperid:180
Authors:Wei Qu · Jiawei Guan · Rui Ma · kezhai · weikun wu · haobo Wang
Abstract:
We introduce Pallatom, an innovative protein generation model capable of producing protein structures with all-atom coordinates. Pallatom directly learns and models the joint distribution $P(\textit{structure}, \textit{seq})$ by focusing on $P(\textit{all-atom})$, effectively addressing the interdependence between sequence and structure in protein generation. To achieve this, we propose a novel network architecture specifically designed for all-atom protein generation. Our model employs a dual-track framework that tokenizes proteins into token-level and atomic-level representations, integrating them through a multi-layer decoding process with "traversing" representations and a recycling mechanism. We also introduce the $\texttt{atom14}$ representation method, which unifies the description of unknown side-chain coordinates, ensuring high fidelity between the generated all-atom conformation and its physical structure. Experimental results demonstrate that Pallatom excels in key metrics of protein design, including designability, diversity, and novelty, showing significant improvements across the board. Our model not only enhances the accuracy of protein generation but also exhibits excellent training efficiency, paving the way for future applications in larger and more complex systems.
Paperid:181
Authors:Yen-Jung Hsu · Ming-Yi Hong · Miao-Chen Chiang · Che Lin
Abstract:
Sequential recommendation in e-commerce utilizes users' anonymous browsing histories to personalize product suggestions without relying on private information. Existing item ID-based methods and multimodal models often overlook the temporal alignment of modalities like textual descriptions, visual content, and prices in user browsing sequences. To address this limitation, this paper proposes the Multimodal Time-aligned Shared Token Recommender (MTSTRec), a transformer-based framework with a single time-aligned shared token per product for efficient cross-modality fusion. MTSTRec preserves the distinct contributions of each modality while aligning them temporally to better capture user preferences. Extensive experiments demonstrate that MTSTRec achieves state-of-the-art performance across multiple sequential recommendation benchmarks, significantly improving upon existing multimodal fusion approaches.
Paperid:182
Authors:Kai Liu · Kaicheng Yang · Zheng Chen · Zhiteng Li · Yong Guo · Wenbo Li · Linghe Kong · Yulun Zhang
Abstract:
While super-resolution (SR) methods based on diffusion models (DM) have demonstrated inspiring performance, their deployment is impeded by heavy memory and computation requirements. Recent researchers apply two kinds of methods to compress or accelerate the DM. One is to compress the DM into 1-bit, aka binarization, alleviating the storage and computation pressure. The other distills the multi-step DM into only one step, significantly speeding up the inference process. Nonetheless, deploying DMs on resource-limited edge devices remains impractical. To address this problem, we propose BiMaCoSR, which combines binarization and one-step distillation to obtain extreme compression and acceleration. To prevent the catastrophic collapse of the model caused by binarization, we propose a sparse matrix branch (SMB) and a low-rank matrix branch (LRMB). Both auxiliary branches pass the full-precision (FP) information but in different ways. SMB absorbs the extreme values and its output is high rank, carrying abundant FP information. In contrast, the design of LRMB is inspired by LoRA and is initialized with the top-r SVD components, outputting a low-rank representation. The computation and storage overhead of our proposed branches can be safely ignored. Comprehensive comparison experiments show that BiMaCoSR outperforms current state-of-the-art binarization methods and achieves competitive performance compared with the FP one-step model. BiMaCoSR achieves a 23.8x compression ratio and a 27.4x speedup ratio compared to its FP counterpart. Our code will be released soon.
Paperid:183
Authors:Jingyang Ke · Feiyang Wu · Jiyi Wang · Jeffrey Markowitz · Anqi Wu
Abstract:
Traditional approaches to studying decision-making in neuroscience focus on simplified behavioral tasks where animals perform repetitive, stereotyped actions to receive explicit rewards. While informative, these methods constrain our understanding of decision-making to short timescale behaviors driven by explicit goals. In natural environments, animals exhibit more complex, long-term behaviors driven by intrinsic motivations that are often unobservable. Recent works in time-varying inverse reinforcement learning (IRL) aim to capture shifting motivations in long-term, freely moving behaviors. However, a crucial challenge remains: animals make decisions based on their history, not just their current state. To address this, we introduce SWIRL (SWitching IRL), a novel framework that extends traditional IRL by incorporating time-varying, history-dependent reward functions. SWIRL models long behavioral sequences as transitions between short-term decision-making processes, each governed by a unique reward function. SWIRL incorporates biologically plausible history dependency to capture how past decisions and environmental contexts shape behavior, offering a more accurate description of animal decision-making. We apply SWIRL to simulated and real-world animal behavior datasets and show that it outperforms models lacking history dependency, both quantitatively and qualitatively. This work presents the first IRL model to incorporate history-dependent policies and rewards to advance our understanding of complex, naturalistic decision-making in animals.
Paperid:184
Authors:Ethan Rathbun · Christopher Amato · Alina Oprea
Abstract:
Recent works have demonstrated the vulnerability of Deep Reinforcement Learning (DRL) algorithms against training-time, backdoor poisoning attacks. The objectives of these attacks are twofold: induce pre-determined, adversarial behavior in the agent upon observing a fixed trigger during deployment while allowing the agent to solve its intended task during training. Prior attacks assume arbitrary control over the agent's rewards, inducing values far outside the environment's natural constraints. This results in brittle attacks that fail once the proper reward constraints are enforced. Thus, in this work we propose a new class of backdoor attacks against DRL which are the first to achieve state-of-the-art performance under strict reward constraints. These ``inception'' attacks manipulate the agent's training data -- inserting the trigger into prior observations and replacing high-return actions with those of the targeted adversarial behavior. We formally define these attacks and prove they achieve both adversarial objectives against arbitrary Markov Decision Processes (MDPs). Using this framework, we devise an online inception attack which achieves a 100% attack success rate on multiple environments under constrained rewards while minimally impacting the agent's task performance.
Paperid:185
Authors:Zhaohe Liao · Jiangtong Li · Siyu Sun · Qingyang Liu · Fengshun Xiao · Tianjiao Li · Qiang Zhang · Guang Chen · Li Niu · Changjun Jiang · Liqing Zhang
Abstract:
Video Question-Answering (VideoQA) remains challenging in achieving advanced cognitive reasoning due to the uncontrollable and opaque reasoning processes in existing Multimodal Large Language Models (MLLMs). To address this issue, we propose a novel Language-centric Tree Reasoning (LTR) framework that aims to enhance the reasoning ability of models. In detail, it recursively divides the original question into logically manageable parts and conquers them piece by piece, enhancing the reasoning capabilities and interpretability of existing MLLMs. Specifically, in the first stage, LTR focuses on language to recursively generate a language-centric logical tree, which gradually breaks down the complex cognitive question into simple perceptual ones and plans the reasoning path through a RAG-based few-shot approach. In the second stage, with the aid of video content, LTR performs bottom-up logical reasoning within the tree to derive the final answer along with the traceable reasoning path. Experiments across 11 VideoQA benchmarks demonstrate that our LTR framework significantly improves both accuracy and interpretability compared to state-of-the-art MLLMs. To our knowledge, this is the first work to implement a language-centric logical tree to guide MLLM reasoning in VideoQA, paving the way for language-centric video understanding from perception to cognition.
Paperid:186
Authors:Steve Hanneke · Qinglin Meng · Amirreza Shaeiri
Abstract:
We study the problem of multiclass classification when the number of labels can be unbounded within the PAC learning framework. Our main contribution is a theory that demonstrates a simple and elegant agnostic-to-realizable reduction for this framework. This resolves an open problem raised by the recent work of (Hopkins et al., 2022). Notably, our result is the first representation-preserving multiclass agnostic-to-realizable reduction, in contrast with the compression-based approach of the work of (David et al., 2017). Furthermore, our main theorem is stated in an abstract framework, called ``Unified PAC Learning'', which encompasses a range of frameworks, including multiclass PAC learning, list PAC learning, and multilabel PAC learning. In addition, we explore representation-preserving reductions to the realizable setting for two noise models, namely Massart noise and Tsybakov noise, in the multiclass PAC learning framework. We believe our technique may find other applications in ensuing studies of theoretical machine learning.
Paperid:187
Authors:Tuan Truong · Ngọc Quân Phạm · Quyen Tran · Tan Nguyen · Dinh Phung · Trung Le
Abstract:
We introduce Interactive Bayesian Distributional Robustness (IBDR), a novel Bayesian inference framework that allows modeling the interactions between particles, thereby enhancing ensemble quality through increased particle diversity. IBDR is grounded in a generalized theoretical framework that connects the distributional population loss with the approximate posterior, motivating a practical dual optimization procedure that enforces distributional robustness while fostering particle diversity. We evaluate IBDR's performance against various baseline methods using the VTAB-1K benchmark and the common reasoning language task. The results consistently show that IBDR outperforms these baselines, underscoring its effectiveness in real-world applications. Our code is publicly available at https://anonymous.4open.science/r/DRO-1C12/
Paperid:188
Authors:Ruiquan Huang · Yingbin LIANG · Jing Yang
Abstract:
Language recognition tasks are fundamental in natural language processing (NLP) and have been widely used to benchmark the performance of large language models (LLMs). These tasks also play a crucial role in explaining the working mechanisms of transformers. In this work, we focus on two representative tasks in the category of regular language recognition, known as 'even pairs' and 'parity check', the aim of which is to determine whether the occurrences of certain subsequences in a given sequence are even. Our goal is to explore how a one-layer transformer, consisting of an attention layer followed by a linear layer, learns to solve these tasks by theoretically analyzing its training dynamics under gradient descent. While even pairs can be solved directly by a one-layer transformer, parity check needs to be solved by integrating Chain-of-Thought (CoT), either into the inference stage of a transformer well-trained for the even pairs task, or into the training of a one-layer transformer. For both problems, our analysis shows that the joint training of attention and linear layers exhibits two distinct phases. In the first phase, the attention layer grows rapidly, mapping data sequences into separable vectors. In the second phase, the attention layer becomes stable, while the linear layer grows logarithmically and converges in direction to a max-margin hyperplane that correctly separates the attention layer outputs into positive and negative samples, and the loss decreases at a rate of $O(1/t)$. Our experiments validate these theoretical results.
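The two recognition tasks themselves are easy to state outside the transformer setting. A minimal sketch, assuming a binary alphabet (the abstract does not fix one): 'parity check' asks whether the count of 1s is even, and 'even pairs' asks whether the number of adjacent unequal pairs is even, which reduces to comparing the first and last tokens.

```python
def parity_check(seq):
    # True iff the number of 1s in the sequence is even.
    return sum(seq) % 2 == 0

def even_pairs(seq):
    # True iff the number of adjacent unequal pairs (01 or 10) is even.
    flips = sum(a != b for a, b in zip(seq, seq[1:]))
    return flips % 2 == 0

# An even number of "flips" means the sequence ends where it started,
# so even_pairs(seq) is equivalent to seq[0] == seq[-1].
```

The first-equals-last shortcut is consistent with the abstract's finding that even pairs is directly solvable by a one-layer transformer, while parity check requires counting and hence Chain-of-Thought.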
Paperid:189
Authors:Xu Wan · Wenyue Xu · Chao Yang · Mingyang Sun
Abstract:
Recent advancements in Large Language Models (LLMs) and Reinforcement Learning (RL) have shown significant promise in decision-making tasks. Nevertheless, for large-scale industrial decision problems, both approaches face distinct challenges: LLMs lack real-time long-sequence decision-making capabilities, while RL struggles with sample efficiency in vast action spaces. To bridge this gap, we propose Agents Co-Evolution (ACE), a synergistic framework between LLMs and RL agents for large-scale decision-making scenarios. ACE introduces a dual-role trajectory refinement mechanism where LLMs act as both Policy Actor and Value Critic during RL training: the Actor refines suboptimal actions via multi-step reasoning and environment validation, while the Critic performs temporal credit assignment through trajectory-level reward shaping. Concurrently, the RL agent enhances LLMs' task-specific decision-making via prioritized experience replay. Through extensive experiments across multiple power grid operation challenges with action spaces exceeding 60K discrete actions, ACE demonstrates superior performance over existing RL methods and LLM-based methods.
Paperid:190
Authors:Johannes Hertrich · Robert Gruhlke
Abstract:
In order to sample from an unnormalized probability density function, we propose to combine continuous normalizing flows (CNFs) with rejection-resampling steps based on importance weights. We relate the iterative training of CNFs with regularized velocity fields to a JKO scheme and prove convergence of the involved velocity fields to the velocity field of the Wasserstein gradient flow (WGF). The alternation of local flow steps and non-local rejection-resampling steps makes it possible to overcome local minima or slow convergence of the WGF for multimodal distributions. Since the proposal of the rejection step is generated by the model itself, it does not suffer from the common drawbacks of classical rejection schemes. The arising model can be trained iteratively, reduces the reverse Kullback-Leibler (KL) loss function in each step, allows the generation of iid samples, and moreover allows for evaluations of the generated underlying density. Numerical examples show that our method yields accurate results on various test distributions, including high-dimensional multimodal targets, and significantly outperforms the state of the art in almost all cases.
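The non-local step pairs naturally with self-normalized importance weights. A minimal sketch of such a resampling step, assuming access to the unnormalized target log-density and the model's proposal log-density; this is a generic illustration of importance-weighted resampling, not the paper's exact scheme:

```python
import numpy as np

def importance_resample(samples, log_p_tilde, log_q, rng):
    """Resample proposal draws toward an unnormalized target.

    samples: (n, d) draws from the proposal q
    log_p_tilde: unnormalized target log-density at the samples, shape (n,)
    log_q: proposal log-density at the samples, shape (n,)
    """
    log_w = log_p_tilde - log_q
    log_w -= log_w.max()                 # stabilize before exponentiating
    w = np.exp(log_w)
    probs = w / w.sum()                  # self-normalized weights
    idx = rng.choice(len(samples), size=len(samples), p=probs)
    return samples[idx]

# Toy check: proposal N(0, 2^2), target N(3, 1); constants in the
# log-densities cancel in the self-normalized weights.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=(20000, 1))
log_p = -0.5 * (x[:, 0] - 3.0) ** 2      # unnormalized N(3, 1)
log_q = -0.5 * (x[:, 0] / 2.0) ** 2      # unnormalized N(0, 4)
y = importance_resample(x, log_p, log_q, rng)
```

After resampling, the empirical mean of `y` moves from the proposal mean toward the target mean.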
Paperid:191
Authors:Guangyi Wang · Wei Peng · lijiang Li · Wenyu Chen · Yuren Cai · Song-Zhi Su
Abstract:
While powerful for generation, Diffusion Probabilistic Models (DPMs) face slow sampling challenges, for which various distillation-based methods have been proposed. However, they typically require significant additional training costs and model parameter storage, limiting their practicality. In this work, we propose PCA-based Adaptive Search (PAS), which optimizes existing solvers for DPMs with minimal additional costs. Specifically, we first employ PCA to obtain a few basis vectors to span the high-dimensional sampling space, which enables us to learn just a set of coordinates to correct the sampling direction; furthermore, based on the observation that the cumulative truncation error exhibits an ``S''-shape, we design an adaptive search strategy that further enhances the sampling efficiency and reduces the number of stored parameters to approximately 10. Extensive experiments demonstrate that PAS can significantly enhance existing fast solvers in a plug-and-play manner with negligible costs. For example, on CIFAR-10, PAS improves DDIM's FID from 15.69 to 4.37 (NFE=10) using only 12 parameters and sub-minute training on a single A100 GPU. Code will be available.
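The first ingredient, a PCA basis spanning the sampling space, is a standard construction. A minimal sketch, assuming the basis is fit on a matrix of flattened intermediate sampling states; the variable names and the additive correction form are illustrative, not taken from the paper:

```python
import numpy as np

def pca_basis(states, k):
    """Top-k principal directions of a set of high-dimensional vectors.

    states: (n, d) matrix, e.g. flattened intermediate sampling states
    returns: (k, d) matrix with orthonormal rows
    """
    centered = states - states.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]

def corrected_direction(direction, basis, coords):
    # Correct a d-dimensional sampling direction using only k learned
    # coordinates in the PCA subspace.
    return direction + coords @ basis

rng = np.random.default_rng(1)
states = rng.normal(size=(64, 256))          # 64 states of dimension 256
basis = pca_basis(states, k=4)
d = rng.normal(size=256)
d_new = corrected_direction(d, basis, np.array([0.1, -0.2, 0.0, 0.3]))
```

Learning k coordinates instead of a d-dimensional correction is what keeps the stored parameter count tiny, as the abstract emphasizes.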
Paperid:192
Authors:Marco Cusumano-Towner · David Hafner · Alexander Hertzberg · Brody Huval · Aleksei Petrenko · Eugene Vinitsky · Erik Wijmans · Taylor Killian · Stuart Bowers · Ozan Sener · Philipp Kraehenbuehl · Vladlen Koltun
Abstract:
Self-play has powered breakthroughs in two-player and multi-player games. Here we show that self-play is a surprisingly effective strategy in another domain. We show that robust and naturalistic driving emerges entirely from self-play in simulation at unprecedented scale -- 1.6 billion km of driving. This is enabled by Gigaflow, a batched simulator that can synthesize and train on 42 years of subjective driving experience per hour on a single 8-GPU node. The resulting policy achieves state-of-the-art performance on three independent autonomous driving benchmarks. The policy outperforms the prior state of the art when tested on recorded real-world scenarios, amidst human drivers, without ever seeing human data during training. The policy is realistic when assessed against human references and achieves unprecedented robustness, averaging 17.5 years of continuous driving between incidents in simulation.
Paperid:193
Authors:Aojun Lu · Tao Feng · Hangjie Yuan · Yanan Sun
Abstract:
The quest for Continual Learning (CL) seeks to empower neural networks with the ability to learn and adapt incrementally. Central to this pursuit is addressing the stability-plasticity dilemma, which involves striking a balance between two conflicting objectives: preserving previously learned knowledge and acquiring new knowledge. Existing studies have proposed numerous CL methods to achieve this trade-off. However, these methods often overlook the impact of basic architecture on stability and plasticity, thus the trade-off is limited to the parameter level. In this paper, we delve into the conflict between stability and plasticity at the architectural level. We reveal that under an equal parameter constraint, deeper networks exhibit better plasticity, while wider networks are characterized by superior stability. To address this architectural-level dilemma, we introduce a novel framework termed Dual-Architecture (Dual-Arch), which serves as a plug-in component for CL. This framework leverages the complementary strengths of two distinct and independent networks: one dedicated to plasticity and the other to stability. Each network is designed with a specialized and lightweight architecture, tailored to its respective objective. Extensive experiments across datasets and CL methods demonstrate that Dual-Arch can enhance the performance of existing CL methods while being up to 87\% more compact in terms of parameters than the baselines.
Paperid:194
Authors:Nearchos Potamitis · Lars Klein · Roland Aydin · Robert West · Caglar Gulcehre · Akhil Arora
Abstract:
While numerous frameworks have been developed to enhance the reasoning abilities of large language models (LLMs), there is a scarcity of methods that effectively balance the trade-off between cost and quality. In this paper, we introduce Fleet of Agents (FoA), a novel and intuitive yet principled framework utilizing LLMs as agents to navigate through dynamic tree searches, employing a genetic-type particle filtering approach. FoA spawns a multitude of agents, each exploring the search space autonomously, followed by a selection phase where resampling based on a heuristic value function optimizes the balance between exploration and exploitation. This mechanism enables dynamic branching, adapting the exploration strategy based on discovered solutions. We conduct extensive experiments on three benchmark tasks, “Game of 24”, “Mini-Crosswords”, and “WebShop”, utilizing four different LLMs, “GPT-3.5”, “GPT-4”, “LLaMA3.2-11B”, and “LLaMA3.2-90B”. On average across all tasks and LLMs, FoA obtains a quality improvement of $\simeq5$% while requiring only $\simeq40$% of the cost of previous SOTA methods. Notably, our analyses reveal that (1) FoA achieves the best cost-quality trade-off among all benchmarked methods and (2) FoA + LLaMA3.2-11B surpasses the LLaMA3.2-90B model. FoA is publicly available at https://anonymous.4open.science/r/FoA-4D83
Paperid:195
Authors:Woohyun Lee · Hogun Park
Abstract:
Defending Graph Neural Networks (GNNs) against adversarial attacks requires balancing accuracy and robustness, a tradeoff often mishandled by traditional methods like adversarial training that intertwine these conflicting objectives within a single classifier. To overcome this limitation, we propose a self-supervised adversarial purification framework. We separate robustness from the classifier by introducing a dedicated purifier, which cleanses the input data before classification. In contrast to prior adversarial purification methods, we propose GPR-GAE, a novel graph auto-encoder, as a specialized purifier trained with a self-supervised strategy, adapting to diverse graph structures in a data-driven manner. Utilizing multiple Generalized PageRank (GPR) filters, GPR-GAE captures diverse structural representations for robust and effective purification. Our multi-step purification process further facilitates GPR-GAE to achieve precise graph recovery and robust defense against structural perturbations. Experiments across diverse datasets and attack scenarios demonstrate state-of-the-art robustness of GPR-GAE, showcasing it as an independent plug-and-play purifier for GNN classifiers.
Paperid:196
Authors:Hila Chefer · Uriel Singer · Amit Zohar · Yuval Kirstain · Adam Polyak · Yaniv Taigman · Lior Wolf · Shelly Sheynin
Abstract:
Despite tremendous recent progress, generative video models still struggle to capture real-world motion, dynamics, and physics. We show that this limitation arises from the conventional pixel reconstruction objective, which biases models toward appearance fidelity at the expense of motion coherence. To address this, we introduce VideoJAM, a novel framework that instills an effective motion prior into video generators by encouraging the model to learn a joint appearance-motion representation. VideoJAM is composed of two complementary units. During training, we extend the objective to predict both the generated pixels and their corresponding motion from a single learned representation. During inference, we introduce Inner-Guidance, a mechanism that steers the generation toward coherent motion by leveraging the model's own evolving motion prediction as a dynamic guidance signal. Notably, our framework can be applied to any video model with minimal adaptations, requiring no modifications to the training data or scaling of the model. VideoJAM achieves state-of-the-art performance in motion coherence, surpassing highly competitive proprietary models while also enhancing the perceived visual quality of the generations. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation.
Paperid:197
Authors:Ostmeier · Brian Axelrod · Maya Varma · Michael Moseley · Akshay Chaudhari · Curtis Langlotz
Abstract:
Transformer architectures rely on position encodings to capture token dependencies. Rotary Position Encoding (RoPE) has emerged as a popular choice in language models due to its efficient encoding of relative position information through key-query rotations. However, RoPE faces significant limitations beyond language processing: it is constrained to one-dimensional sequence data and, even with learnable phases, offers limited representational capacity. We address these challenges with Lie Relative Encodings (LieRE), which replaces RoPE's block-2D rotation matrix with a learned, dense, high-dimensional rotation matrix of variable density. Through extensive evaluation on three image datasets across 2D and 3D classification tasks, LieRE achieves a 2\% relative improvement over state-of-the-art baselines on 2D tasks and 1.5\% on 3D tasks, while demonstrating superior generalization to higher resolutions. Our implementation is computationally efficient, with results reproducible on 4 A100 GPUs in 30 minutes on CIFAR-100, and we release our code to facilitate further research.
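A learned dense rotation of the kind LieRE describes can be parameterized through the Lie-algebra route: skew-symmetrize a learned matrix and take the matrix exponential. A minimal sketch under that assumption (the paper's actual parameterization may differ); scaling the generator by position preserves RoPE's relative-position property, since R(p) R(q)^T = R(p - q):

```python
import numpy as np

def expm(S, terms=40):
    # Taylor-series matrix exponential with scaling and squaring.
    s = max(0, int(np.ceil(np.log2(max(np.linalg.norm(S, 2), 1e-12)))) + 1)
    X = S / 2.0 ** s
    E, term = np.eye(len(S)), np.eye(len(S))
    for k in range(1, terms):
        term = term @ X / k
        E = E + term
    for _ in range(s):
        E = E @ E
    return E

def lie_rotation(pos, generator):
    # Skew-symmetrizing puts the generator in the Lie algebra so(d);
    # the matrix exponential then lands in the rotation group SO(d).
    skew = 0.5 * (generator - generator.T)
    return expm(pos * skew)

rng = np.random.default_rng(0)
G = rng.normal(size=(8, 8))     # learned generator (here: random stand-in)
R = lie_rotation(2.0, G)        # an 8x8 rotation encoding position 2
```

Because all positions share one generator, the per-position rotations commute, which is what makes the encoding relative rather than absolute.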
Paperid:198
Authors:Ian Gemp · Andreas Haupt · Luke Marris · Siqi Liu · Georgios Piliouras
Abstract:
Behavioral diversity, expert imitation, fairness, safety goals and others give rise to preferences in sequential decision making domains that do not decompose additively across time. We introduce the class of convex Markov games that allow general convex preferences over occupancy measures. Despite infinite time horizon and strictly higher generality than Markov games, pure strategy Nash equilibria exist. Furthermore, equilibria can be approximated empirically by performing gradient descent on an upper bound of exploitability. Our experiments reveal novel solutions to classic repeated normal-form games, find fair solutions in a repeated asymmetric coordination game, and prioritize safe long-term behavior in a robot warehouse environment. In the prisoner's dilemma, our algorithm leverages transient imitation to find a policy profile that deviates from observed human play only slightly, yet achieves higher per-player utility while also being three orders of magnitude less exploitable.
Paperid:199
Authors:Nikita Tsoy · Ivan Kirev · Negin Rahimiyazdi · Nikola Konstantinov
Abstract:
Performativity, the phenomenon where outcomes are influenced by predictions, is particularly prevalent in social contexts where individuals strategically respond to a deployed model. In order to preserve the high accuracy of machine learning models under distribution shifts caused by performativity, Perdomo et al. (2020) introduced the concept of performative risk minimization (PRM). While this framework ensures model accuracy, it overlooks the impact of the PRM on the underlying distributions and the predictions of the model. In this paper, we initiate the analysis of the impact of PRM by studying a sequential performative risk minimization problem with binary random variables and linear performative shifts. We formulate two natural measures of impact. In the case of full information, where the distribution dynamics are known, we derive explicit formulas for the PRM solution and our impact measures. In the case of partial information, we provide performativity-aware statistical estimators, as well as simulations. Our analysis contrasts PRM to alternatives that do not model data shift and indicates that PRM can have amplified side effects compared to such methods.
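The setting described, binary outcomes with a linear performative shift, admits a compact simulation in the style of Perdomo et al.'s repeated risk minimization. A minimal sketch with illustrative constants (`p0` and `eps` are not from the paper); under squared loss the best response to the deployed distribution is its mean, so repeated retraining iterates to a performatively stable point p0 / (1 - eps):

```python
def deployed_mean(theta, p0=0.4, eps=0.2):
    # Linear performative shift: deploying prediction theta moves the
    # Bernoulli mean of the outcome (clipped to stay a probability).
    return min(1.0, max(0.0, p0 + eps * theta))

def repeated_risk_minimization(theta0, rounds=50, p0=0.4, eps=0.2):
    # Retrain against the distribution induced by the last deployed model:
    # under squared loss the optimizer is the current deployed mean.
    theta = theta0
    for _ in range(rounds):
        theta = deployed_mean(theta, p0, eps)
    return theta

theta_stable = repeated_risk_minimization(0.0)   # -> p0 / (1 - eps) = 0.5
```

The gap between this stable point and the baseline mean `p0` is exactly the kind of side effect on the underlying distribution that the paper's impact measures are designed to quantify.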
Paperid:200
Authors:Junyan Liu · Lillian Ratliff
Abstract:
We study the repeated principal-agent bandit game, where the principal indirectly explores an unknown environment by incentivizing an agent to play arms. Unlike prior work that assumes a greedy agent with full knowledge of reward means, we consider a self-interested learning agent who iteratively updates reward estimates and may explore arbitrarily with some probability. As a warm-up, we first consider a self-interested learning agent without exploration. We propose algorithms for both i.i.d. and linear reward settings with bandit feedback in a finite horizon $T$, achieving regret bounds of $\widetilde{O}(\sqrt{T})$ and $\widetilde{O}(T^{\frac{2}{3}})$, respectively. Specifically, these algorithms rely on a novel elimination framework coupled with new search algorithms which accommodate the uncertainty from the agent's learning behavior. We then extend the framework to handle an exploratory learning agent and develop an algorithm that achieves a $\widetilde{O}(T^{\frac{2}{3}})$ regret bound in the i.i.d. reward setup by enhancing the robustness of our elimination framework to potential agent exploration. Finally, when our agent model reduces to that in (Dogan et al., 2023a), we propose an algorithm based on our robust framework which achieves a $\widetilde{O}(\sqrt{T})$ regret bound, significantly improving upon their $\widetilde{O}(T^{\frac{11}{12}})$ bound.
Paperid:201
Authors:Irmak Sivgin · Sara Fridovich-Keil · Gordon Wetzstein · Mert Pilanci
Abstract:
Volume parameterizations abound in recent literature, encompassing methods from classic voxel grids to implicit neural representations. While implicit representations offer impressive capacity and improved memory efficiency compared to voxel grids, they traditionally require training through nonconvex optimization, which can be slow and sensitive to initialization and hyperparameters. We introduce GA-Planes, a novel family of implicit neural volume representations inspired by Geometric Algebra that can be trained using convex optimization, addressing the limitations of nonconvex methods. GA-Planes models generalize many existing representations, including any combination of features stored in tensor basis elements followed by a neural feature decoder, and can be adapted to convex or nonconvex training as needed for various inverse problems. In the 2D setting, we prove GA-Planes models are equivalent to a low-rank plus low-resolution matrix factorization that outperforms the classic low-rank plus sparse decomposition for fitting a natural image. In 3D, GA-Planes models exhibit competitive expressiveness, model size, and optimizability across tasks such as radiance field reconstruction, 3D segmentation, and video segmentation.
Paperid:202
Authors:Chenxi Wang · Linxiao Yang · Zhixian Wang · Liang Sun · Yi Wang
Abstract:
Diffusion models, known for their generative ability, have recently been adapted to time series analysis. Most pioneering works rely on the standard isotropic diffusion, treating each time step and the entire frequency spectrum identically. However, this may not be suitable for time series, which often have more informative low-frequency components. We empirically found that direct application of standard diffusion to time series may cause gradient contradiction during training, due to the rapid decrease of low-frequency information in the diffusion process. To this end, we propose a novel time series diffusion model, MA-TSD, which utilizes the moving average, a natural low-frequency filter, as the forward transition. Its backward process can be accelerated like DDIM and can further be viewed as time series super-resolution. Our experiments on various datasets demonstrate MA-TSD's superior performance in time series forecasting and super-resolution tasks.
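The forward transition described here can be illustrated with a plain centered moving average, which attenuates high frequencies first while leaving the low-frequency trend largely intact. A minimal sketch (the window size and the optional noise term are illustrative, not the paper's parameterization):

```python
import numpy as np

def moving_average_step(x, window=3):
    # One forward step: smooth the series with a centered moving average
    # ("same" mode keeps the length). This is a natural low-pass filter.
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="same")

def forward_process(x, steps, window=3, noise=0.0, rng=None):
    # Iterated smoothing (optionally with small additive noise) plays the
    # role of the isotropic noising process in a standard diffusion model.
    rng = rng or np.random.default_rng(0)
    traj = [x]
    for _ in range(steps):
        x = moving_average_step(x, window)
        if noise > 0:
            x = x + noise * rng.normal(size=x.shape)
        traj.append(x)
    return traj

t = np.linspace(0, 4 * np.pi, 256)
x0 = np.sin(t) + 0.5 * np.sin(8 * t)     # low- plus high-frequency component
traj = forward_process(x0, steps=20)
```

After a few steps the fast oscillation is largely gone while the slow sinusoid survives, which is why reversing this process resembles time series super-resolution.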
Paperid:203
Authors:Lorenzo Lucchese · Mikko Pakkanen · Almut E. D. Veraart
Abstract:
The expected signature maps a collection of data streams to a lower dimensional representation, with a remarkable property: the resulting feature tensor can fully characterize the data generating distribution. This "model-free" embedding has been successfully leveraged to build multiple domain-agnostic machine learning (ML) algorithms for time series and sequential data. The convergence results proved in this paper bridge the gap between the expected signature's empirical discrete-time estimator and its theoretical continuous-time value, allowing for a more complete probabilistic interpretation of expected signature-based ML methods. Moreover, when the data generating process is a martingale, we suggest a simple modification of the expected signature estimator with significantly lower mean squared error and empirically demonstrate how it can be effectively applied to improve predictive performance.
Paperid:204
Authors:junmin zhong · Emiliano Quinones Yumbla · Seyed Yousef Soltanian · Ruofan Wu · Wenlong Zhang · Jennie Si
Abstract:
This study proposes an innovative approach to soft exosuit-assisted human walking. Inspired by this real-life problem involving a wearable robot, we seek to overcome persistent difficulties in ensuring reliable reinforcement learning (RL)-based methods for physical device control. To address the multitude of challenges, including limited data and the lack of a simulator engine for human-robot interaction during walking, the demand for reduced computational overhead for real-time deployment, and time-efficient adaptation to achieve personalized control while ensuring human safety, we propose an online Adaptation from an offline Imitating expert Policy (AIP) approach. Our offline learning mimics human expert actions through real human walking demonstrations without robot assistance. The resulting policy is then used to initialize online actor-critic learning, the goal of which is to optimally personalize robot assistance. In addition to being fast and robust, our online RL method also possesses important properties such as learning convergence, dynamic stability, and solution optimality. We have successfully demonstrated our simple and robust framework for safe robot control on all five tested human participants, without selectively presenting results. The qualitative performance guarantees provided by our online RL, along with the consistent experimental validation of AIP control, represent the first demonstration of online adaptation for soft exosuit control personalization and serve as important evidence for the use of online RL in controlling a physical device to solve a real-life problem.
Paperid:205
Authors:Yu Zhou · Bingxuan Li · Mohan Tang · Xiaomeng Jin · Te-Lin Wu · Kuan-Hao Huang · Heng Ji · Kai-Wei Chang · Nanyun Peng
Abstract:
Large multimodal models (LMMs) often struggle to recognize novel concepts, as they rely on pretrained knowledge and have limited ability to capture subtle visual details. Domain-specific knowledge gaps in training also make them prone to confusing visually similar, commonly misrepresented, or low-resource concepts. To help LMMs better align nuanced visual features with language, improving their ability to recognize and reason about novel or rare concepts, we propose a Contrastive visual Data Augmentation (CoDA) strategy. CoDA extracts key contrastive textual and visual features of target concepts against the known concepts they are misrecognized as, and then uses multimodal generative models to produce targeted synthetic data. Automatic filtering of extracted features and augmented images is implemented to guarantee their quality, as verified by human annotators. We show the effectiveness and efficiency of CoDA on low-resource concept and diverse scene recognition datasets including iNaturalist and SUN. We additionally collect NovelSpecies, a benchmark dataset consisting of newly discovered animal species that are guaranteed to be unseen by LMMs. LLaVA-1.6 1-shot updating results on these three datasets show that CoDA significantly improves over SOTA visual data augmentation strategies, with absolute accuracy gains of 12.3% (NovelSpecies), 5.1% (SUN), and 6.0% (iNat).
Paperid:206
Authors:Weizhi Gao · Zhichao Hou · Junqi Yin · Feiyi Wang · Linyu Peng · Xiaorui Liu
Abstract:
Diffusion models have emerged as powerful generative models, but their high computation cost in iterative sampling remains a significant bottleneck. In this work, we present an in-depth and insightful study of state-of-the-art acceleration techniques for diffusion models, including caching and quantization, revealing their limitations in computation error and generation quality. To break these limits, this work introduces Modulated Diffusion (MoDiff), an innovative, rigorous, and principled framework that accelerates generative modeling through modulated quantization and error compensation. MoDiff not only inherits the advantages of existing caching and quantization methods but also serves as a general framework to accelerate all diffusion models. The advantages of MoDiff are supported by solid theoretical insight and analysis. In addition, extensive experiments on CIFAR-10 and LSUN demonstrate that MoDiff significantly reduces activation quantization from 8 bits to 3 bits without performance degradation in post-training quantization (PTQ). Our implementation is available at https://anonymous.4open.science/r/MoDiff-78DF/ and will be made publicly available.
Paperid:207
Authors:Chengying Fang · Wenke Huang · Guancheng Wan · Yihao Yang · Mang Ye
Abstract:
Federated Prompt Learning (FPL) adapts pretrained Vision-Language Models (VLMs) to federated learning through prompt tuning, leveraging their transferable representations and strong generalization capabilities. Traditional methods often require uniform prompt lengths for federated aggregation, limiting adaptability to clients with diverse prompt lengths and distribution biases. In this paper, we propose Federated Prompt Learning for Heterogeneous Client Adaptation (FedPHA), a novel framework that combines a fixed-length global prompt for efficient aggregation with local prompts of varying lengths to capture client-specific data characteristics. Additionally, FedPHA designs Singular Value Decomposition (SVD) based projection and bidirectional alignment to disentangle global conflicts arising from client heterogeneity, ensuring that personalized client tasks effectively utilize non-harmful global knowledge. This approach ensures that global knowledge improves model generalization while local knowledge preserves local optimization. Experimental results validate the effectiveness of FedPHA in achieving a balance between global and personalized knowledge in federated learning scenarios.
Paperid:208
Authors:Yixin Chen · Wenjing Chen · Alan Kuhnle
Abstract:
With the rapid growth of data in modern applications, parallel combinatorial algorithms for maximizing non-monotone submodular functions have gained significant attention. The state-of-the-art approximation ratio of $1/e$ is currently achieved only by a continuous algorithm (Ene & Nguyen, 2020) with adaptivity $\mathcal{O}(\log(n))$. In this work, we focus on size constraints and propose a $(1/4 - \varepsilon)$-approximation algorithm with high probability for this problem, as well as the first randomized parallel combinatorial algorithm achieving a $(1/e - \varepsilon)$ approximation ratio, which bridges the gap between continuous and combinatorial approaches. Both algorithms achieve $\mathcal{O}(\log(n)\log(k))$ adaptivity and $\mathcal{O}(n\log(n)\log(k))$ query complexity. Empirical results show our algorithms achieve competitive objective values, with the first algorithm particularly efficient in queries.
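For readers new to the area, the flavor of randomized combinatorial algorithms for this problem can be seen in the classical sequential random greedy (this is an illustrative baseline on a toy coverage objective, not the paper's low-adaptivity parallel algorithm):

```python
import random

def coverage(universe_covered_by, S):
    # f(S) = number of universe elements covered; a toy submodular oracle
    covered = set()
    for e in S:
        covered |= universe_covered_by[e]
    return len(covered)

def random_greedy(elements, f, k, rng=random):
    """Maximize f(S) subject to |S| <= k.

    At each step, sample uniformly among the (at most) k candidates with the
    largest marginal gain -- the randomization is what yields 1/e-type
    guarantees for non-monotone objectives in the sequential setting.
    """
    S = []
    for _ in range(k):
        cand = [e for e in elements if e not in S]
        if not cand:
            break
        cand.sort(key=lambda e: f(S + [e]) - f(S), reverse=True)
        S.append(rng.choice(cand[:k]))
    return S
```

The paper's contribution is, roughly, achieving comparable guarantees while replacing this inherently sequential loop with $\mathcal{O}(\log(n)\log(k))$ adaptive rounds.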
Paperid:209
Authors:Hancheng Min · Rene Vidal
Abstract:
Deep learning-based classifiers are known to be vulnerable to adversarial attacks. Existing methods for defending against such attacks require adding a defense mechanism or modifying the learning procedure (e.g., by adding adversarial examples). This paper shows that for certain data distributions one can learn a provably robust classifier using standard learning methods and without adding a defense mechanism. More specifically, this paper addresses the problem of finding a robust classifier for a binary classification problem in which the data comes from an isotropic mixture of Gaussians with orthonormal cluster centers. First, we characterize the largest $\ell_2$-attack any classifier can defend against while maintaining high accuracy, and show the existence of optimal robust classifiers achieving this maximum $\ell_2$-robustness. Next, we show that given data sampled from the orthonormal Gaussian mixture model, gradient flow on a two-layer network with a polynomial ReLU activation and without adversarial examples provably finds an optimal robust classifier.
Paperid:210
Authors:Guangzhi Sun · Yudong Yang · Jimin Zhuang · Changli Tang · Yixuan Li · Wei Li · Zejun MA · Chao Zhang
Abstract:
While recent advancements in reasoning optimization have significantly enhanced the capabilities of large language models (LLMs), existing efforts to improve reasoning have been limited to solving mathematical problems and focusing on visual graphical inputs, neglecting broader applications in general video understanding. This paper proposes video-SALMONN-o1, the first open-source reasoning-enhanced audio-visual LLM designed for general video understanding tasks. To enhance its reasoning abilities, we develop a reasoning-intensive dataset featuring challenging audio-visual questions with step-by-step solutions. We also propose process direct preference optimization (pDPO), which leverages contrastive step selection to achieve efficient step-level reward modelling tailored for multimodal inputs. Additionally, we introduce RivaBench, the first reasoning-intensive video understanding benchmark, featuring over 4,000 high-quality, expert-curated question-answer pairs across scenarios such as standup comedy, academic presentations, and synthetic video detection. video-SALMONN-o1 achieves 3-8% accuracy improvements over the LLaVA-OneVision baseline across different video reasoning benchmarks. Besides, pDPO achieves 6-8% improvements compared to the supervised fine-tuning model on RivaBench. Enhanced reasoning also endows video-SALMONN-o1 with zero-shot synthetic video detection capabilities.
Paperid:211
Authors:Antoine Wehenkel · Juan L. Gamella · Ozan Sener · Jens Behrmann · Guillermo Sapiro · Jörn Jacobsen · Marco Cuturi
Abstract:
Driven by steady progress in deep generative modeling, simulation-based inference (SBI) has emerged as the workhorse for inferring the parameters of stochastic simulators. However, recent work has demonstrated that model misspecification can harm SBI's reliability, preventing its adoption in important applications where only misspecified simulators are available. This work introduces robust posterior estimation (RoPE), a framework that overcomes model misspecification with a small real-world calibration set of ground truth parameter measurements. We formalize the misspecification gap as the solution of an optimal transport (OT) problem between learned representations of real-world and simulated observations, allowing RoPE to learn a model of the misspecification without placing additional assumptions on its nature. RoPE shows how the calibration set and OT together offer a controllable balance between calibrated uncertainty and informative inference even under severely misspecified simulators. Results on four synthetic tasks and two real-world problems with ground-truth labels demonstrate that RoPE outperforms baselines and consistently returns informative and calibrated credible intervals.
Paperid:212
Authors:Yukang Yang · Declan Campbell · Kaixuan Huang · Mengdi Wang · Jonathan Cohen · Taylor Webb
Abstract:
Many recent studies have found evidence for emergent reasoning capabilities in large language models, but debate persists concerning the robustness of these capabilities, and the extent to which they depend on structured reasoning mechanisms. To shed light on these issues, we perform a comprehensive study of the internal mechanisms that support abstract rule induction in an open-source language model (Llama3-70B). We identify an emergent symbolic architecture that implements abstract reasoning via a series of three computations. In early layers, abstraction heads convert input tokens to abstract variables based on the relations between those tokens. In intermediate layers, symbolic induction heads perform sequence induction over these variables. Finally, in later layers, retrieval heads predict the next token by retrieving the value associated with the predicted variable. These results point toward a resolution of the longstanding debate between symbolic and neural network approaches, suggesting that emergent reasoning in neural networks depends on the emergence of symbolic mechanisms.
Paperid:213
Authors:Wenhao Shen · Wanqi Yin · Xiaofeng Yang · Cheng Chen · Chaoyue Song · Zhongang Cai · Lei Yang · Hao Wang · Guosheng Lin
Abstract:
Human mesh recovery (HMR) from a single image is inherently ill-posed due to depth ambiguity and occlusions. Probabilistic methods have tried to solve this by generating numerous plausible 3D human mesh predictions, but they often exhibit misalignment with 2D image observations and weak robustness to in-the-wild images. To address these issues, we propose ADHMR, a framework that Aligns Diffusion-based HMR model in a preference optimization manner. First, we train a human mesh prediction assessment model, HMR-Scorer, capable of evaluating predictions even for in-the-wild images without 3D annotations. We then use HMR-Scorer to create a preference dataset, where each input image has a pair of winner and loser mesh predictions. This dataset is used to fine-tune the base model using direct preference optimization. Moreover, HMR-Scorer also helps improve existing HMR models by data cleaning, even with fewer training samples. Extensive experiments show that ADHMR outperforms current state-of-the-art methods.
Paperid:214
Authors:Antonis Vasileiou · Ben Finkelshtein · Floris Geerts · Ron Levie · Christopher Morris
Abstract:
The expressive power of message-passing graph neural networks (MPNNs) is reasonably well understood, primarily through combinatorial techniques from graph isomorphism testing. However, MPNNs' generalization abilities---making meaningful predictions beyond the training set---remain less explored. Current generalization analyses often overlook graph structure, limit the focus to specific aggregation functions, and assume the impractical, hard-to-optimize $0$-$1$ loss function. Here, we extend recent advances in graph similarity theory to assess the influence of graph structure, aggregation, and loss functions on MPNNs' generalization abilities. Our empirical study supports our theoretical insights, improving our understanding of MPNNs' generalization properties.
Paperid:215
Authors:Anant Khandelwal
Abstract:
Animating clipart images with seamless motion while maintaining visual fidelity and temporal coherence presents significant challenges. Existing methods, such as AniClipart, effectively model spatial deformations but often fail to ensure smooth temporal transitions, resulting in artifacts like abrupt motions and geometric distortions. Similarly, text-to-video (T2V) and image-to-video (I2V) models struggle to handle clipart due to the mismatch in statistical properties between natural video and clipart styles. This paper introduces FlexiClip, a novel approach designed to overcome these limitations by addressing the intertwined challenges of temporal consistency and geometric integrity. FlexiClip extends traditional Bézier curve-based trajectory modeling with key innovations: temporal Jacobians to correct motion dynamics incrementally, continuous-time modeling via probability flow ODEs (pfODEs) to mitigate temporal noise, and a flow matching loss inspired by GFlowNet principles to optimize smooth motion transitions. These enhancements ensure coherent animations across complex scenarios involving rapid movements and non-rigid deformations. Extensive experiments validate the effectiveness of FlexiClip in generating animations that are not only smooth and natural but also structurally consistent across diverse clipart types, including humans and animals. By integrating spatial and temporal modeling with pre-trained video diffusion models, FlexiClip sets a new standard for high-quality clipart animation, offering robust performance across a wide range of visual content. Project Page: https://creative-gen.github.io/flexiclip.github.io/
Paperid:216
Authors:Geigh Zollicoffer · Kenneth Eaton · Jonathan Balloch · Julia Kim · Wei Zhou · Robert Wright · Mark Riedl
Abstract:
Reinforcement learning (RL) using world models has found significant recent successes. However, when a sudden change to world mechanics or properties occurs then agent performance and reliability can dramatically decline. We refer to the sudden change in visual properties or state transitions as novelties. Implementing novelty detection within generated world model frameworks is a crucial task for protecting the agent when deployed. In this paper, we propose straightforward bounding approaches to incorporate novelty detection into world model RL agents by utilizing the misalignment of the world model's hallucinated states and the true observed states as a novelty score. We provide effective approaches to detecting novelties in a distribution of transitions learned by an agent in a world model. Finally, we show the advantage of our work in a novel environment compared to traditional machine learning novelty detection methods as well as currently accepted RL-focused novelty detection algorithms.
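The core idea, scoring novelty by the misalignment between the world model's hallucinated states and the true observed states, can be sketched in a few lines. The L2 distance and the quantile-based threshold below are illustrative assumptions, not the paper's exact bounding approach:

```python
import numpy as np

def novelty_scores(predicted, observed):
    """Per-transition novelty score: distance between the world model's
    predicted next states and the states actually observed."""
    return np.linalg.norm(np.asarray(predicted) - np.asarray(observed), axis=-1)

def novelty_threshold(calibration_scores, q=0.99):
    """Simple bound from in-distribution transitions: flag any transition
    whose misalignment exceeds the q-quantile of calibration scores."""
    return float(np.quantile(calibration_scores, q))
```

At deployment, a transition whose score exceeds the threshold would be treated as a novelty and could trigger a protective fallback policy.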
Paperid:217
Authors:Aro Lee · Ji Oon Lee
Abstract:
We consider a spiked random matrix model obtained by applying a function entrywise to a signal-plus-noise symmetric data matrix. We prove that the largest eigenvalue of this model, which we call a transformed spiked Wigner matrix, exhibits a Baik-Ben Arous-Péché (BBP)-type phase transition. We show that the law of the fluctuation converges to the Gaussian distribution when the effective signal-to-noise ratio (SNR) is above the critical number, and to the GOE Tracy-Widom distribution when the effective SNR is below the critical number. We provide precise formulas for the limiting distributions and also concentration estimates for the largest eigenvalues, both in the supercritical and the subcritical regimes.
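The BBP transition is easy to observe numerically for a plain (untransformed) spiked Wigner matrix, which may help build intuition for the result above; this sketch does not implement the paper's entrywise-transformed model:

```python
import numpy as np

def top_eigenvalue_spiked_wigner(n, snr, rng):
    """Largest eigenvalue of W + snr * x x^T, where W is a symmetric Wigner
    matrix normalized so its bulk edge sits at 2, and x is a unit spike.

    BBP prediction: for snr <= 1 the top eigenvalue sticks to the edge 2;
    for snr > 1 it detaches to approximately snr + 1/snr.
    """
    A = rng.standard_normal((n, n))
    W = (A + A.T) / np.sqrt(2 * n)   # off-diagonal variance 1/n
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)
    return np.linalg.eigvalsh(W + snr * np.outer(x, x))[-1]
```

Running this for SNR values on either side of the critical value 1 shows the eigenvalue detaching from the bulk edge only in the supercritical regime.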
Paperid:218
Authors:William Chen · Jinchuan Tian · Yifan Peng · Brian Yan · Chao-Han Yang · Shinji Watanabe
Abstract:
Neural scaling laws offer valuable insights for designing robust sequence processing architectures. While these laws have been extensively characterized in other modalities, their behavior in speech remains comparatively underexplored. In this work, we introduce OWLS, an open-access, reproducible suite of multilingual speech recognition and translation models spanning 0.25B to 18B parameters, with the 18B version being the largest speech model, to the best of our knowledge. OWLS leverages up to 360K hours of public speech data across 150 languages, enabling a systematic investigation into how data, model, and compute scaling each influence performance in multilingual speech tasks. We use OWLS to derive neural scaling laws, showing how final performance can be reliably predicted when scaling. One of our key findings is that scaling enhances performance on low-resource languages/dialects, helping to mitigate bias and improve the accessibility of speech technologies. Finally, we show how OWLS can be used to power new research directions by discovering emergent abilities in large-scale speech models.
Paperid:219
Authors:Youhe Jiang · Fangcheng Fu · Xiaozhe Yao · Guoliang HE · Xupeng Miao · Ana Klimovic · Bin Cui · Binhang Yuan · Eiko Yoneki
Abstract:
Recent advancements in Large Language Models (LLMs) have led to increasingly diverse requests, accompanied by varying resource (compute and memory) demands to serve them. However, this in turn degrades the cost-efficiency of LLM serving as common practices primarily rely on homogeneous GPU resources. In response to this problem, this work conducts a thorough study about serving LLMs over heterogeneous GPU resources on cloud platforms. The rationale is that different GPU types exhibit distinct compute and memory characteristics, aligning well with the divergent resource demands of diverse requests. Particularly, through comprehensive benchmarking, we discover that the cost-efficiency of LLM serving can be substantially optimized by meticulously determining GPU composition, deployment configurations, and workload assignments. Subsequently, we design a scheduling algorithm via mixed-integer linear programming, aiming at deducing the most cost-efficient serving plan under the constraints of price budget and real-time GPU availability. Remarkably, our approach effectively outperforms homogeneous and heterogeneous baselines under a wide array of scenarios, covering diverse workload traces, varying GPU availabilities, and multi-model serving. This casts new light on more accessible and efficient LLM serving over heterogeneous cloud resources.
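As a toy illustration of the underlying optimization (a brute-force stand-in, not the paper's mixed-integer linear program), one can exhaustively search GPU compositions that meet per-workload demand at minimum cost; the GPU names, prices, and throughput numbers below are entirely made up:

```python
from itertools import product

# Hypothetical catalog: hourly price and requests/sec per workload type,
# assuming each GPU instance is dedicated to a single workload type.
GPUS = {
    "A10": {"price": 1.0, "tput": {"short": 40, "long": 5}},
    "A100": {"price": 4.0, "tput": {"short": 90, "long": 30}},
}

def cheapest_for(workload, need, max_per_type=6):
    """Cheapest GPU counts serving `need` requests/sec of one workload type."""
    best, names = None, list(GPUS)
    for counts in product(range(max_per_type + 1), repeat=len(names)):
        tput = sum(c * GPUS[n]["tput"][workload] for n, c in zip(names, counts))
        if tput < need:
            continue
        price = sum(c * GPUS[n]["price"] for n, c in zip(names, counts))
        if best is None or price < best[0]:
            best = (price, dict(zip(names, counts)))
    return best

def serving_plan(demand, budget):
    """Combine per-workload optima; return None if the price budget is exceeded."""
    plan, total = {}, 0.0
    for w, need in demand.items():
        price, counts = cheapest_for(w, need)
        plan[w], total = counts, total + price
    return (plan, total) if total <= budget else None
```

A real scheduler must additionally handle shared GPUs, deployment configurations, and real-time availability, which is why the paper formulates the problem as an MILP rather than an exhaustive search.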
Paperid:220
Authors:Zheng Lian · Haoyu Chen · Lan Chen · Haiyang Sun · Licai Sun · Yong Ren · Zebang Cheng · Bin Liu · Rui Liu · Xiaojiang Peng · Jiangyan Yi · Jianhua Tao
Abstract:
The emergence of multimodal large language models (MLLMs) advances multimodal emotion recognition (MER) to the next level—from naive discriminative tasks to complex emotion understanding with advanced video understanding abilities and natural language description. However, the current community suffers from a lack of large-scale datasets with intensive, descriptive emotion annotations, as well as a multimodal-centric framework to maximize the potential of MLLMs for emotion understanding. To address this, we establish a new benchmark for MLLM-based emotion understanding with a novel dataset (MER-Caption), and a new model (AffectGPT). Utilizing our model-based crowd-sourcing data collection strategy, we construct by far the largest descriptive emotion dataset to date, featuring over 2K fine-grained emotion categories across 115K samples. We also introduce the AffectGPT model, designed with pre-fusion operations to enhance multimodal integration. Finally, we present MER-UniBench, a unified benchmark with evaluation metrics tailored for both typical MER tasks and the free-form, natural language output style of MLLMs. Extensive experimental results demonstrate AffectGPT's robust performance across various MER tasks. We are publicly releasing both the AffectGPT model and the MER-Caption dataset to foster further research and development in emotion understanding.
Paperid:221
Authors:Honghua Dong · Jiacheng Yang · Xun Deng · Yuhe Jiang · Gennady Pekhimenko · Fan Long · Xujie Si
Abstract:
Type inference for dynamic languages like Python is a persistent challenge in software engineering. While large language models (LLMs) have shown promise in code understanding, their type inference capabilities remain underexplored. We introduce TypyBench, a benchmark designed to evaluate LLMs' type inference across entire Python repositories. TypyBench features two novel metrics: TypeSim, which captures nuanced semantic relationships between predicted and ground truth types, and TypeCheck, which assesses type consistency across codebases. Our evaluation of various LLMs on a curated dataset of 50 high-quality Python repositories reveals that, although LLMs achieve decent TypeSim scores, they struggle with complex nested types and exhibit significant type consistency errors. These findings suggest that future research should shift focus from improving type similarity to addressing repository-level consistency. TypyBench provides a foundation for this new direction, offering insights into model performance across different type complexities and usage contexts.
Paperid:222
Authors:Max Ruiz Luyten · Antonin Berthon · M van der Schaar
Abstract:
Real-world human decision-making often relies on strategic planning, where high-level goals guide the formulation of sub-goals and subsequent actions, as evidenced by domains such as healthcare, business, and urban policy. Despite notable successes in controlled settings, conventional reinforcement learning (RL) follows a bottom-up paradigm, which can struggle to adapt to real-world complexities such as sparse rewards and limited exploration budgets. While methods like hierarchical RL and environment shaping provide partial solutions, they frequently rely on either ad-hoc designs (e.g., choosing the set of high-level actions) or purely data-driven discovery of high-level actions that still requires significant exploration. In this paper, we introduce a Top-Down framework for RL that explicitly leverages human-like strategy to reduce sample complexity, guide exploration, and enable high-level decision-making. We first formalize the Strategy Problem, which frames policy generation as finding distributions over policies that balance specificity and value. Building on this definition, we propose the Strategist agent, an iterative framework that leverages large language models (LLMs) to synthesize domain knowledge into a structured representation of actionable strategies and sub-goals. We further develop a reward shaping methodology that translates these strategies expressed in natural language into quantitative feedback for RL methods. Empirically, we demonstrate a significantly faster convergence than conventional PPO. Taken together, our findings highlight that top-down strategic exploration opens new avenues for enhancing RL on real-world decision problems.
Paperid:223
Authors:Jerry Yao-Chieh Hu · Bo-Yu Chen · Dennis Wu · Feng Ruan · Han Liu
Abstract:
We present a nonparametric interpretation for deep learning compatible modern Hopfield models and utilize this new perspective to debut efficient variants. Our key contribution stems from interpreting the memory storage and retrieval processes in modern Hopfield models as a nonparametric regression problem subject to a set of query-memory pairs. Interestingly, our framework not only recovers the known results from the original dense modern Hopfield model but also fills the void in the literature regarding efficient modern Hopfield models, by introducing \textit{sparse-structured} modern Hopfield models with sub-quadratic complexity. We establish that this sparse model inherits the appealing theoretical properties of its dense analogue --- connection with transformer attention, fixed point convergence and exponential memory capacity. Additionally, we showcase the versatility of our framework by constructing a family of modern Hopfield models as extensions, including linear, random masked, top-$K$ and positive random feature modern Hopfield models. Empirically, we validate our framework in both synthetic and realistic settings for memory retrieval and learning tasks.
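The retrieval dynamics of a dense modern Hopfield model, and the top-$K$ restriction mentioned above, can be sketched as a softmax update over stored patterns. This is an illustrative sketch of the standard update rule, not the paper's exact sparse-structured construction:

```python
import numpy as np

def hopfield_retrieve(memories, query, beta=8.0, top_k=None, steps=3):
    """Modern Hopfield retrieval: x <- M^T softmax(beta * M x).

    memories: (N, d) stored patterns, query: (d,) probe vector.
    With top_k set, the softmax is restricted to the k largest scores,
    a sparse variant that only touches k of the N stored patterns.
    """
    x = query.copy()
    for _ in range(steps):
        scores = beta * memories @ x
        if top_k is not None:
            masked = np.full_like(scores, -np.inf)
            idx = np.argpartition(scores, -top_k)[-top_k:]
            masked[idx] = scores[idx]
            scores = masked
        p = np.exp(scores - scores.max())   # numerically stable softmax
        p /= p.sum()
        x = memories.T @ p                  # convex combination of memories
    return x
```

The same update is the content of the connection with transformer attention: one retrieval step is exactly a single-query attention read over the memory matrix.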
Paperid:224
Authors:Tao Feng · Yexin Wu · Guanyu Lin · Jiaxuan You
Abstract:
World models (WMs) demonstrate strong capabilities in prediction, generation, and planning tasks. Existing WMs primarily focus on unstructured data and cannot leverage the ubiquitous structured data, often represented as graphs, in the digital world. While multiple graph foundation models have been proposed, they focus on graph learning tasks and cannot extend to diverse multimodal data and interdisciplinary tasks. To address these challenges, we propose the Graph World Model (GWM), a world model that supports both unstructured and graph-structured states with multi-modal information and represents diverse tasks as actions. The core of a GWM is a generic message-passing algorithm to aggregate structured information, either over a unified multi-modal token space by converting multi-modal data into text (GWM-T) or a unified multi-modal embedding space by modality-specific encoders (GWM-E). Notably, GWM introduces action nodes to support diverse tasks, where action nodes are linked to other nodes via direct reference or similarity computation. Extensive experiments on 6 tasks from diverse domains, including multi-modal generation and matching, recommendation, graph prediction, multi-agent, retrieval-augmented generation, and planning and optimization, show that the same GWM outperforms or matches domain-specific baselines' performance, benefits from multi-hop structures, and demonstrates strong zero-shot/few-shot capabilities on unseen new tasks.
Paperid:225
Authors:Anle Ke · Xu Zhang · Tong Chen · Ming Lu · Chao Zhou · Jiawen Gu · Zhan Ma
Abstract:
Existing multimodal large model-based image compression frameworks often rely on a fragmented integration of semantic retrieval, latent compression, and generative models, resulting in suboptimal performance in both reconstruction fidelity and coding efficiency. To address these challenges, we propose a residual-guided ultra-low-rate image compression framework named ResULIC, which incorporates residual signals into both semantic retrieval and the diffusion-based generation process. Specifically, we introduce Semantic Residual Coding (SRC) to capture the semantic disparity between the original image and its compressed latent representation. A perceptual fidelity optimizer is further applied for superior reconstruction quality. Additionally, we present the Compression-aware Diffusion Model (CDM), which establishes an optimal alignment between bitrates and diffusion time steps, improving compression-reconstruction synergy. Extensive experiments demonstrate the effectiveness of ResULIC, achieving superior objective and subjective performance compared to state-of-the-art diffusion-based methods, with -80.7\% and -66.3\% BD-rate savings in terms of LPIPS and FID, respectively.
Paperid:226
Authors:Jiachen Guo · Xiaoyu Xie · Chanwook Park · Hantao Zhang · Matthew Politis · Gino Domel · Jiachen Guo
Abstract:
Deep learning has been extensively employed as a powerful function approximator for modeling physics-based problems described by partial differential equations (PDEs). Despite their popularity, standard deep learning models often demand prohibitively large computational resources and yield limited accuracy when scaling to large-scale, high-dimensional physical problems. Their black-box nature further hinders their application in industrial problems where interpretability and high precision are critical. To overcome these challenges, this paper introduces Interpolation Neural Network-Tensor Decomposition (INN-TD), a scalable and interpretable framework that has the merits of both machine learning and finite element methods for modeling large-scale physical systems. By integrating locally supported interpolation functions from finite element methods into the network architecture, INN-TD achieves a sparse learning structure with enhanced accuracy, faster training/solving speed, and reduced memory footprint. This makes it particularly effective for tackling large-scale high-dimensional parametric PDEs in training, solving, and inverse optimization tasks in physical problems where high precision is required.
Paperid:227
Authors:Longlin Yu · Jiajun Zha · Tong Yang · Tianyu Xie · Xiangyu Zhang · Gary Chan · Cheng Zhang
Abstract:
Semi-implicit distributions have shown great promise in variational inference and generative modeling. Hierarchical semi-implicit models, which stack multiple semi-implicit layers, enhance the expressiveness of semi-implicit distributions and can be used to accelerate diffusion models given pretrained score networks. However, their sequential training often suffers from slow convergence. In this paper, we introduce CoSIM, a continuous semi-implicit model that extends hierarchical semi-implicit models into a continuous framework. By incorporating a continuous transition kernel, CoSIM enables efficient, simulation-free training. Furthermore, we show that CoSIM achieves consistency with a carefully designed transition kernel, introducing a novel training scheme for consistency models. Extensive experiments on image generation demonstrate that CoSIM performs on par or better than existing diffusion model acceleration methods, achieving superior performance on FD-DINOv2.
Paperid:228
Authors:Samuel Miserendino · Michele Wang · Tejal Patwardhan · Johannes Heidecke
Abstract:
We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks — ranging from $50 bug fixes to $32,000 feature implementations — and managerial tasks, where models choose between technical implementation proposals. Independent tasks are graded with end-to-end tests triple-verified by experienced software engineers, while managerial decisions are assessed against the choices of the original hired engineering managers. We evaluate model performance and find that frontier models are still unable to solve the majority of tasks. To facilitate future research, we open-source a unified Docker image and a public evaluation split. By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.
Paperid:229
Authors:Marco Cipriano · Moritz Feuerpfeil · Gerard de Melo
Abstract:
Scalable Vector Graphics (SVG) is a popular format on the web and in the design industry. However, despite the great strides made in generative modeling, SVG has remained underexplored due to the discrete and complex nature of such data. We introduce GRIMOIRE, a text-guided SVG generative model that is comprised of two modules: a Visual Shape Quantizer (VSQ) learns to map raster images onto a discrete codebook by reconstructing them as vector shapes, and an Auto-Regressive Transformer (ART) models the joint probability distribution over shape tokens, positions, and textual descriptions, allowing us to generate vector graphics from natural language. Unlike existing models that require direct supervision from SVG data, GRIMOIRE learns shape image patches using only raster image supervision, which opens up vector generative modeling to significantly more data. We demonstrate the effectiveness of our method by fitting GRIMOIRE for closed filled shapes on the MNIST and Emoji datasets, and for outline strokes on icon and font data, surpassing previous image-supervised methods in generative quality and vector-supervised approaches in flexibility.
Paperid:230
Authors:Hai Ci · Yiren Song · Pei Yang · Jinheng Xie · Mike Zheng Shou
Abstract:
Watermarking is essential for protecting the copyright of AI-generated images. We propose WMAdapter, a diffusion model watermark plugin that embeds user-specified watermark information seamlessly during the diffusion generation process. Unlike previous methods that modify diffusion modules to incorporate watermarks, WMAdapter is designed to keep all diffusion components intact, resulting in sharp, artifact-free images. To achieve this, we introduce two key innovations: (1) We develop a contextual adapter that conditions on the content of the cover image to generate adaptive watermark embeddings. (2) We implement an additional finetuning step and a hybrid finetuning strategy that suppresses noticeable artifacts while preserving the integrity of the diffusion components. Empirical results show that WMAdapter provides strong flexibility, superior image quality, and competitive watermark robustness.
Paperid:231
Authors:Jing Xu · Jiazheng Li · Jingzhao Zhang
Abstract:
Model merging offers an effective way to integrate the capabilities of multiple fine-tuned models. However, the performance degradation of the merged model remains a challenge, particularly when little or no data is available. This paper first highlights the necessity of domain-specific data for model merging by proving that data-agnostic algorithms can have arbitrarily bad worst-case performance. Building on this theoretical insight, we explore the relationship between model merging and distillation, introducing a novel few-shot merging algorithm, ProDistill (Progressive Layer-wise Distillation). Contrary to the common belief that layer-wise training hurts performance, we show that layer-wise teacher-student distillation not only enhances scalability but also improves model merging performance. We conduct extensive experiments to show that, compared to existing few-shot merging methods, ProDistill achieves state-of-the-art performance, with up to 6.14\% and 6.61\% improvements in vision and NLU tasks. Furthermore, we extend the experiments to models with over 10B parameters, showcasing the exceptional scalability of ProDistill.
Paperid:232
Authors:Sreyan Ghosh · Zhifeng Kong · Sonal Kumar · S Sakshi · Jaehyeon Kim · Wei Ping · Rafael Valle · Dinesh Manocha · Bryan Catanzaro
Abstract:
Understanding and reasoning over non-speech sounds and music are crucial for both humans and AI agents to interact effectively with their environments. In this paper, we introduce Audio Flamingo 2 (AF2), an Audio-Language Model (ALM) with advanced audio understanding and reasoning capabilities. AF2 leverages (i) a custom CLAP model, (ii) synthetic AQA data for fine-grained audio reasoning, and (iii) a multi-stage curriculum learning strategy. AF2 achieves state-of-the-art performance with only a 3B parameter small language model, surpassing large open-source and proprietary models across 20+ benchmarks. Next, for the first time, we extend audio understanding to long audio segments (30 secs - 5 mins) and propose LongAudio, a large and novel dataset for training ALMs on long audio captioning and question-answering tasks. Fine-tuning AF2 on LongAudio leads to exceptional performance on our proposed LongAudioBench, an expert-annotated benchmark for evaluating ALMs on long audio understanding capabilities. We conduct extensive ablation studies to confirm the efficacy of our approach. All code and data will be open-sourced. Demo: https://sites.google.com/view/audioflamingo2
Paperid:233
Authors:Yeyun Chen
Abstract:
Predicting the solvation free energy of molecules using graph neural networks holds significant potential for advancing drug discovery and the design of novel materials. While previous models have demonstrated success on independent and identically distributed (IID) datasets, their performance in out-of-distribution (OOD) scenarios remains largely unexplored. We propose a novel Relational Invariant Learning framework (RILOOD) to enhance OOD generalization in solvation free energy prediction. RILOOD comprises three key components: (i) a mixup-based conditional modeling module that integrates diverse environments, (ii) a novel multi-granularity refinement strategy that extends beyond core substructures to enable context-aware representation learning for capturing multi-level interactions, and (iii) an invariant learning module that identifies robust patterns generalizable to unseen environments. Extensive experiments demonstrate that RILOOD significantly outperforms state-of-the-art methods across various distribution shifts, highlighting its effectiveness in improving solvation free energy prediction under diverse conditions.
Paperid:234
Authors:Quoc Tran-Dinh
Abstract:
We develop two novel stochastic variance-reduction methods to approximate solutions of a class of nonmonotone [generalized] equations. Our algorithms leverage a new combination of ideas from the forward-reflected-backward splitting method and a class of unbiased variance-reduced estimators. We construct two new stochastic estimators within this class, inspired by the well-known SVRG and SAGA estimators. These estimators significantly differ from existing approaches used in minimax and variational inequality problems. By appropriately choosing parameters, both algorithms achieve state-of-the-art oracle complexity of $\mathcal{O}(n + n^{2/3} \epsilon^{-2})$ for obtaining an $\epsilon$-solution in terms of the operator residual norm for a class of nonmonotone problems, where $n$ is the number of summands and $\epsilon$ signifies the desired accuracy. This complexity aligns with the best-known results in SVRG and SAGA methods for stochastic nonconvex optimization. We test our algorithms on two numerical examples and compare them with existing methods. The results demonstrate promising improvements offered by the new methods compared to their competitors.
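The SVRG-style variance-reduced estimator referenced above can be illustrated on a toy finite sum of operators. The scalar operators and all names below are purely illustrative, not the paper's construction for nonmonotone generalized equations:

```python
def svrg_estimate(ops, i, x, x_ref, full_ref):
    """SVRG-style variance-reduced estimate of the full operator
    (1/n) sum_j G_j(x): v = G_i(x) - G_i(x_ref) + (1/n) sum_j G_j(x_ref)."""
    return ops[i](x) - ops[i](x_ref) + full_ref

# Toy finite sum of scalar operators G_i(x) = a_i * x
a = [1.0, 2.0, 3.0]
ops = [lambda x, ai=ai: ai * x for ai in a]

x_ref = 2.0                                            # snapshot/reference point
full_ref = sum(op(x_ref) for op in ops) / len(ops)     # (1/n) sum_j G_j(x_ref)

x = 1.0
estimates = [svrg_estimate(ops, i, x, x_ref, full_ref) for i in range(len(ops))]
# Unbiasedness: averaging over the sampled index recovers the full operator at x.
mean_est = sum(estimates) / len(estimates)
```

Averaging the per-index estimates reproduces the full operator value exactly, which is the unbiasedness property such estimators share with SVRG.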
Paperid:235
Authors:Runxi Cheng · Feng Xiong · Yongxian Wei · Wanyun Zhu · Chun Yuan
Abstract:
Model merging seeks to integrate task-specific expert models into a unified architecture while preserving multi-task generalization capabilities, yet parameter interference between constituent models frequently induces performance degradation. Although prior work has explored many merging strategies, resolving interference without additional data for retraining or test-time computation remains challenging. In this paper, we theoretically demonstrate that the task vectors of the linear layer constitute an approximate linear subspace for its corresponding input. Therefore, we can minimize interference under the guidance of task vectors. Based on this insight, we propose \textbf{WUDI-Merging} (\textbf{W}hoever started the interference sho\textbf{U}ld en\textbf{D} \textbf{I}t), a simple yet effective model merging method that eliminates interference without any additional data or rescaling coefficients. Comprehensive empirical evaluations across vision and language benchmarks demonstrate our method's superiority, achieving state-of-the-art performance in data-free model merging scenarios (average 10.9\% improvement versus baseline methods) while even outperforming mainstream test-time adaptation approaches by 3.3\%, all while requiring only minimal computational resources. The code will be publicly available soon.
Paperid:236
Authors:Naoki Nishikawa · Yujin Song · Kazusato Oko · Denny Wu · Taiji Suzuki
Abstract:
Pretrained transformers have demonstrated the ability to implement various algorithms at inference time without parameter updates. While theoretical works have established this capability through constructions and approximation guarantees, the optimization and statistical efficiency aspects remain understudied. In this work, we investigate how transformers learn features in-context -- a key mechanism underlying their inference-time adaptivity. We focus on the in-context learning of single-index models $y=\sigma_*(\langle\boldsymbol x,\boldsymbol\beta\rangle)$, which are low-dimensional nonlinear functions parameterized by feature vector $\boldsymbol\beta$. We prove that transformers pretrained by gradient-based optimization can perform *inference-time feature learning*, i.e., extract information of $\boldsymbol\beta$ solely from test prompts (despite $\boldsymbol\beta$ varying across different prompts), hence achieving inference-time statistical efficiency that surpasses any non-adaptive (fixed-basis) algorithms such as kernel methods. Moreover, we show that the inference-time sample complexity surpasses the Correlational Statistical Query (CSQ) lower bound, owing to nonlinear label transformations naturally induced by the self-attention mechanism.
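A single-index task of the form $y=\sigma_*(\langle x,\beta\rangle)$ can be sampled in a few lines. The ReLU link used here is an illustrative choice of $\sigma_*$, not necessarily the paper's setting:

```python
import math
import random

def sample_single_index_task(d, n, seed=0):
    """Draw a task-specific index beta on the unit sphere, then n examples
    y = sigma_*(<x, beta>) with x ~ N(0, I_d). sigma_* is ReLU here,
    purely for illustration."""
    rng = random.Random(seed)
    # Random direction on the unit sphere in R^d
    beta = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(b * b for b in beta))
    beta = [b / norm for b in beta]
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z = sum(xi * bi for xi, bi in zip(x, beta))
        data.append((x, max(0.0, z)))  # ReLU link
    return beta, data

beta, prompt = sample_single_index_task(d=5, n=10)
```

Each prompt drawn this way has its own $\beta$, which is the setup in which the model must extract the index from the test prompt alone.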
Paperid:237
Authors:Jiujia Zhang · Ashok Cutkosky
Abstract:
This paper addresses online learning with ``corrupted'' feedback. Our learner is provided with potentially corrupted gradients $\tilde g_t$ instead of the ``true'' gradients $g_t$. We make no assumptions about how the corruptions arise: they could be the result of outliers, mislabeled data, or even malicious interference. We focus on the difficult ``unconstrained'' setting in which our algorithm must maintain low regret with respect to any comparison point $u \in \mathbb{R}^d$. The unconstrained setting is significantly more challenging, as existing algorithms suffer extremely high regret even with very tiny amounts of corruption (which is not the case for a bounded domain). Our algorithms guarantee regret $\|u\|G(\sqrt{T} + k)$ when $G \ge \max_t \|g_t\|$ is known, where $k$ is a measure of the total amount of corruption. When $G$ is unknown, we incur an extra additive penalty of $(\|u\|^2 + G^2)k$.
Paperid:238
Authors:Xudong Gong · Sen Yang · Feng Dawei · Kele Xu · Bo Ding · Huaimin Wang · Yong Dou
Abstract:
This paper addresses the challenge of discontinuity in goal-achievement capabilities observed in Goal-conditioned Reinforcement Learning (GCRL) algorithms. Through a theoretical analysis, we identify that the reuse of successful trajectories or policies during training can aid in achieving goals adjacent to already-achievable goals. However, the policy discrepancy between achievable and adjacent goals must be carefully managed to avoid both overly trivial and excessively large differences, which can respectively hinder policy performance. To tackle this issue, we propose a margin-based policy self-regularization approach that optimizes the policy discrepancies between adjacent desired goals to a minimal acceptable threshold. This method can be integrated into popular GCRL algorithms, such as GC-SAC, HER, and GC-PPO. Systematic evaluations across two robotic arm control tasks and a complex fixed-wing aircraft control task demonstrate that our approach significantly improves the continuity of goal-achievement abilities of GCRL algorithms, thereby enhancing their overall performance.
Paperid:239
Authors:Paolo Pellizzoni · Simone Moretti · Francesco Silvestri
Abstract:
The weighted Euclidean norm $\lVert x\rVert_w$ of a vector $x\in \mathbb{R}^d$ with weights $w\in \mathbb{R}^d$ is the Euclidean norm where the contribution of each dimension is scaled by a given weight. Approaches to dimensionality reduction that satisfy the Johnson–Lindenstrauss (JL) lemma can be easily adapted to the weighted Euclidean distance if weights are known and fixed: it suffices to scale each dimension of the input vectors according to the weights, and then apply any standard approach. However, this is not the case when weights are unknown during the dimensionality reduction or might dynamically change. In this paper, we address this issue by providing an approach that maps vectors into a smaller complex vector space and allows us to retrieve a JL-like estimate for the weighted Euclidean distance once weights are revealed. Our results are based on the decomposition of the complex dimensionality reduction into several Rademacher chaos random variables, which are studied using novel concentration inequalities for sums of independent Rademacher chaoses.
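The fixed-weights reduction mentioned above (scale each coordinate by $\sqrt{w_i}$, then treat the result as an ordinary Euclidean vector) can be sketched directly. This is a minimal illustration of the known-weights case only, not the paper's complex-valued construction for unknown weights:

```python
import math

def weighted_norm(x, w):
    # ||x||_w = sqrt(sum_i w_i * x_i^2), assuming nonnegative weights
    return math.sqrt(sum(wi * xi * xi for wi, xi in zip(w, x)))

def euclidean_norm(x):
    return math.sqrt(sum(xi * xi for xi in x))

x = [3.0, 4.0, 1.0]
w = [4.0, 1.0, 9.0]

# Scaling each coordinate by sqrt(w_i) turns the weighted norm into a
# plain Euclidean norm; any standard JL transform then applies to `scaled`.
scaled = [math.sqrt(wi) * xi for wi, xi in zip(w, x)]
```

The point of the paper is precisely that this trick is unavailable when $w$ is revealed only after the dimensionality reduction has been performed.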
Paperid:240
Authors:Nuoya Xiong · Aarti Singh
Abstract:
Reinforcement Learning with Human Feedback (RLHF) is a widely used fine-tuning approach that aligns machine learning models, particularly Language Models (LMs), with human preferences. There are typically multiple objectives driving the preference, hence humans find it easier to express per-objective comparisons rather than a global preference between two choices, e.g., comparing two papers on their novelty, clarity, correctness, etc. Multi-Objective RLHF aims to use per-objective preference feedback and achieve a Pareto optimal tradeoff among these objectives by aggregating them into a single unified objective for optimization. However, nearly all prior works rely on linear aggregation, which rules out policies that favor specific objectives such as the worst one. The only existing approach using non-linear aggregation is computationally expensive due to its reward-based nature and the need for retraining whenever the aggregation parameters change. In this work, we address this limitation by transforming the non-linear aggregation maximization problem into a series of sub-problems. Each sub-problem involves only linear aggregation, making it computationally efficient to solve. We further extend our framework to handle multi-group scenarios, where each group has distinct weights for the objectives. Our method enables achieving consensus or maximizing the aggregated objective across all groups. Theoretically, we demonstrate that our algorithmic framework achieves sublinear regret and can be easily adapted to a reward-free algorithm. Empirically, leveraging our theoretical insights, we propose a nearly training-free algorithm once the optimal policies for individual objectives are obtained.
Paperid:241
Authors:Giulio Starace · Oliver Jaffe · Dane Sherburn · James Aung · Jun Shern Chan · Leon Maksin · Rachel Dias · Evan Mays · Benjamin Kinsella · Wyatt Thompson · Johannes Heidecke · Amelia Glaese · Tejal Patwardhan
Abstract:
We introduce PaperBench, a benchmark for evaluating AI agents' ability to replicate state-of-the-art machine learning research. We curate 18 Spotlight and Oral papers from ICML 2024 to build a dataset of papers representing the current frontier of machine learning (ML) research. Replicating these papers requires real-world ML engineering skills such as understanding the literature, preparing datasets, training models, and running experiments. We manually create rubrics with the author(s) of each paper, which hierarchically decompose each replication task into smaller sub-tasks along with clear grading criteria. We develop an LLM-based judge to grade replication attempts against rubrics, and assess the judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing agent -- Claude Sonnet 3.5 with open-source scaffolding -- achieves a replication score of 14.1\%. Finally, we recruit ML PhDs to attempt a 5-paper subset of PaperBench, finding that human participants achieve 22.0\%, compared to Claude 3.5 Sonnet which scores 11.0\% on this subset. We open-source our code to facilitate future research in understanding the ML engineering capabilities of AI agents.
Paperid:242
Authors:Deqian Kong · Minglu Zhao · Dehong Xu · Bo Pang · Shu Wang · Edouardo Honig · Zhangzhang Si · Chuan Li · Jianwen Xie · Sirui Xie · Ying Nian Wu
Abstract:
We propose a novel family of language models, Latent Thought Language Models (LTMs), which incorporate explicit latent thought vectors that guide the autoregressive generation of ground tokens through a Transformer decoder. Training employs a dual-rate optimization process within the classical variational Bayes framework: fast learning of local variational parameters for the posterior distribution of latent vectors, and slow learning of global decoder parameters. Empirical studies reveal that LTMs possess additional scaling dimensions beyond traditional LLMs, yielding a structured design space. Higher sample efficiency can be achieved by increasing training compute per token, with further gains possible by trading model size for more inference steps. Designed based on these scaling properties, LTMs demonstrate superior sample and parameter efficiency compared to conventional autoregressive models and discrete diffusion models. They significantly outperform these counterparts in validation perplexity and zero-shot language modeling. Additionally, LTMs exhibit emergent few-shot in-context reasoning capabilities that scale with model and latent size, and achieve competitive performance in unconditional and conditional text generation.
Paperid:243
Authors:Viacheslav Meshchaninov · Pavel Strashnov · Andrey Shevtsov · Fedor Nikolaev · Nikita Ivanisenko · Olga Kardymon · Dmitry Vetrov
Abstract:
Protein sequence design has seen significant advances through discrete diffusion and autoregressive approaches, yet the potential of continuous diffusion remains underexplored. Here, we present DiMA, a latent diffusion framework that operates on protein language model representations. Through systematic exploration of architectural choices and diffusion components, we develop a robust methodology that generalizes across multiple protein encoders ranging from 8M to 3B parameters. We demonstrate that our framework achieves consistently high performance across sequence-only (ESM-2, ESMc), dual-decodable (CHEAP), and multimodal (SaProt) representations using the same architecture and training approach. We conduct extensive evaluation of existing methods alongside DiMA using multiple metrics across two protein modalities, covering quality, diversity, novelty, and distribution matching of generated proteins. DiMA consistently produces novel, high-quality and diverse protein sequences and achieves strong results compared to baselines such as autoregressive, discrete diffusion and flow matching language models. The model demonstrates versatile functionality, supporting conditional generation tasks including protein family generation, motif scaffolding and infilling, and fold-specific sequence design, despite being trained solely on sequence data. This work provides a universal continuous diffusion framework for protein sequence generation, offering both architectural insights and practical applicability across various protein design scenarios.
Paperid:244
Authors:Yushu Shi · Chao Huang · Jie Wen · Wei Wang · Yong Xu · Xiaochun Cao
Abstract:
With advancements in visual language models (VLMs) and large language models (LLMs), video anomaly detection (VAD) has progressed beyond binary classification to fine-grained categorization and multidimensional analysis. However, existing methods focus mainly on coarse-grained detection, lacking anomaly explanations. To address these challenges, we propose Ex-VAD, an Explainable Fine-grained Video Anomaly Detection approach that combines fine-grained classification with detailed explanations of anomalies. First, we use a VLM to extract frame-level captions, and an LLM converts them to video-level explanations, enhancing the model's explainability. Second, integrating textual explanations of anomalies with visual information greatly enhances the model's anomaly detection capability. Finally, we apply label-enhanced alignment to optimize feature fusion, enabling precise fine-grained detection. Extensive experiments on the UCF-Crime and XD-Violence datasets demonstrate that Ex-VAD significantly outperforms existing state-of-the-art methods and enhances model transparency with detailed explanatory texts.
Paperid:245
Authors:Hao Yu · Weixuan Liang · KE LIANG · Suyuan Liu · Meng Liu · Xinwang Liu
Abstract:
Multi-kernel clustering (MKC) has emerged as a powerful method for capturing diverse data patterns, offering robust and generalized representations of data structures. However, the increasing deployment of MKC in real-world applications raises concerns about its vulnerability to adversarial perturbations. While adversarial robustness has been extensively studied in other domains, its impact on MKC remains largely unexplored. In this paper, we address the challenge of assessing the adversarial robustness of MKC methods in a black-box setting. Specifically, we propose AdvMKC, a novel reinforcement-learning-based adversarial attack framework designed to inject imperceptible perturbations into data and mislead MKC methods. AdvMKC leverages proximal policy optimization with an advantage function to overcome the instability of clustering results during optimization. Additionally, it introduces a generator-clusterer framework, where a generator produces adversarial perturbations, and a clusterer approximates MKC behavior, significantly reducing computational overhead. We provide theoretical insights into the impact of adversarial perturbations on MKC and validate these findings through experiments. Evaluations across seven datasets and eleven MKC methods (seven traditional and four robust) demonstrate AdvMKC's effectiveness, robustness, and transferability.
Paperid:246
Authors:Muleilan Pei · Shaoshuai Shi · Lu Zhang · Peiliang Li · Shaojie Shen
Abstract:
Trajectory prediction for surrounding agents is a challenging task in autonomous driving due to its inherent uncertainty and underlying multimodality. Unlike prevailing data-driven methods that primarily rely on supervised learning, in this paper, we introduce a novel Graph-oriented Inverse Reinforcement Learning (GoIRL) framework, which is an IRL-based predictor equipped with vectorized context representations. We develop a feature adaptor to effectively aggregate lane-graph features into grid space, enabling seamless integration with the maximum entropy IRL paradigm to infer the reward distribution and obtain the policy that can be sampled to induce multiple plausible plans. Furthermore, conditioned on the sampled plans, we implement a hierarchical parameterized trajectory generator with a refinement module to enhance prediction accuracy and a probability fusion strategy to boost prediction confidence. Extensive experimental results showcase that our approach not only achieves state-of-the-art performance on the large-scale Argoverse & nuScenes motion forecasting benchmarks but also exhibits superior generalization abilities compared to existing supervised models.
Paperid:247
Authors:Pierre C Bellec · Takuya Koriyama
Abstract:
This paper studies phase transitions for the existence of unregularized M-estimators under proportional asymptotics, where the sample size $n$ and feature dimension $p$ grow proportionally with $n/p \to \delta \in (1, \infty)$. We study the existence of M-estimators in single-index models where the response $y_i$ depends on covariates $x_i \sim N(0, I_p)$ through an unknown index ${w} \in \mathbb{R}^p$ and an unknown link function. An explicit expression is derived for the critical threshold $\delta_\infty$ that determines the phase transition for the existence of the M-estimator, generalizing the results of Candès & Sur (2020) for binary logistic regression to other single-index models. Furthermore, we investigate the existence of a solution to the nonlinear system of equations governing the asymptotic behavior of the M-estimator when it exists. The existence of a solution to this system for $\delta > \delta_\infty$ remains largely unproven outside the global null in binary logistic regression. We address this gap with a proof that the system admits a solution if and only if $\delta > \delta_\infty$, providing a comprehensive theoretical foundation for proportional asymptotic results that require as a prerequisite the existence of a solution to the system.
Paperid:248
Authors:Gábor Pituk · Vik Shirvaikar · Tom Rainforth
Abstract:
We empirically investigate how well popular approximate inference algorithms for Bayesian Neural Networks (BNNs) respect the theoretical properties of Bayesian belief updating. We find strong evidence on synthetic regression and real-world image classification tasks that common BNN algorithms such as variational inference, Laplace approximation, SWAG, and SGLD fail to update in a consistent manner, forget about old data under sequential updates, and violate the predictive coherence properties that would be expected of Bayesian methods. These observed behaviors imply that care should be taken when treating BNNs as true Bayesian models, particularly when using them beyond static prediction settings, such as for active, continual, or transfer learning.
Paperid:249
Authors:Zehan Wang · Ziang Zhang · Tianyu Pang · Chao Du · Hengshuang Zhao · Zhou Zhao
Abstract:
Orientation is a fundamental attribute of objects, essential for understanding their spatial pose and arrangement. However, practical solutions for estimating the orientation of open-world objects in monocular images remain underexplored. In this work, we introduce Orient Anything, the first foundation model for zero-shot object orientation estimation. A key challenge in this task is the scarcity of orientation annotations for open-world objects. To address this, we propose leveraging the vast resources of 3D models. By developing a pipeline to annotate the front face of 3D objects and render them from random viewpoints, we curate 2 million images with precise orientation annotations across a wide variety of object categories. To fully leverage the dataset, we design a robust training objective that models the 3D orientation as probability distributions over three angles and predicts the object orientation by fitting these distributions. Besides, we propose several strategies to further enhance the synthetic-to-real transfer. Our model achieves state-of-the-art orientation estimation accuracy on both rendered and real images, demonstrating impressive zero-shot capabilities across various scenarios. Furthermore, it shows great potential in enhancing high-level applications, such as understanding complex spatial concepts in images and adjusting 3D object pose.
Paperid:250
Authors:Chi Zhang · REN Lianhai · Jingpu Cheng · Qianxiao Li
Abstract:
The LoRA method has achieved notable success in reducing GPU memory usage by applying low-rank updates to weight matrices. Yet, one simple question remains: can we push this reduction even further? Furthermore, is it possible to achieve this while improving performance and reducing computation time? Answering these questions requires moving beyond the conventional weight-centric approach. In this paper, we present a state-based fine-tuning framework that shifts the focus from weight adaptation to optimizing forward states, with LoRA acting as a special example. Specifically, state-based tuning introduces parameterized perturbations to the states within the computational graph, allowing us to control states across an entire residual block. A key advantage of this approach is the potential to avoid storing large intermediate states in models like transformers. Empirical results across multiple architectures—including ViT, RoBERTa, LLaMA2-7B, and LLaMA3-8B—show that our method further reduces memory consumption and computation time while simultaneously improving performance. Moreover, as a result of memory reduction, we explore the feasibility of training 7B/8B models on consumer-level GPUs like the Nvidia 3090, without model quantization. The code is available at an anonymous GitHub repository.
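The contrast between weight-centric (LoRA-style) updates and perturbing forward states directly can be sketched for a single linear layer. All names, shapes, and the toy adapter below are illustrative assumptions, not the paper's implementation:

```python
def linear(W, x):
    # y = W x for a weight matrix given as a list of rows
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def lora_forward(W, A, B, x):
    # Weight-centric: y = (W + B A) x, realized as Wx + B(Ax)
    h = linear(W, x)
    z = linear(A, x)       # low-rank down-projection
    delta = linear(B, z)   # up-projection back to the output dimension
    return [hi + di for hi, di in zip(h, delta)]

def state_forward(W, adapter, x):
    # State-centric: the frozen layer runs unchanged, and a small
    # parameterized perturbation is applied to its output state.
    h = linear(W, x)
    delta = adapter(h)
    return [hi + di for hi, di in zip(h, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (identity, for illustration)
A = [[1.0, 1.0]]               # rank-1 down-projection
B = [[0.5], [0.5]]             # rank-1 up-projection
x = [1.0, 2.0]

y_lora = lora_forward(W, A, B, x)
y_state = state_forward(W, lambda h: [0.1 * hi for hi in h], x)
```

In the state-centric form the perturbation depends only on the layer's output state, which hints at why such a scheme can avoid storing some intermediate activations that weight-centric updates require.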
Paperid:251
Authors:Niv Buchbinder · Roie Levin · Yue Yang
Abstract:
In *fully-dynamic consistent clustering*, we are given a finite metric space $(M,d)$, and a set $F\subseteq M$ of possible locations for opening centers. Data points arrive and depart, and the goal is to maintain an approximately optimal clustering solution at all times while minimizing the *recourse*, the total number of additions/deletions of centers over time. Specifically, we study fully dynamic versions of the classical $k$-center, facility location, and $k$-median problems. We design algorithms that, given a parameter $\beta\geq 1$, maintain an $O(\beta)$-approximate solution at all times, and whose total recourse is bounded by $O(\log |F| \log \Delta) \cdot OPT_{rec}^{\beta}$. Here $OPT_{rec}^{\beta}$ is the minimal recourse of an offline algorithm that maintains a $\beta$-approximate solution at all times, and $\Delta$ is the metric aspect ratio. We obtain our results via a reduction to the recently proposed *Positive Body Chasing* framework of [Bhattacharya, Buchbinder, Levin, Saranurak, FOCS 2023], which we show gives fractional solutions to our clustering problems online. Our contribution is to round these fractional solutions while preserving the approximation and recourse guarantees. We complement our positive results with logarithmic lower bounds which show that our bounds are nearly tight.
Paperid:252
Authors:Changsheng Wang · Yihua Zhang · jinghan jia · Parikshit Ram · Dennis Wei · Yuguang Yao · Soumyadeep Pal · Nathalie Baracaldo · Sijia Liu
Abstract:
Machine unlearning presents a promising approach to mitigating privacy and safety concerns in large language models (LLMs) by enabling the selective removal of targeted data or knowledge while preserving model utility. However, existing unlearning methods remain oversensitive to downstream fine-tuning, which can rapidly recover what is supposed to be unlearned information even when the fine-tuning task is entirely unrelated to the unlearning objective. To enhance robustness, we introduce the concept of `invariance' into unlearning for the first time from the perspective of invariant risk minimization (IRM), a principle for environment-agnostic training. By leveraging IRM, we develop a new invariance-regularized LLM unlearning framework, termed invariant LLM unlearning (ILU). We show that the proposed invariance regularization, even using only a single fine-tuning dataset during ILU training, can enable unlearning robustness to generalize effectively across diverse and new fine-tuning tasks at test time. A task vector analysis is also provided to further elucidate the rationale behind ILU's effectiveness. Extensive experiments on the WMDP benchmark, which focuses on removing an LLM's hazardous knowledge generation capabilities, reveal that ILU significantly outperforms state-of-the-art unlearning methods, including negative preference optimization (NPO) and representation misdirection for unlearning (RMU). Notably, ILU achieves superior unlearning robustness across diverse downstream fine-tuning scenarios (e.g., math, paraphrase detection, and sentiment analysis) while preserving the fine-tuning performance.
Paperid:253
Authors:Senyu Han · Hongchuan Zeng · Kai Yu · Lu Chen
Abstract:
Large language models (LLMs) consist of numerous Transformer modules, and while the models can perform various functions, it remains an open question how these modules are combined to elicit distinct inherent functionalities. In this paper, we investigate the modules inside LLMs and demonstrate that, by simply masking or retaining specific attention heads during inference, LLMs can exhibit specific task functionalities without requiring explicit instructions or modifications to the model parameters. Experiments across various models and tasks reveal that LLMs inherently encode ``functional pathways'', the structured groups of interdependent attention heads that are crucial for executing specific tasks. These pathways not only govern the model's functional behaviors but also enhance parameter efficiency, as suppressing attention heads outside the pathway can improve task performance. The code is available in this repository: Anonymous GitHub.
Paperid:254
Authors:Ke Fan · Jinpeng Zhang · Xuefeng Zhang · Yunze Wu · Jingyu Cao · Yuan Zhou · Jianzhu Ma
Abstract:
Motivated by the first priority of safety in many real-world applications, we propose \textsc{MaxSafe}, a chance-constrained bi-level optimization framework for safe reinforcement learning. \textsc{MaxSafe} first minimizes the unsafe probability and then maximizes the return among the safest policies. We provide a tailored Q-learning algorithm for the \textsc{MaxSafe} objective, featuring a novel learning process for \emph{optimal action masks} with theoretical convergence guarantees. To enable the application of our algorithm to large-scale experiments, we introduce two key techniques: \emph{safety polarization} and \emph{safety prioritized experience replay}. Safety polarization generalizes the optimal action masking by polarizing the Q-function, which assigns low values to unsafe state-action pairs, effectively discouraging their selection. In parallel, safety prioritized experience replay enhances the learning of optimal action masks by prioritizing samples based on temporal-difference (TD) errors derived from our proposed state-action reachability estimation functions. This approach efficiently addresses the challenges posed by sparse cost signals. Experiments on diverse autonomous driving and safe control tasks show that our methods achieve near-maximal safety and an optimal reward-safety trade-off.
Paperid:255
Authors:Yin Lu · Tong He · Xuening Zhu · David Wipf
Abstract:
Is there really much more to say about sparse autoencoders (SAEs)? Autoencoders in general, and SAEs in particular, represent deep architectures that are capable of modeling low-dimensional latent structure in data. Such structure could reflect, among other things, correlation patterns in large language model activations, or complex natural image manifolds. And yet despite the wide-ranging applicability, there have been relatively few changes to SAEs beyond the original recipe from decades ago, namely, standard deep encoder/decoder layers trained with a classical/deterministic sparse regularizer applied within the latent space. One possible exception is the variational autoencoder (VAE), which adopts a stochastic encoder module capable of producing sparse representations when applied to manifold data. In this work we formalize underappreciated weaknesses with both canonical SAEs, as well as analogous VAEs applied to similar tasks, and propose a hybrid alternative model that circumvents these prior limitations. In terms of theoretical support, we prove that global minima of our proposed model recover certain forms of structured data spread across a union of manifolds. Meanwhile, empirical evaluations on synthetic and real-world datasets substantiate the efficacy of our approach in accurately estimating underlying manifold dimensions and producing sparser latent representations without compromising reconstruction error. In all cases, we are able to exceed the performance of equivalent-capacity SAEs and VAEs, as well as recent diffusion models where applicable.
Paperid:256
Authors:Andrea Cini · Alexander Jenkins · Danilo Mandic · Cesare Alippi · Filippo Maria Bianchi
Abstract:
We address the problem of uncertainty quantification in time series forecasting by exploiting observations from correlated sequences. Relational deep learning methods leveraging graph representations are among the most effective tools for obtaining point estimates from spatiotemporal data and correlated time series. However, the problem of exploiting relational structures to estimate the uncertainty of such predictions has been largely overlooked in the same context. To this end, we propose a novel distribution-free approach based on the conformal prediction framework and quantile regression. Despite the recent applications of conformal prediction to sequential data, existing methods operate independently on each target time series and do not account for relationships among them when constructing the prediction interval. We fill this void by introducing a novel conformal prediction method based on graph deep learning operators. Our method, named Conformal Relational Prediction (CoRel), does not require the relational structure (graph) to be known a priori and can be applied on top of any pre-trained time series predictor. Additionally, CoRel includes an adaptive component to handle non-exchangeable data and changes in the input time series. Our approach provides accurate coverage and achieves state-of-the-art uncertainty quantification on relevant benchmarks.
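For background, the conformal prediction framework the abstract builds on turns point forecasts into calibrated intervals. A generic split-conformal sketch (a textbook baseline, not CoRel itself, and without its relational or adaptive components):

```python
import numpy as np

def split_conformal_interval(cal_resid, y_pred, alpha=0.1):
    """Split conformal prediction: given absolute residuals on a held-out
    calibration set, widen point forecasts y_pred into intervals with
    (1 - alpha) marginal coverage, assuming exchangeability."""
    n = len(cal_resid)
    # Finite-sample-corrected quantile level for the conformity scores.
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(cal_resid, q_level)
    return y_pred - q, y_pred + q

rng = np.random.default_rng(0)
cal_resid = np.abs(rng.normal(0, 1, size=500))   # toy calibration residuals
lo, hi = split_conformal_interval(cal_resid, np.array([2.0, -1.0]), alpha=0.1)
```

The interval half-width `q` is the same for every test point here; methods like CoRel instead share information across related series to tighten it.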
Paperid:257
Authors:Song Bian · Minghao Yan · Shivaram Venkataraman
Abstract:
Scaling laws are powerful tools to predict the performance of large language models. However, current scaling laws fall short of accounting for inference costs. In this work, we first show that model architecture affects inference latency, where models of the same size can have up to $3.5\times$ difference in latency. To tackle this challenge, we modify the Chinchilla scaling laws to co-optimize the model parameter count, the number of training tokens, and the model architecture. Because models with similar training loss can exhibit gaps in downstream evaluation, we also propose a novel method to train inference-efficient models based on the revised scaling laws. We perform extensive empirical studies to fit and evaluate our inference-aware scaling laws. We vary model parameters from 80M to 1B, training tokens from 1.6B to 30B, and model shapes, training a total of 63 models. Guided by our inference-efficient scaling law and model selection method, we release the Morph-1B model, which improves inference latency by $1.8\times$ while maintaining accuracy on downstream tasks compared to open-source models, pushing the Pareto frontier of accuracy-latency tradeoff.
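For context, the Chinchilla-style parametric law that the abstract modifies has the form $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$. The sketch below uses the coefficients from the original Chinchilla fit (Hoffmann et al., 2022) purely for illustration; the paper's revised, inference-aware law additionally models architecture and latency, which this sketch omits.

```python
# Chinchilla-style scaling law: L(N, D) = E + A / N**alpha + B / D**beta,
# where N is parameter count and D is training tokens. Coefficients are the
# original Chinchilla fit (Hoffmann et al., 2022), shown only for illustration.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def best_param_count(C, candidates):
    """Allocate a compute budget C ~ 6*N*D across candidate model sizes and
    return the size minimizing predicted loss (no inference-cost term here)."""
    return min(candidates, key=lambda N: loss(N, C / (6 * N)))

N_star = best_param_count(1e21, [1e8, 4e8, 1.6e9, 6.4e9, 2.56e10])
```

The paper's contribution is, in effect, to replace the unconstrained `min` above with one that also accounts for the latency of each candidate architecture.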
Paperid:258
Authors:Philippe Chlenski · Quentin Chu · Raiyan Khan · Kaizhu Du · Antonio Moretti · Itsik Pe'er
Abstract:
Decision trees (DTs) and their random forest (RF) extensions are workhorses of classification and regression in Euclidean spaces. However, algorithms for learning in non-Euclidean spaces are still limited. We extend DT and RF algorithms to product manifolds: Cartesian products of several hyperbolic, hyperspherical, or Euclidean components. Such manifolds handle heterogeneous curvature while still factorizing neatly into simpler components, making them compelling embedding spaces for complex datasets. Our novel angular reformulation of DTs respects the geometry of the product manifold, yielding splits that are geodesically convex, maximum-margin, and composable. In the special cases of single-component manifolds, our method simplifies to its Euclidean or hyperbolic counterparts, or introduces hyperspherical DT algorithms, depending on the curvature. We benchmark our method on various classification, regression, and link prediction tasks on synthetic data, graph embeddings, mixed-curvature variational autoencoder latent spaces, and empirical data. Compared to 7 other classifiers, product RFs ranked first on 25 out of 57 benchmarks, and placed in the top 2 for 46 out of 57. This highlights the value of product RFs as straightforward yet powerful new tools for data analysis in product manifolds.
Paperid:259
Authors:Yongqiang Chen · QUANMING YAO · Juzheng Zhang · James Cheng · Yatao Bian
Abstract:
Recent years have witnessed growing interest in extending the capabilities of large language models (LLMs) to molecular science. As LLMs are predominantly trained on text data, most existing approaches adopt a graph neural network to represent a molecule as a series of node tokens for molecule-language alignment, which, however, overlooks the inherent hierarchical structures in molecules. Notably, high-order molecular structures contain rich semantics of functional groups, which encode crucial biochemical functionalities of the molecules. We show that neglecting the hierarchical information in tokenization leads to subpar molecule-language alignment and severe hallucination. To address this limitation, we propose a novel strategy called HIerarchical GrapH Tokenization (HIGHT). HIGHT employs a hierarchical graph tokenizer that encodes the hierarchy of atom-, motif-, and molecular-level informative tokens to improve the molecular perception of LLMs. HIGHT also adopts an augmented instruction tuning dataset, enriched with the hierarchical graph information, to further enhance the molecule-language alignment. Extensive experiments on 14 real-world benchmarks verify the effectiveness of HIGHT in reducing hallucination by 40% and achieving significant improvements in various molecule-language downstream tasks.
Paperid:260
Authors:Anvith Thudi · Evianne Rovers · Yangjun Ruan · Tristan Thrush · Chris Maddison
Abstract:
Modern machine learning pipelines are increasingly combining and mixing data from diverse and disparate sources, e.g., pretraining large language models. Yet, finding the optimal data mixture is a challenging and open problem. We formalize this data mixing problem as a bi-level objective: the best mixture is the one that would lead to the best model for a downstream objective. Unfortunately, this objective is generally intractable. In this paper, we make the observation that the bi-level data mixing objective becomes convex as our model class becomes larger. We develop and study a gradient-based approach for optimizing this convex objective, which we call MixMin, and test it on language modeling and chemistry tasks. MixMin was the only method that uniformly improved the data mixture in all our experiments. With MixMin, we improved the data mixture using less than 0.2% additional compute for a pythia-$410M$ model trained on $8.2B$ tokens, resulting in a 1-5% relative improvement in negative log-likelihood on PIQA, ARC Easy, SciQ, and OpenWebMath. Crucially, we found that MixMin mixtures for smaller models improved training of larger models, suggesting that MixMin mixtures may be scale-invariant. When mixing bioassay data to train an XGBoost model, we saw improvements to average precision scores of $0.03-0.15$.
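The mixture weights in such a convex data-mixing objective live on the probability simplex. A generic exponentiated-gradient sketch for minimizing a convex objective over the simplex (a standard solver for this problem class, not MixMin's exact recipe, and with a toy quadratic objective standing in for the real one):

```python
import numpy as np

def exp_grad_simplex(grad_f, w0, lr=0.3, steps=500):
    """Exponentiated-gradient descent: the multiplicative update keeps the
    mixture weights w nonnegative and summing to one at every step."""
    w = w0.copy()
    for _ in range(steps):
        w = w * np.exp(-lr * grad_f(w))  # multiplicative (mirror-descent) step
        w /= w.sum()                     # renormalize onto the simplex
    return w

# Toy convex objective: squared distance to a known target mixture.
target = np.array([0.5, 0.3, 0.2])
grad = lambda w: 2.0 * (w - target)
w_opt = exp_grad_simplex(grad, np.ones(3) / 3)
```

In a real pipeline, `grad_f` would come from per-source model losses rather than a closed-form target.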
Paperid:261
Authors:Jixian Zhou · Fang DONG · Ruijun Huang · Hengjie Cao · Mengyi Chen · Yifeng Yang · Anrui Chen · Mingzhi Dong · Yujiang Wang · Dongsheng Li · David Clifton · Qin Lv · Rui Zhu · Chun Zhang · Fan Yang · Tun Lu · Ning Gu · Li Shang
Abstract:
Mixture-of-Experts (MoE) is widely adopted to deploy Large Language Models (LLMs) on edge devices with limited memory budgets. Although MoE is, in theory, an inherently memory-friendly architecture requiring only a few activated experts to reside in memory for inference, current MoE architectures cannot effectively realize this advantage and yield intolerable inference latencies on memory-constrained devices. Our investigation pinpoints the essential cause as the remarkable temporal inconsistency of inter-token expert activations, which generates overly frequent expert swapping demands that dominate the latencies. To this end, we propose a novel MoE architecture, Oracle-MoE, to fulfill the real on-device potential of MoE-based LLMs. Oracle-MoE routes tokens in a highly compact space suggested by attention scores, termed the oracle space, to effectively maintain semantic locality across consecutive tokens and reduce expert activation variations, eliminating massive swapping demands. Theoretical analysis proves that Oracle-MoE is bound to provide routing decisions with better semantic locality and, therefore, better expert activation consistency. Experiments on pretrained GPT-2 architectures of different sizes (200M, 350M, 790M, and 2B) and downstream tasks demonstrate that, without compromising task performance, Oracle-MoE achieves state-of-the-art inference speeds across varying memory budgets, revealing its substantial potential for LLM deployment in industry.
Paperid:262
Authors:Lele Fu · Bowen Deng · Sheng Huang · Tianchi Liao · Shirui Pan · Chuan Chen
Abstract:
Federated graph learning (FGL) aims to collaboratively train a global graph neural network (GNN) on multiple private graphs while preserving local data privacy. Besides the common data heterogeneity of conventional federated learning, FGL faces the unique challenge of topology heterogeneity. Most existing FGL methods alleviate the negative impact of heterogeneity by introducing global signals. However, the ways such global increments are created may be ineffective and significantly increase the computation cost. In light of this, we propose FedATH, an FGL method for Alleviating Topology Heterogeneity from a causal perspective. Inspired by causal theory, we argue that not all edges in a topology are necessary for the training objective; less topology information may in fact be more useful. With the aid of an edge evaluator, the local graphs are divided into causal and biased subgraphs. A dual-GNN architecture encodes the two subgraphs into corresponding representations, so that the causal representations are drawn closer to the training objective while the biased representations are pulled away from it. Further, the Hilbert-Schmidt Independence Criterion is employed to strengthen the separability of the two subgraphs. Extensive experiments on six real-world graph datasets demonstrate the superiority of the proposed FedATH over the compared approaches.
Paperid:263
Authors:Eric Zhao · Pranjal Awasthi · Sreenivas Gollapudi
Abstract:
Sampling-based search, a simple paradigm for utilizing test-time compute, involves generating multiple candidate responses and selecting the best one---typically by verifying each response for correctness. In this paper, we study the scaling trends governing sampling-based search. Among our findings is that simply scaling up a minimalist implementation that uses only random sampling and direct self-verification results in sustained performance improvements that, for example, elevate the Gemini v1.5 Pro model's reasoning capabilities past that of o1-Preview on popular benchmarks. We partially attribute the scalability of sampling-based search to a phenomenon of implicit scaling, where sampling a larger pool of responses in turn improves verification accuracy. We further identify two useful principles for improving self-verification capabilities with test-time compute: (1) comparing across responses provides helpful signals about the locations of errors and hallucinations, and (2) different model output styles are useful for different contexts---chains of thought are useful for reasoning but harder to verify. We also find that, though accurate verification can be elicited, frontier models demonstrate remarkably weak out-of-the-box verification capabilities and introduce a benchmark to measure progress on these deficiencies.
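The minimalist implementation the abstract describes reduces to "sample k candidates, score each with a verifier, return the best." A toy sketch with entirely hypothetical interfaces (integers standing in for model responses, a hand-written scorer standing in for self-verification):

```python
import random

def sample_and_verify(generate, verify, prompt, k=16, seed=0):
    """Minimal sampling-based search: draw k candidate responses and
    return the one the verifier scores highest (interfaces hypothetical)."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(k)]
    return max(candidates, key=verify)

# Toy stand-ins: "responses" are integers and the "verifier" prefers
# even answers, breaking ties by magnitude.
toy_generate = lambda prompt, rng: rng.randint(0, 100)
toy_verify = lambda r: (r % 2 == 0, r)
best = sample_and_verify(toy_generate, toy_verify, "what is 2+2?", k=64)
```

The implicit-scaling phenomenon in the abstract corresponds to `verify` itself becoming more discriminative as `k` grows.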
Paperid:264
Authors:Quentin Fruytier · Aryan Mokhtari · Sujay Sanghavi
Abstract:
Classical Mixtures of Experts (MoE) are Machine Learning models that involve partitioning the input space, with a separate "expert" model trained on each partition. Recently, MoE-based model architectures have become popular as a means to reduce training and inference costs. There, the partitioning function and the experts are both learnt jointly via gradient descent-type methods on the log-likelihood. In this paper we study theoretical guarantees of the Expectation Maximization (EM) algorithm for the training of MoE models. We first rigorously analyze EM for MoE where the conditional distribution of the target and latent variable conditioned on the feature variable belongs to an exponential family of distributions and show its equivalence to projected Mirror Descent with unit step size and a Kullback-Leibler Divergence regularizer. This perspective allows us to derive new convergence results and identify conditions for local linear convergence; in the special case of mixtures of $2$ linear or logistic experts, we additionally provide guarantees for linear convergence based on the signal-to-noise ratio. Experiments on synthetic and (small-scale) real-world data support that EM outperforms the gradient descent algorithm both in terms of convergence rate and the achieved accuracy.
Paperid:265
Authors:Seung Lee · Hojoon Kim · Yutack Park · Dawoon Jeong · Seungwu Han · Yeonhong Park · Jae W. Lee
Abstract:
Machine Learning Interatomic Potentials (MLIPs) enable efficient molecular dynamics (MD) simulations with high accuracy. While equivariant MLIPs achieve state-of-the-art accuracy, they face significant computational bottlenecks centered around their Tensor-Product layer, which accounts for up to 75\% of training time and causes substantial memory overhead. We present FlashTP, a highly optimized tensor-product library that addresses these inefficiencies through kernel fusion, sparse computation, and path-aggregated execution. FlashTP achieves up to 29.1$\times$ and 24.3$\times$ kernel speedups over e3nn and NVIDIA cuEquivariance, respectively. For SevenNet-l3i5, it delivers 4.2$\times$ and 3.5$\times$ speedup while reducing peak memory usage by 6.3$\times$ and 6.2$\times$ for inference and training, respectively.
Paperid:266
Authors:Aya Kayal · Sattar Vakili · Laura Toni · Da-shan Shiu · Alberto Bernacchia
Abstract:
Bayesian optimization (BO) with preference-based feedback has recently garnered significant attention due to its emerging applications. We refer to this problem as Bayesian Optimization from Human Feedback (BOHF), which differs from conventional BO by learning the best actions from a reduced feedback model, where only the preference between two actions is revealed to the learner at each time step. The objective is to identify the best action using a limited number of preference queries, typically obtained through costly human feedback. Existing work, which adopts the Bradley-Terry-Luce (BTL) feedback model, provides regret bounds for the performance of several algorithms. In this work, we develop tighter performance guarantees within the same framework. Specifically, we derive regret bounds of $\tilde{\mathcal{O}}(\sqrt{\Gamma(T)T})$, where $\Gamma(T)$ represents the maximum information gain—a kernel-specific complexity term—and $T$ is the number of queries. Our results significantly improve upon existing bounds. Notably, for common kernels, we show that the order-optimal sample complexities of conventional BO—achieved with richer feedback models—are recovered. In other words, the same number of preferential samples as scalar-valued samples is sufficient to find a nearly optimal solution.
Paperid:267
Authors:Jiajun Zhu · Peihao Wang · Ruisi Cai · Jason Lee · Pan Li · Zhangyang “Atlas” Wang
Abstract:
Transformers rely on both content-based and position-based addressing mechanisms to make predictions, but existing positional encoding techniques often diminish the effectiveness of position-based addressing. Many current methods enforce rigid patterns in attention maps, limiting the ability to model long-range dependencies and adapt to diverse tasks. Additionally, most positional encodings are learned as general biases, lacking the specialization required for different instances within a dataset. To address this, we propose conTextualized equivariAnt Position Encoding (TAPE), a novel framework that enhances positional embeddings by incorporating sequence content across layers. TAPE introduces dynamic, context-aware positional encodings, overcoming the constraints of traditional fixed patterns. By enforcing permutation and orthogonal equivariance, TAPE ensures the stability of positional encodings during updates, improving robustness and adaptability. Our method can be easily integrated into pre-trained transformers, offering parameter-efficient fine-tuning with minimal overhead. Extensive experiments show that TAPE achieves superior performance in language modeling, arithmetic reasoning, and long-context retrieval tasks compared to existing positional embedding techniques.
Paperid:268
Authors:Mohamed Ali Souibgui · Changkyu Choi · Andrey Barsky · Kangsoo Jung · Ernest Valveny · Dimosthenis Karatzas
Abstract:
We propose DocVXQA, a novel framework for visually self-explainable document question answering, where the goal is not only to produce accurate answers to questions but also to learn visual heatmaps that highlight critical regions, offering interpretable justifications for the model's decision. To integrate explanations into the learning process, we quantitatively formulate explainability principles as explicit learning criteria. Unlike conventional relevance map methods that solely emphasize regions relevant to the answer, our context-aware DocVXQA delivers explanations that are contextually sufficient yet representation-efficient. This fosters user trust while achieving a balance between predictive performance and interpretability in document visual question answering applications. Extensive experiments, including human evaluation, provide strong evidence supporting the effectiveness of our method.
Paperid:269
Authors:Chen Zhang · Weixin Bu · Zeyi Ren · Zhengwu Liu · Yik-Chung WU · Ngai Wong
Abstract:
Inferring properties of graph-structured data, e.g., the solubility of molecules, essentially involves learning the implicit mapping from graphs to their properties. This learning process is often costly for graph property learners like Graph Convolutional Networks (GCNs). To address this, we propose a paradigm called Graph Nonparametric Teaching (GraNT) that reinterprets the learning process through a novel nonparametric teaching perspective. Specifically, the latter offers a theoretical framework for teaching implicitly defined (i.e., nonparametric) mappings via example selection. Such an implicit mapping is realized by a dense set of graph-property pairs, with the GraNT teacher selecting a subset of them to promote faster convergence in GCN training. By analytically examining the impact of graph structure on parameter-based gradient descent during training, and recasting the evolution of GCNs—shaped by parameter updates—through functional gradient descent in nonparametric teaching, we show, for the first time, that teaching graph property learners (i.e., GCNs) is consistent with teaching structure-aware nonparametric learners. These findings enable GraNT to enhance the learning efficiency of graph property learners, showing significant reductions in training time for graph-level regression (-36.62\%), graph-level classification (-38.47\%), node-level regression (-30.97\%), and node-level classification (-47.30\%), all while maintaining generalization performance.
Paperid:270
Authors:Yuxuan Wang · Mingzhou Liu · Xinwei Sun · Wei Wang · Yizhou Wang
Abstract:
Determining the direction of relationships between variables is fundamental for understanding complex systems across scientific domains. While observational data can uncover relationships between variables, it cannot distinguish between cause and effect without experimental interventions. To effectively uncover causality, previous works have proposed intervention strategies that sequentially optimize the intervention values. However, most of these approaches primarily maximize information-theoretic gains that may not effectively measure the reliability of direction determination. In this paper, we formulate causal direction identification as a hypothesis-testing problem, and propose a Bayes factor-based intervention strategy, which can quantify the evidence strength of one hypothesis (e.g., causal) over the other (e.g., non-causal). To balance the immediate and future gains of testing strength, we propose a sequential intervention objective over intervention values in multiple steps. By analyzing the objective function, we develop a dynamic programming algorithm that reduces the complexity from non-polynomial to polynomial. Experimental results on bivariate systems, tree-structured graphs, and an embodied AI environment demonstrate the effectiveness of our framework in direction determination and its extensibility to both multivariate settings and real-world applications.
Paperid:271
Authors:Jonggeon Park · Giung Nam · Hyunsu Kim · Jongmin Yoon · Juho Lee
Abstract:
Neural network ensembles have proven effective in improving performance across a range of tasks; however, their high computational cost limits their applicability in resource-constrained environments or for large models. Ensemble distillation, the process of transferring knowledge from an ensemble teacher to a smaller student model, offers a promising solution to this challenge. The key is to ensure that the student model is both cost-efficient and achieves performance comparable to the ensemble teacher. With this in mind, we propose a novel ensemble distribution distillation method, which leverages flow matching to effectively transfer the diversity from the ensemble teacher to the student model. Our extensive experiments demonstrate the effectiveness of our proposed method compared to existing ensemble distillation approaches.
Paperid:272
Authors:Yu Sun · Xinhao Li · Karan Dalal · Jiarui Xu · Arjun Vikram · Genghan Zhang · Yann Dubois · Xinlei Chen · Xiaolong Wang · Sanmi Koyejo · Tatsunori Hashimoto · Carlos Guestrin
Abstract:
Self-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their hidden state. Inspired by prior work (Schlag et al., 2021), we present a practical framework to instantiate sequence modeling layers with linear complexity and expressive hidden states. The key idea is to make the hidden state a machine learning model itself, and the update rule a step of self-supervised learning. Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training (TTT) layers. We consider two instantiations: TTT-Linear and TTT-MLP, whose hidden state is a linear model and a two-layer MLP respectively. We evaluate our instantiations at the scale of 125M to 1.3B parameters, comparing with a strong Transformer and Mamba, a modern RNN. Both TTT-Linear and TTT-MLP match or exceed the baselines. Similar to Transformer, they can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context. With preliminary systems optimization, TTT-Linear is already faster than Transformer at 8k context and matches Mamba in wall-clock time. TTT-MLP still faces challenges in memory I/O, but shows larger potential in long context, pointing to a promising direction for future research.
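The key idea above can be sketched in a few lines: the layer's hidden state is itself a linear model `W`, and processing a token performs one self-supervised gradient step on `W`. The corruption and loss below are toy stand-ins (the paper learns its self-supervised views), so this is an illustration of the mechanism, not the paper's layer.

```python
import numpy as np

def ttt_linear(tokens, lr=0.1):
    """Sketch of a Test-Time Training layer: the hidden state is a linear
    model W, updated by one self-supervised gradient step per token."""
    d = tokens.shape[1]
    W = np.zeros((d, d))             # hidden state = weights of a linear model
    outputs = []
    for x in tokens:
        z = 0.5 * x                  # toy corrupted view of the token
        err = W @ z - x              # self-supervised reconstruction error
        W -= lr * np.outer(err, z)   # one gradient step = the state update
        outputs.append(W @ x)        # output rule: apply the updated model
    return np.stack(outputs)

rng = np.random.default_rng(0)
out = ttt_linear(rng.standard_normal((16, 4)))
```

Because the update is just training, it runs on test sequences too, which is what gives the layer its name.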
Paperid:273
Authors:Sheheryar Mehmood · Peter Ochs
Abstract:
Numerous optimization algorithms have a time-varying update rule thanks to, for instance, a changing step size, momentum parameter, or Hessian approximation. In this paper, we apply unrolled or automatic differentiation to a time-varying iterative process and provide convergence (rate) guarantees for the resulting derivative iterates. We adapt these convergence results and apply them to proximal gradient descent with variable step size and FISTA when solving partly-smooth problems. We test the convergence (rates) of these algorithms numerically through several experiments. Our theoretical and numerical results show that the convergence rate of the algorithm is reflected in its derivative iterates.
Paperid:274
Authors:Hyeong Kyu Choi · Maxim Khanov · Hongxin Wei · Sharon Li
Abstract:
Dataset contamination, where evaluation datasets overlap with pretraining corpora, inflates performance metrics and undermines the reliability of model evaluations. Quantifying dataset contamination thus becomes essential to ensure that performance evaluations genuinely reflect a model's ability to generalize to unseen data, rather than relying on memorized examples. To address this problem, we propose Kernel Divergence Score (KDS), a novel method that quantifies dataset contamination by computing the divergence between kernel similarity matrices of sample embeddings obtained before and after fine-tuning on the benchmark dataset. Leveraging the insight that fine-tuning affects unseen samples more significantly than seen ones, KDS provides a reliable measure of contamination. Through extensive experiments on controlled contamination scenarios, KDS demonstrates a near-perfect correlation with contamination levels and outperforms existing baselines. Additionally, we perform comprehensive ablation studies to analyze the impact of key design choices, providing deeper insights into the components and effectiveness of KDS. These ablations highlight the importance of leveraging fine-grained kernel-based information and confirm the reliability of the proposed framework across diverse datasets and settings.
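A rough sketch of the idea of a kernel-divergence-style score: build kernel similarity matrices over sample embeddings before and after fine-tuning and compare them. The RBF kernel, row-normalization, and row-wise KL below are illustrative choices, not necessarily the paper's exact divergence.

```python
import numpy as np

def kernel_divergence_score(emb_before, emb_after, gamma=0.1, eps=1e-12):
    """Compare RBF kernel similarity structure of embeddings before vs.
    after fine-tuning; a larger score means fine-tuning reshaped the
    sample-similarity geometry more (illustrative divergence choice)."""
    def row_stochastic_kernel(E):
        sq = ((E[:, None, :] - E[None, :, :]) ** 2).sum(-1)
        K = np.exp(-gamma * sq)
        return K / K.sum(axis=1, keepdims=True)  # rows become distributions
    P = row_stochastic_kernel(emb_before)
    Q = row_stochastic_kernel(emb_after)
    # Mean row-wise KL divergence between the two similarity structures.
    return float((P * np.log((P + eps) / (Q + eps))).sum(axis=1).mean())

rng = np.random.default_rng(0)
E0 = rng.standard_normal((32, 8))
score_unchanged = kernel_divergence_score(E0, E0)
score_shifted = kernel_divergence_score(E0, E0 + rng.normal(0, 0.5, E0.shape))
```

Identical embeddings give a score of zero; the more fine-tuning perturbs the embedding geometry, the larger the score.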
Paperid:275
Authors:Shibo Jie · Yehui Tang · Kai Han · Yitong Li · Duyu Tang · Zhi-Hong Deng · Yunhe Wang
Abstract:
Mixture-of-Experts (MoE) activates only a subset of experts during inference, allowing the model to maintain low inference FLOPs and latency even as the parameter count scales up. However, since MoE dynamically selects the experts, all the experts need to be loaded into VRAM. Their large parameter size still limits deployment, and offloading, which loads experts into VRAM only when needed, significantly increases inference latency. To address this, we propose Mixture of Lookup Experts (MoLoE), a new MoE architecture that is efficient in both communication and VRAM usage. In MoLoE, the experts are Feed-Forward Networks (FFNs) during training, taking the output of the embedding layer as input. Before inference, these experts can be re-parameterized as lookup tables (LUTs) that retrieve expert outputs based on input IDs, and offloaded to storage devices. Therefore, we do not need to perform expert computations during inference. Instead, we directly retrieve each expert's computation results based on input IDs and load them into VRAM, and thus the resulting communication overhead is negligible. Experiments show that, with the same FLOPs and VRAM usage, MoLoE achieves inference speeds comparable to dense models and significantly outperforms MoE with offloaded experts, while maintaining performance on par with MoE.
Paperid:276
Authors:Ren-Biao Liu · Anqi Li · ChaodingYang · Hui Sun · Ming Li
Abstract:
Large Language Models (LLMs) have demonstrated exceptional performance in code generation, becoming increasingly vital for software engineering and development. Recently, Chain-of-Thought (CoT) has proven effective for complex reasoning tasks by prompting LLMs to reason step-by-step before providing a final answer. However, research on how LLMs learn to reason with CoT data for code generation remains limited. In this work, we revisit classic CoT training, which typically places reasoning steps before the final answer. We synthesize a dataset that separates the CoT process from the code solution and then conduct extensive experiments to study empirically how CoT works in code generation. We observe counterintuitive findings, indicating that the classic training paradigm does not benefit code generation. Instead, training LLMs to generate code first and then generate the CoT explaining the reasoning steps is more effective. Specifically, our results indicate that a 9.9\% relative performance improvement can be achieved simply by swapping the order of CoT and code. Our findings offer meaningful insights into leveraging CoT to improve the reasoning ability of code LLMs and enhance code generation.
Paperid:277
Authors:Muhammed Ustaomeroglu · Guannan Qu
Abstract:
Self-attention has emerged as a core component of modern neural architectures, yet its theoretical underpinnings remain elusive. In this paper, we study self-attention through the lens of *interacting entities*, ranging from agents in multi-agent reinforcement learning to alleles in genetic sequences, and show that a single-layer linear self-attention can *efficiently* represent, learn, and generalize functions capturing pairwise interactions, including out-of-distribution scenarios. Our analysis reveals that self-attention acts as a *mutual interaction learner* under minimal assumptions on the diversity of interaction patterns observed during training, thereby encompassing a wide variety of real-world domains. In addition, we validate our theoretical insights through experiments demonstrating that self-attention, along with our proposed extensions, effectively learns interaction functions and generalizes across both population distributions and out-of-distribution scenarios. Extending self-attention, we introduce *HyperFeatureAttention*, a novel neural network module designed to learn *couplings of different feature-level interactions* between entities. Furthermore, we propose *HyperAttention*, a new mechanism that extends beyond pairwise interactions to capture multi-entity dependencies, such as three-way, four-way, or general $n$-way interactions.
Paperid:278
Authors:Ben Dai
Abstract:
Empirical risk minimization (ERM) with a computationally feasible surrogate loss is a widely accepted approach for classification. Notably, the convexity and calibration (CC) properties of a loss function ensure consistency of ERM in maximizing accuracy, thereby offering a wide range of options for surrogate losses. In this article, we propose a novel ensemble method, namely EnsLoss, which extends the ensemble learning concept to combine loss functions within the ERM framework. A key feature of our method is the consideration of preserving the ``legitimacy'' of the combined losses, i.e., ensuring the CC properties. Specifically, we first transform the CC conditions of losses into loss-derivatives, thereby bypassing the need for explicit loss functions and directly generating calibrated loss-derivatives. Therefore, inspired by Dropout, EnsLoss enables loss ensembles through one training process with doubly stochastic gradient descent (i.e., random batch samples and random calibrated loss-derivatives). We theoretically establish the statistical consistency of our approach and provide insights into its benefits. The numerical effectiveness of EnsLoss compared to fixed loss methods is demonstrated through experiments on a broad range of 45 pairs of CIFAR-10 datasets, the PCam image dataset, and 14 OpenML tabular datasets, with various deep learning architectures. Python repository and source code are available on anonymous GitHub at (https://anonymous.4open.science/r/ensLoss-53D1).
Paperid:279
Authors:Debangshu Banerjee · Tarun Suresh · Shubham Ugare · Sasa Misailovic · Gagandeep Singh
Abstract:
Code generation, symbolic math reasoning, and other tasks require LLMs to produce outputs that are both syntactically and semantically correct. Constrained LLM generation is a promising direction to enforce adherence to formal grammar, but prior works have empirically observed that strict enforcement of formal constraints often diminishes the reasoning capabilities of LLMs. In this work, we first provide a theoretical explanation for why constraining LLM outputs to very restrictive grammars that only allow syntactically valid final answers reduces the reasoning capabilities of the model. Second, we demonstrate that by augmenting the output grammar with carefully designed additional rules, it is always possible to preserve the reasoning capabilities of the LLM while ensuring syntactic and semantic correctness in its outputs. Building on these theoretical insights, we propose a reasoning-augmented constrained decoding algorithm, CRANE, which effectively balances the correctness of constrained generation with the flexibility of unconstrained generation. Experiments on multiple open-source LLMs and benchmarks show that CRANE significantly outperforms both state-of-the-art constrained decoding strategies and standard unconstrained decoding, showing up to a 9\% improvement over baselines on the challenging symbolic mathematical benchmarks GSM-symbolic and FOLIO.
Paperid:280
Authors:EMANUELE SANSONE · Tim Lebailly · Tinne Tuytelaars
Abstract:
We present a principled and simplified design of the projector and loss function for non-contrastive self-supervised learning based on hyperdimensional computing. We theoretically demonstrate that this design introduces an inductive bias that encourages representations to be simultaneously decorrelated and clustered, without explicitly enforcing these properties. This bias provably enhances generalization and suffices to avoid known training failure modes, such as representation, dimensional, cluster, and intra-cluster collapses. We validate our theoretical findings on image datasets, including SVHN, CIFAR-10, CIFAR-100, and ImageNet-100. Our approach effectively combines the strengths of feature decorrelation and cluster-based self-supervised learning methods, overcoming training failure modes while achieving strong generalization in clustering and linear classification tasks.
Paperid:281
Authors:Abhijeet Mulgund · Chirag Pabbaraju
Abstract:
The paradigm of weak-to-strong generalization constitutes the training of a strong AI model on data labeled by a weak AI model, with the goal that the strong model nevertheless outperforms its weak supervisor on the target task of interest. For the setting of real-valued regression with the squared loss, recent work quantitatively characterizes the gain in performance of the strong model over the weak model in terms of the misfit between the strong and weak model. We generalize such a characterization to learning tasks whose loss functions correspond to arbitrary Bregman divergences when the strong class is convex. This extends the misfit-based characterization of performance gain in weak-to-strong generalization to classification tasks, as the cross-entropy loss can be expressed in terms of a Bregman divergence. In most practical scenarios, however, the strong model class may not be convex. We therefore weaken this assumption and study weak-to-strong generalization for convex combinations of $k$ strong models in the strong class, in the concrete setting of classification. This allows us to obtain a similar misfit-based characterization of performance gain, up to an additional error term that vanishes as $k$ gets large. Our theoretical findings are supported by thorough experiments on synthetic as well as real-world datasets.
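Since the characterization above runs through Bregman divergences, a minimal numerical sketch (illustrative helper names, not the authors' code) of the definition $B_F(p, q) = F(p) - F(q) - \langle \nabla F(q), p - q \rangle$, showing how the squared loss and the KL divergence arise as the two special cases mentioned in the abstract:

```python
import numpy as np

def bregman(F, grad_F, p, q):
    # B_F(p, q) = F(p) - F(q) - <grad F(q), p - q>
    return F(p) - F(q) - np.dot(grad_F(q), p - q)

# Generator F(x) = ||x||^2 recovers the squared loss: B_F(p, q) = ||p - q||^2.
sq = lambda x: np.dot(x, x)
grad_sq = lambda x: 2 * x

# Negative entropy recovers KL(p || q) for probability vectors.
negent = lambda x: np.sum(x * np.log(x))
grad_negent = lambda x: np.log(x) + 1

p = np.array([0.2, 0.8])
q = np.array([0.5, 0.5])
print(bregman(sq, grad_sq, p, q))          # ||p - q||^2 = 0.18
print(bregman(negent, grad_negent, p, q))  # KL(p || q)
```

Verifying both identities on a small example makes it concrete why a single misfit-based bound can cover regression (squared loss) and classification (cross-entropy) at once.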
Paperid:282
Authors:Guoqiang Zhang · John Lewis · W. Bastiaan Kleijn
Abstract:
Various reversible deep neural network (DNN) models have been proposed to reduce memory consumption in the training process. However, almost all existing reversible DNNs either require special non-standard architectures or are constructed by modifying existing DNN architectures considerably to enable reversibility. In this work, we present the BDIA-transformer, which is an exact bit-level reversible transformer that uses an unchanged standard architecture for inference. The basic idea is to first treat each transformer block as the Euler integration approximation for solving an ordinary differential equation (ODE) and then incorporate the technique of bidirectional integration approximation (BDIA) (originally designed for diffusion inversion) into the neural architecture, together with activation quantization to make it exactly bit-level reversible. In the training process, we let a hyper-parameter $\gamma$ in BDIA-transformer randomly take one of the two values $\{0.5, -0.5\}$ per training sample per transformer block for averaging every two consecutive integration approximations. As a result, BDIA-transformer can be viewed as training an ensemble of ODE solvers parameterized by a set of binary random variables, which regularizes the model and results in improved validation accuracy. Lightweight side information per transformer block is required to be stored in the forward process to account for binary quantization loss to enable exact bit-level reversibility. In the inference procedure, the expectation $\mathbb{E}(\gamma)=0$ is taken to make the resulting architecture of BDIA-transformer identical to transformers up to activation quantization.
Our experiments in image classification, natural language generation, and language translation show that BDIA-transformers outperform their conventional counterparts significantly in terms of validation performance due to the regularization effect of the set of $\gamma$ random variables while also requiring considerably less training memory.
Paperid:283
Authors:Qiang Li · Michal Yemini · Hoi To Wai
Abstract:
This paper studies the convergence of clipped stochastic gradient descent (SGD) algorithms with decision-dependent data distribution. Our setting is motivated by privacy-preserving optimization algorithms that interact with performative data, where the prediction models can influence future outcomes. This challenging setting involves the non-smooth clipping operator and non-gradient dynamics due to distribution shifts. We make two contributions in pursuit of a performatively stable solution with these algorithms. First, we characterize the stochastic bias of the projected clipped SGD (PCSGD) algorithm, caused by the clipping operator, which prevents PCSGD from reaching a stable solution. When the loss function is strongly convex, we quantify lower and upper bounds for this stochastic bias and demonstrate a bias amplification phenomenon driven by the sensitivity of the data distribution. When the loss function is non-convex, we bound the magnitude of the stationarity bias. Second, we propose remedies to mitigate the bias, either by utilizing an optimal step-size design for PCSGD or by applying the recent DiceSGD algorithm [Zhang et al., 2024]. Our analysis also shows that the latter algorithm is free from stochastic bias in the performative setting. Numerical experiments verify our findings.
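The non-smooth operator at the heart of PCSGD can be sketched as a generic clip-then-project update (an illustrative box constraint and hypothetical function names; not the authors' implementation):

```python
import numpy as np

def pcsgd_step(theta, grad, step_size, clip_c, lo, hi):
    # Norm-clip the stochastic gradient to have norm at most clip_c.
    g_norm = np.linalg.norm(grad)
    g_clipped = grad * min(1.0, clip_c / g_norm) if g_norm > 0 else grad
    # Gradient step followed by projection onto the box [lo, hi]^d.
    return np.clip(theta - step_size * g_clipped, lo, hi)

theta = np.array([2.0, -3.0])
grad = np.array([30.0, 40.0])  # norm 50, clipped down to unit norm
theta_next = pcsgd_step(theta, grad, step_size=0.1, clip_c=1.0, lo=-2.5, hi=2.5)
print(theta_next)  # [1.94, -2.5]
```

The clipping factor `min(1, c/||g||)` is exactly the non-smoothness that induces the stochastic bias studied in the paper: whenever it is active, the update direction is no longer an unbiased gradient estimate.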
Paperid:284
Authors:Wenbo Pan · Zhichao Liu · Qiguang Chen · Xiangyang Zhou · Yu Haining · Xiaohua Jia
Abstract:
Large Language Models' safety-aligned behaviors, such as refusing harmful queries, can be represented by linear directions in activation space. Previous research modeled safety behavior with a single direction, limiting mechanistic understanding to an isolated safety feature. In this work, we discover that safety-aligned behavior is jointly controlled by multi-dimensional directions. Namely, we study the vector space of representation shifts during safety fine-tuning on Llama 3 8B for refusing jailbreaks. By studying orthogonal directions in the space, we first find that a dominant direction governs the model's refusal behavior, while multiple smaller directions represent distinct and interpretable features like hypothetical narrative and role-playing. We then measure how different directions promote or suppress the dominant direction, showing the important role of secondary directions in shaping the model's refusal representation. Finally, we demonstrate that removing certain trigger tokens in harmful queries can mitigate these directions to bypass the learned safety capability, providing new insights into safety-alignment vulnerability from a multi-dimensional perspective.
Paperid:285
Authors:GUOGUO AI · Guansong Pang · Hezhe Qiao · Yuan Gao · Hui Yan
Abstract:
Graph Transformers (GTs) have demonstrated remarkable performance in graph representation learning over popular graph neural networks (GNNs). However, self-attention, the core module of GTs, preserves only low-frequency signals in graph features, leading to ineffectiveness in capturing other important signals like high-frequency ones. Some recent GT models help alleviate this issue, but their flexibility and expressiveness are still limited since the filters they learn are fixed to a predefined graph spectrum or order. To tackle this challenge, we propose a Graph Fourier Kolmogorov-Arnold Transformer (GrokFormer), a novel GT model that learns highly expressive spectral filters with adaptive graph spectrum and order through a Fourier series modeling over learnable activation functions. We demonstrate theoretically and empirically that the proposed GrokFormer filter offers better expressiveness than other spectral methods. Comprehensive experiments on 10 real-world node classification datasets across various domains, scales, and graph properties, as well as 5 graph classification datasets, show that GrokFormer outperforms state-of-the-art GTs and GNNs. Our code is available at https://anonymous.4open.science/r/GrokFormer-1DCE.
Paperid:286
Authors:Feiyang Wang · Xingquan Zuo · Hai Huang · Gang Chen
Abstract:
A key challenge in black-box adversarial attacks is the high query complexity in hard-label settings, where only the top-1 predicted label from the target deep model is accessible. In this paper, we propose a novel normal-vector-based method called Two-third Bridge Attack (TtBA). An innovative bridge direction is introduced, defined as a weighted combination of the current unit perturbation direction and its unit normal vector, controlled by a weight parameter $k$. We further use binary search to identify $k=k_\text{bridge}$, which yields the same decision boundary as the current direction. Notably, we observe that $k=2/3 k_\text{bridge}$ yields a near-optimal perturbation direction, ensuring the stealthiness of the attack. In addition, we investigate the critical importance of local optima during the perturbation-direction optimization process and propose a simple and effective approach to detect and escape such local optima. Experimental results on MNIST, FASHION-MNIST, CIFAR10, CIFAR100, and ImageNet datasets demonstrate the strong performance and scalability of our approach. Compared to state-of-the-art non-targeted and targeted attack methods, TtBA consistently delivers superior performance across most of the evaluated datasets and deep learning models. Code is available at https://anonymous.4open.science/r/TtBA-6ECF.
Paperid:287
Authors:Yuefan Cao · Xiaoyu Li · Yingyiu Liang · Zhizhou Sha · Zhenmei Shi · Zhao Song · Jiahao Zhang
Abstract:
As AI research surges in both impact and volume, conferences have imposed submission limits to maintain paper quality and alleviate organizational pressure. In this work, we examine the fairness of desk-rejection systems under submission limits and reveal that existing practices can result in substantial inequities. Specifically, we formally define the paper submission limit problem and identify a critical dilemma: when the number of authors exceeds three, it becomes impossible to reject papers solely based on excessive submissions without negatively impacting innocent authors. Thus, this issue may unfairly affect early-career researchers, as their submissions may be penalized due to co-authors with significantly higher submission counts, while senior researchers with numerous papers face minimal consequences. To address this, we propose an optimization-based fairness-aware desk-rejection mechanism and formally define two fairness metrics: individual fairness and group fairness. We prove that optimizing individual fairness is NP-hard, whereas group fairness can be efficiently optimized via linear programming. Through case studies, we demonstrate that our proposed system ensures greater equity than existing methods, including those used in CVPR 2025, offering a more socially just approach to managing excessive submissions in AI conferences.
Paperid:288
Authors:Atefeh Gilani · Felipe Gomez · Shahab Asoodeh · Flavio Calmon · Oliver Kosut · Lalitha Sankar
Abstract:
We propose a unified optimization framework for designing continuous and discrete noise distributions that ensure differential privacy (DP) by minimizing R\'enyi DP, a variant of DP, under a cost constraint. R\'enyi DP has the advantage that, by considering different values of the R\'enyi parameter $\alpha$, we can tailor our optimization for any number of compositions. To solve the optimization problem, we reduce it to a finite-dimensional convex formulation and perform preconditioned gradient descent. The resulting noise distributions are then compared to their Gaussian and Laplace counterparts. Numerical results demonstrate that our optimized distributions are consistently better, with significant improvements in $(\varepsilon, \delta)$-DP guarantees in the moderate composition regimes, compared to Gaussian and Laplace distributions with the same variance.
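The objective being minimized is built on the R\'enyi divergence $D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log \sum_i p_i^\alpha q_i^{1-\alpha}$; a small sketch of evaluating it for discrete noise distributions, checked against the standard closed form for two shifted Gaussians (a generic sanity check, not the paper's optimizer):

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    # D_alpha(P || Q) = log( sum_i p_i^alpha * q_i^(1 - alpha) ) / (alpha - 1)
    return np.log(np.sum(p ** alpha * q ** (1.0 - alpha))) / (alpha - 1.0)

# Sanity check against the closed form for Gaussians:
# D_alpha( N(0, s^2) || N(d, s^2) ) = alpha * d^2 / (2 s^2).
x = np.arange(-20.0, 21.0, 1e-3)
p = np.exp(-x ** 2 / 2);        p /= p.sum()   # discretized N(0, 1)
q = np.exp(-(x - 1) ** 2 / 2);  q /= q.sum()   # discretized N(1, 1)
print(renyi_divergence(p, q, alpha=2.0))       # ~ 2 * 1 / (2 * 1) = 1.0
```

Sweeping $\alpha$ in such a routine is what lets the framework tailor the noise distribution to a target number of compositions.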
Paperid:289
Authors:Binchi Zhang · Zaiyi Zheng · Zhengzhang Chen · Jundong Li
Abstract:
Symmetry in the parameter space of deep neural networks (DNNs) has proven beneficial for various deep learning applications. A well-known example is the permutation symmetry in Multi-Layer Perceptrons (MLPs), where permuting the rows of weight matrices in one layer and applying the inverse permutation to adjacent layers yields a functionally equivalent model. While permutation symmetry fully characterizes the equivalence set for MLPs, its discrete nature limits its utility for transformers. In this paper, we introduce rotation symmetry, a novel form of parameter space symmetry for transformers that generalizes permutation symmetry by rotating parameter matrices in self-attention layers. Unlike permutation symmetry, rotation symmetry operates in a continuous domain, thereby significantly expanding the equivalence set for transformers. Based on this property, we propose a theoretically optimal parameter matching algorithm as a plug-and-play module to enhance model fusion. We evaluate our approach using pre-trained transformers across diverse natural language and vision tasks. Experimental results demonstrate that our rotation symmetry-based matching algorithm substantially improves model fusion, highlighting the potential of parameter space symmetry to facilitate model fusion. Our code is available at https://anonymous.4open.science/r/TransformerPermutation-8800.
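The rotation symmetry can be checked numerically on the attention logits: inserting an orthogonal matrix $R$ into both the query and key projections cancels out, since $(XW_QR)(XW_KR)^\top = XW_Q R R^\top W_K^\top X^\top = XW_Q W_K^\top X^\top$. A minimal sketch (generic matrices, not the authors' matching algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))     # token representations
Wq = rng.normal(size=(d, d))    # query projection
Wk = rng.normal(size=(d, d))    # key projection

# Random orthogonal R via QR decomposition of a Gaussian matrix.
R, _ = np.linalg.qr(rng.normal(size=(d, d)))

logits = (X @ Wq) @ (X @ Wk).T
logits_rot = (X @ Wq @ R) @ (X @ Wk @ R).T  # rotate both projections

print(np.allclose(logits, logits_rot))  # True: attention is unchanged
```

Because $R$ ranges over a continuous group rather than a finite set of permutations, the equivalence set this symmetry generates is much larger, which is what the matching algorithm exploits.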
Paperid:290
Authors:Alex Vitvitskyi · João Madeira Araujo · Marc Lackenby · Petar Veličković
Abstract:
As implied by the plethora of literature on graph rewiring, the choice of computational graph employed by a neural network can make a significant impact on its downstream performance. Certain effects related to the computational graph, such as under-reaching and over-squashing, may even render the model incapable of learning certain functions. Most of these effects have only been thoroughly studied in the domain of undirected graphs; however, recent years have seen a significant rise in interest in feedforward computational graphs: directed graphs without any back edges. In this paper, we study the desirable properties of a feedforward computational graph, discovering two important complementary measures: fidelity and mixing time, and evaluating a few popular choices of graphs through the lens of these measures. Our study is backed both by theoretical analyses of the metrics' asymptotic behaviour for various graphs and by correlating these metrics with the performance of trained neural network models using the corresponding graphs.
Paperid:291
Authors:Ang Cao · Sergio Arnaud · Oleksandr Maksymets · Jianing Yang · Ayush Jain · Ada Martin · Vincent-Pierre Berges · Paul McVay · Ruslan Partsey · Aravind Rajeswaran · Franziska Meier · Justin Johnson · Jeong Joon Park · Alexander Sax
Abstract:
Our approach to training 3D vision-language understanding models is to train a feedforward model that makes predictions in 3D but never requires 3D labels: it is supervised only in 2D, using 2D losses and differentiable rendering. The approach is new for vision-language understanding. By treating the reconstruction as a ``latent variable'', we can render the outputs without placing unnecessary constraints on the network architecture (e.g., it can be used with decoder-only models). Training requires only images, camera poses, and 2D labels; we show that the need for 2D labels can even be removed by using pseudo-labels from pretrained 2D models. We use this approach to pretrain a network and then finetune it for 3D vision-language understanding tasks. We show that this approach outperforms state-of-the-art baselines for 3D vision-language grounding, and also outperforms other 3D pretraining techniques.
Paperid:292
Authors:Ihab Bendidi · Yassir El Mesbahi · Alisandra Denton · Karush Suri · Kian Kenyon-Dean · Auguste Genovesio · Emmanuel Noutahi
Abstract:
Understanding cellular responses to stimuli is crucial for biological discovery and drug development. Transcriptomics provides interpretable, gene-level insights, while microscopy imaging offers rich predictive features but is harder to interpret. Weakly paired datasets, where samples share biological states, enable multimodal learning but are scarce, limiting their utility for training and multimodal inference. We propose a framework to enhance transcriptomics by distilling knowledge from microscopy images. Using weakly paired data, our method aligns and binds modalities, enriching gene expression representations with morphological information. To address data scarcity, we introduce (1) Semi-Clipped, an adaptation of CLIP for cross-modal distillation using pretrained foundation models, achieving state-of-the-art results, and (2) PEA (Perturbation Embedding Augmentation), a novel augmentation technique that enhances transcriptomics data while preserving inherent biological information. These strategies improve the predictive power and retain the interpretability of transcriptomics, enabling rich unimodal representations for complex biological tasks.
Paperid:293
Authors:Thomas Massena · Léo Andéol · Thibaut Boissin · Franck Mamalet · Corentin FRIEDRICH · Mathieu Serrurier · Sébastien Gerchinovitz
Abstract:
Conformal Prediction (CP) has proven to be an effective post-hoc method for improving the trustworthiness of neural networks by providing prediction sets with finite-sample guarantees. However, under adversarial attacks, classical conformal guarantees do not hold anymore: this problem is addressed in the field of Robust Conformal Prediction. Several methods have been proposed to provide robust CP sets with guarantees under adversarial perturbations, but, for large-scale problems, these sets are either too large or the methods are too computationally demanding to be deployed in real-life scenarios. In this work, we propose a new method that leverages Lipschitz-bounded networks to precisely and efficiently estimate robust CP sets. When combined with a 1-Lipschitz robust network, we demonstrate that our lip-rcp method outperforms state-of-the-art results in both the size of the robust CP sets and computational efficiency in medium and large-scale scenarios such as ImageNet. Taking a different angle, we also study vanilla CP under attack, and derive new worst-case coverage bounds of vanilla CP sets, which are valid simultaneously for all adversarial attack levels. Our lip-rcp method makes this second approach very efficient as well, as shown in the experiments.
Paperid:294
Authors:Zheng Gong · Ying Sun
Abstract:
Discrete Graph Diffusion Models (DGDMs) mark a pivotal advancement in graph generation, effectively preserving sparsity and structural integrity, thereby enhancing the learning of graph data distributions for diverse generative applications. Despite their potential, DGDMs are computationally intensive due to the numerous low-parameter yet high-computation operations, thereby increasing the need for inference acceleration. A promising solution to mitigate this issue is model quantization. However, existing quantization techniques for Image Diffusion Models (IDMs) face limitations in DGDMs due to differing diffusion processes, while Large Language Model (LLM) quantization focuses on reducing the memory access latency of loading large parameters; in DGDMs, by contrast, the inference bottleneck is computation, owing to smaller model sizes. To fill this gap, we introduce Bit-DGDM, a post-training quantization framework for DGDMs which incorporates two novel ideas: (i) sparse-dense activation quantization, which sparsely models activation outliers in full precision through adaptively selected, data-free thresholds and quantizes the rest to low-bit; and (ii) ill-conditioned low-rank decomposition, which decomposes the weights into a low-rank component that enables faster inference and an $\alpha$-sparsity matrix that models outliers. Extensive experiments demonstrate that Bit-DGDM not only reduces memory usage from the FP32 baseline by up to $2.8\times$ and achieves up to $2.5\times$ speedup, but also achieves performance comparable to ultra-low precision of up to 4-bit.
Paperid:295
Authors:Yufei Guo · Yuhan Zhang · Zhou Jie · Xiaode Liu · Xin Tong · Yuanpei Chen · Weihang Peng · Zhe Ma
Abstract:
The Spiking Neural Network (SNN), a biologically inspired neural network infrastructure, has garnered significant attention recently. SNNs utilize binary spike activations for efficient information transmission, replacing multiplications with additions, thereby enhancing energy efficiency. However, binary spike activation maps often fail to capture sufficient data information, resulting in reduced accuracy. To address this challenge, we advocate reversing the bit widths of the weight and activation, called \textbf{ReverB}, inspired by recent findings that highlight greater accuracy degradation from quantizing activations compared to weights. Specifically, our method employs real-valued spike activations alongside binary weights in SNNs. This preserves the event-driven and multiplication-free advantages of standard SNNs while enhancing the information capacity of activations. Additionally, we introduce a trainable factor within binary weights to adaptively learn suitable weight amplitudes during training, thereby increasing network capacity. To maintain efficiency akin to vanilla \textbf{ReverB}, our trainable binary-weight SNNs are converted back to standard form using a re-parameterization technique during inference. Extensive experiments across various network architectures and datasets, both static and dynamic, demonstrate that our approach consistently outperforms state-of-the-art methods.
Paperid:296
Authors:Harit Vishwakarma · Yi Chen · Satya Sai Srinath Namburi GNVV · Sui Jiet Tay · Ramya Vinayak · Frederic Sala
Abstract:
Modern semi-supervised learning (SSL) methods rely on pseudolabeling and consistency regularization. Pseudolabeling is typically performed by comparing the model's confidence scores to a predefined threshold. While several heuristics have been proposed to improve threshold selection, the underlying issues of overconfidence and miscalibration in confidence scores remain largely unaddressed, leading to inaccurate pseudolabels, degraded test accuracy, and prolonged training. We take a first-principles approach to learn confidence scores and thresholds with an explicit knob for error. This flexible framework addresses the fundamental question of optimal scores and threshold selection in pseudolabeling. Moreover, it gives practitioners a principled way to control the quality and quantity of pseudolabels. Such control is vital in SSL, where balancing pseudolabel quality and quantity directly affects model performance and training efficiency. Our experiments show that, by integrating this framework with modern SSL methods, we achieve significant improvements in accuracy and training efficiency. In addition, we provide novel insights on the trade-offs between the choices of the error parameter and the end model's performance.
Paperid:297
Authors:Ansgar Lößer · Max Schlecht · Florian Schintke · Joel Witzke · Matthias Weidlich · Björn Scheuermann
Abstract:
Segmented regression is a statistical method that approximates a function $f$ by a piecewise function $\hat{f}$ using noisy data samples. *Min-$\epsilon$* approaches aim to reduce the regression function's mean squared error (MSE) for a given number $k$ of segments. An optimal solution for *min-$\epsilon$* segmented regression is found in $\mathcal{O}(n^2)$ time (Bai & Perron, 1998; Yamamoto & Perron, 2013) for $n$ samples. For large datasets, current heuristics improve time complexity to $\mathcal{O}(n\log{n})$ (Acharya et al., 2016) but can result in large errors, especially when exactly $k$ segments are used. We present a method for *min-$\epsilon$* segmented regression that combines the scalability of top existing heuristic solutions with a statistical efficiency similar to the optimal solution. This is achieved by using a new method to merge an initial set of segments using matrices precomputed from the samples, allowing both merging and error calculation in constant time. Our approach, using the same samples and parameter $k$, produces segments with up to 1,000 times lower MSE compared to Acharya et al. (2016) in about 100 times less runtime on datasets with over $10^4$ samples.
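The constant-time merge-and-error claim rests on a standard idea: each segment carries precomputed sufficient statistics of a least-squares line fit, which are additive under merging. A sketch of that idea (illustrative code, not the authors' exact matrices):

```python
import numpy as np

def stats(x, y):
    # Sufficient statistics of a segment: (n, Sx, Sy, Sxx, Sxy, Syy).
    # Two segments merge by simple addition, i.e. in O(1) time.
    return np.array([len(x), x.sum(), y.sum(),
                     (x * x).sum(), (x * y).sum(), (y * y).sum()])

def sse(s):
    # Least-squares SSE of a line fit, from the statistics alone (O(1)).
    n, sx, sy, sxx, sxy, syy = s
    cxx = sxx - sx * sx / n
    cxy = sxy - sx * sy / n
    cyy = syy - sy * sy / n
    return cyy - (cxy * cxy / cxx if cxx > 0 else 0.0)

rng = np.random.default_rng(1)
x = np.arange(100.0)
y = 2.0 * x + 1.0 + rng.normal(size=100)

left, right = stats(x[:60], y[:60]), stats(x[60:], y[60:])
merged = left + right          # O(1) merge of two segments
direct = sse(stats(x, y))      # error computed from all samples at once
print(abs(sse(merged) - direct) < 1e-6)  # True
```

Merging candidate segments and scoring the resulting error thus costs $\mathcal{O}(1)$ per operation, which is what keeps the overall heuristic scalable.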
Paperid:298
Authors:Yi Yu · Song Xia · SIYUAN YANG · Chenqi KONG · Wenhan Yang · Shijian Lu · Yap-peng Tan · Alex Kot
Abstract:
Most existing unlearnable strategies focus on preventing unauthorized users from training single-task learning (STL) models with personal data. Nevertheless, the paradigm has recently shifted towards multi-task data and multi-task learning (MTL), targeting generalist and foundation models that can handle multiple tasks simultaneously. Despite their growing importance, MTL data and models have been largely neglected while pursuing unlearnable strategies. This paper presents MTL-UE, the first unified framework for generating unlearnable examples for multi-task data and MTL models. Instead of optimizing perturbations for each sample, we design a generator-based structure that introduces label priors and class-wise feature embeddings, which leads to much better attacking performance. In addition, MTL-UE incorporates intra-task and inter-task embedding regularization to increase inter-class separation and suppress intra-class variance, which greatly enhances the attack robustness. Furthermore, MTL-UE is versatile, with good support for dense prediction tasks in MTL. It is also plug-and-play, allowing existing surrogate-dependent unlearnable methods to be integrated with little adaptation. Extensive experiments show that MTL-UE achieves superior attacking performance consistently across 4 MTL datasets, 3 base UE methods, 5 model backbones, and 5 MTL task-weighting strategies.
Paperid:299
Authors:Linxi Zhao · Yihe Deng · Weitong Zhang · Quanquan Gu
Abstract:
The advancement of Large Vision-Language Models (LVLMs) has increasingly highlighted the critical issue of their tendency to hallucinate non-existing objects in the images. To address this issue, previous works focused on using specially curated datasets or powerful LLMs (e.g., GPT-3.5) to rectify the outputs of LVLMs. However, these approaches require either expensive training/fine-tuning or API access to advanced LLMs for post-generation correction. In response to these limitations, we propose Mitigating hallucinAtion via image-gRounded guIdaNcE (MARINE), a framework that is both training-free and API-free. MARINE effectively and efficiently reduces object hallucinations during inference by introducing image-grounded guidance to LVLMs. This is achieved by leveraging open-source vision models to extract object-level information, thereby enhancing the precision of LVLM-generated content. Our framework's flexibility further allows for the integration of multiple vision models, enabling more reliable and robust object-level guidance. Through comprehensive evaluations across popular LVLMs with diverse evaluation metrics and benchmarks, we demonstrate the effectiveness of MARINE, which even outperforms existing fine-tuning-based methods. Remarkably, it reduces hallucinations consistently in GPT-4V-assisted evaluation while maintaining the detailedness of LVLMs' generations.
Paperid:300
Authors:Laura Zheng · Wenjie Wei · Tony Wu · Jacob Clements · Shreelekha Revankar · Andre Harrison · Yu Shen · Ming Lin
Abstract:
Achieving robustness in image segmentation models is challenging due to the fine-grained nature of pixel-level classification. These models, which are crucial for many real-time perception applications, particularly struggle when faced with natural corruptions in the wild for autonomous systems. While sensitivity analysis can help us understand how input variables influence model outputs, its application to natural and uncontrollable corruptions in training data is computationally expensive. In this work, we present an adaptive, sensitivity-guided augmentation method to enhance robustness against natural corruptions. Our sensitivity analysis on average runs 10 times faster and requires about 200 times less storage than previous sensitivity analysis, enabling practical, on-the-fly estimation during training for a model-free augmentation policy. With minimal fine-tuning, our sensitivity-guided augmentation method achieves improved robustness on both real-world and synthetic datasets compared to state-of-the-art data augmentation techniques in image segmentation.
Paperid:301
Authors:Martino Bernasconi · Matteo Castiglioni · Andrea Celli
Abstract:
In the bandits with knapsacks framework (BwK) the learner has $m$ resource-consumption (i.e., packing) constraints. We focus on the generalization of BwK in which the learner has a set of general long-term constraints. The goal of the learner is to maximize their cumulative reward, while at the same time achieving small cumulative constraints violations. In this scenario, there exist simple instances where conventional methods for BwK fail to yield sublinear violations of constraints. We show that it is possible to circumvent this issue by requiring the primal and dual algorithm to be weakly adaptive. Indeed, even without any information on the Slater's parameter $\rho$ characterizing the problem, the interaction between weakly adaptive primal and dual regret minimizers leads to a ``self-bounding'' behavior of dual variables. In particular, their norm remains suitably upper bounded across the entire time horizon even without explicit projection steps. By exploiting this property, we provide best-of-both-worlds guarantees for stochastic and adversarial inputs. In the first case, we show that the algorithm guarantees sublinear regret. In the latter case, we establish a tight competitive ratio of $\rho/(1+\rho)$. In both settings, constraints violations are guaranteed to be sublinear in time. Finally, these results allow us to obtain new results for the problem of contextual bandits with linear constraints, providing the first no-$\alpha$-regret guarantees for adversarial contexts.
Paperid:302
Authors:XINMIN QIU · Gege Chen · Bonan Li · Congying Han · Tiande Guo · Zicheng Zhang
Abstract:
Blind face restoration (BFR), which involves converting low-quality (LQ) images into high-quality (HQ) images, remains challenging due to complex and unknown degradations. While previous diffusion-based methods utilize feature extractors from LQ images as guidance, using raw LQ images directly as the starting point for the reverse diffusion process offers a theoretically optimal solution. In this work, we propose Pseudo-Hashing Image-to-image Schrödinger Bridge (P-I2SB), a novel framework inspired by optimal mass transport problems, which enhances the restoration potential of Schrödinger Bridge (SB) by correcting data distributions and effectively learning the optimal transport path between any two data distributions. Notably, we theoretically explore and identify that existing methods are limited by the optimality and reversibility of solutions in SB, leading to suboptimal performance. Our approach involves preprocessing HQ images during training by hashing them into pseudo-samples according to a rule related to LQ images, ensuring structural similarity in distribution. This guarantees optimal and reversible solutions in SB, enabling the inference process to learn effectively and allowing P-I2SB to achieve state-of-the-art results in BFR, with more natural textures and retained inference speed compared to previous methods.
Paperid:303
Authors:Daiheng Gao · Shilin Lu · Wenbo Zhou · Jiaming Chu · Jie Zhang · Mengxi Jia · Bang Zhang · Zhaoxin Fan · Weiming Zhang
Abstract:
Removing unwanted concepts from large-scale text-to-image (T2I) diffusion models while maintaining their overall generative quality remains an open challenge. This difficulty is especially pronounced in emerging paradigms, such as Stable Diffusion (SD) v3 and Flux, which incorporate flow matching and transformer-based architectures. These advancements limit the transferability of existing concept-erasure techniques that were originally designed for the previous T2I paradigm (e.g., SD v1.4). In this work, we introduce EraseAnything, the first method specifically developed to address concept erasure within the latest flow-based T2I framework. We formulate concept erasure as a bi-level optimization problem, employing LoRA-based parameter tuning and an attention map regularizer to selectively suppress undesirable activations. Furthermore, we propose a self-contrastive learning strategy to ensure that removing unwanted concepts does not inadvertently harm performance on unrelated ones. Experimental results demonstrate that EraseAnything successfully fills the research gap left by earlier methods in this new T2I paradigm, achieving state-of-the-art performance across a wide range of concept erasure tasks.
Paperid:304
Authors:Maximilian Müller · Matthias Hein
Abstract:
Detecting out-of-distribution (OOD) examples is an important task for deploying reliable machine learning models in safety-critical applications. While post-hoc methods based on the Mahalanobis distance applied to pre-logit features are among the most effective for ImageNet-scale OOD detection, their performance varies significantly across models. We connect this inconsistency to strong variations in feature norms, indicating severe violations of the Gaussian assumption underlying the Mahalanobis distance estimation. We show that simple $\ell_2$-normalization of the features mitigates this problem effectively, aligning better with the premise of normally distributed data with a shared covariance matrix. Extensive experiments on 44 models across diverse architectures and pretraining schemes show that $\ell_2$-normalization improves the conventional Mahalanobis distance-based approaches significantly and consistently, and outperforms other recently proposed OOD detection methods.
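The fix described in this abstract is easy to prototype: fit class means and a shared (tied) covariance on pre-logit features, optionally after $\ell_2$-normalization, and score test points by their minimum squared Mahalanobis distance. The sketch below is illustrative (the function name and synthetic setup are ours, not the authors' code), assuming integer class labels 0..K-1:

```python
import numpy as np

def mahalanobis_ood_scores(train_feats, train_labels, test_feats, normalize=True):
    """OOD score: minimum squared Mahalanobis distance to any class mean,
    using a shared covariance matrix; lower means more in-distribution."""
    if normalize:  # the l2-normalization step advocated in the abstract
        train_feats = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
        test_feats = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    classes = np.unique(train_labels)
    means = np.stack([train_feats[train_labels == c].mean(axis=0) for c in classes])
    centered = train_feats - means[train_labels]          # assumes labels are 0..K-1
    prec = np.linalg.pinv(centered.T @ centered / len(train_feats))
    diff = test_feats[:, None, :] - means[None, :, :]     # shape (n_test, K, d)
    dists = np.einsum('nkd,de,nke->nk', diff, prec, diff)
    return dists.min(axis=1)
```

On toy 2D clusters, in-distribution points score well below points drawn far from every class mean.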
Paperid:305
Authors:Guanghui Wang · Zhiyong Yang · Zitai Wang · Shi Wang · Qianqian Xu · Qingming Huang
Abstract:
Knowledge Distillation (KD) transfers knowledge from a large teacher model to a smaller student model by minimizing the divergence between their output distributions, typically using forward Kullback-Leibler divergence (FKLD) or reverse KLD (RKLD). It has become an effective training paradigm due to the broader supervision information provided by the teacher distribution compared to one-hot labels. We identify that the core challenge in KD lies in balancing two mode-concentration effects: the \textbf{\textit{Hardness-Concentration}} effect, which refers to focusing on modes with large errors, and the \textbf{\textit{Confidence-Concentration}} effect, which refers to focusing on modes with high student confidence. Through an analysis of how probabilities are reassigned during gradient updates, we observe that these two effects are entangled in FKLD and RKLD, but in extreme forms. Specifically, both are too weak in FKLD, causing the student to fail to concentrate on the target class. In contrast, both are too strong in RKLD, causing the student to overly emphasize the target class while ignoring the broader distributional information from the teacher. To address this imbalance, we propose ABKD, a generic framework with $\alpha$-$\beta$-divergence. Our theoretical results show that ABKD offers a smooth interpolation between FKLD and RKLD, achieving a better trade-off between these effects. Extensive experiments on 17 language/vision datasets with 12 teacher-student settings confirm its efficacy.
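The two endpoint divergences that ABKD interpolates between are simple to write down. A minimal numpy sketch (ours, for illustration; the $\alpha$-$\beta$-divergence family itself is omitted):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fkld(p_teacher, q_student, eps=1e-12):
    """Forward KL(teacher || student): the standard distillation objective."""
    return np.sum(p_teacher * (np.log(p_teacher + eps) - np.log(q_student + eps)), axis=-1)

def rkld(p_teacher, q_student, eps=1e-12):
    """Reverse KL(student || teacher): concentrates on high-confidence modes."""
    return np.sum(q_student * (np.log(q_student + eps) - np.log(p_teacher + eps)), axis=-1)
```

Both divergences are nonnegative, vanish when student and teacher agree, and generally disagree with each other, which is exactly the asymmetry the paper's analysis exploits.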
Paperid:306
Authors:Yiding Chen · Yiyi Zhang · Owen Oertell · Wen Sun
Abstract:
Diffusion models achieve remarkable success in data generation tasks across various domains. However, the iterative sampling process is computationally expensive. Consistency models are proposed to learn consistency functions that map from noise to data directly, which allows one-step fast data generation and multistep sampling to improve sample quality. In this paper, we study the convergence of consistency models when the self-consistency property holds approximately under the training distribution. Our analysis requires only mild data assumptions and applies to a family of forward processes. When the target data distribution has bounded support or has tails that decay sufficiently fast, we show that the samples generated by the consistency model are close to the target distribution in Wasserstein distance; when the target distribution satisfies some smoothness assumption, we show that with an additional perturbation step for smoothing, the generated samples are close to the target distribution in total variation distance. We provide two case studies with commonly chosen forward processes to demonstrate the benefit of multistep sampling.
Paperid:307
Authors:Shiba Biswal · Karthik Elamvazhuthi · Rishi Sonthalia
Abstract:
This paper investigates the use of transformers to approximate the mean-field dynamics of interacting particle systems exhibiting collective behavior. Such systems are fundamental in modeling phenomena across physics, biology, and engineering, including opinion formation, biological networks, and swarm robotics. The key characteristic of these systems is that the particles are indistinguishable, leading to permutation-equivariant dynamics. First, we empirically demonstrate that transformers are well-suited for approximating a variety of mean-field models, including the Cucker-Smale model for flocking and milling, and the mean-field system for training two-layer neural networks. We validate our numerical experiments via mathematical theory. Specifically, we prove that if a finite-dimensional transformer effectively approximates the finite-dimensional vector field governing the particle system, then the $L_\infty$ distance between the \textit{expected transformer} and the infinite-dimensional mean-field vector field can be bounded by a function of the number of particles observed during training. Leveraging this result, we establish theoretical bounds on the distance between the true mean-field dynamics and those obtained using the transformer.
Paperid:308
Authors:Bryan Lincoln Marques de Oliveira · Brandão · Murilo L. da Luz · Luana Martins · telma de lima soares · Luckeciano Melo
Abstract:
Learning effective visual representations enables agents to extract meaningful information from raw sensory inputs, which is essential for generalizing across different tasks. However, evaluating representation learning separately from policy learning remains a challenge with most reinforcement learning (RL) benchmarks. To address this gap, we introduce the Sliding Puzzles Gym (SPGym), a novel benchmark that reimagines the classic 8-tile puzzle with a visual observation space of images sourced from arbitrarily large datasets. SPGym provides precise control over representation complexity through visual diversity, allowing researchers to systematically scale the representation learning challenge while maintaining consistent environment dynamics. Despite the apparent simplicity of the task, our experiments with both model-free and model-based RL algorithms reveal fundamental limitations in current methods. As we increase visual diversity by expanding the pool of possible images, all tested algorithms show significant performance degradation, with even state-of-the-art methods struggling to generalize across different visual inputs while maintaining consistent puzzle-solving capabilities. These results highlight critical gaps in visual representation learning for RL and provide clear directions for improving robustness and generalization in decision-making systems.
Paperid:309
Authors:Pengchong Hu · Zhizhong Han
Abstract:
Jointly estimating camera poses and mapping scenes are fundamental tasks in simultaneous localization and mapping (SLAM) from RGBD images. State-of-the-art methods employ 3D Gaussians to represent a scene, and render these Gaussians through splatting for higher efficiency and better rendering. However, these methods cannot scale up to extremely large scenes, due to inefficient tracking and mapping strategies that need to optimize all 3D Gaussians in the limited GPU memory throughout the training to maintain geometry and color consistency with all previous RGBD observations. To resolve this issue, we propose novel tracking and mapping strategies to work with a novel 3D representation, dubbed view-tied 3D Gaussians, for RGBD SLAM systems. View-tied 3D Gaussians are simplified Gaussians that are tied to depth pixels, without needing to learn locations, rotations, and multi-dimensional variances. Tying Gaussians to views not only significantly saves storage but also allows us to employ many more Gaussians to represent local details in the limited GPU memory. Moreover, our strategies remove the need to keep all Gaussians learnable throughout the training, improving rendering quality and tracking and mapping accuracy. We justify the effectiveness of these designs, and report better performance over the latest methods on widely used benchmarks in terms of rendering and tracking accuracy and scalability.
Paperid:310
Authors:Matthew Wicker · Philip Sosnin · Igor Shilov · Adrianna Janik · Mark Müller · Yves-Alexandre de Montjoye · Adrian Weller · Calvin Tsay
Abstract:
We study private prediction where differential privacy is achieved by adding noise to the outputs of a non-private model. Existing methods rely on noise proportional to the global sensitivity of the model, often resulting in sub-optimal privacy-utility trade-offs compared to private training. We introduce a novel approach for computing dataset-specific upper bounds on prediction sensitivity by leveraging convex relaxation and bound propagation techniques. By combining these bounds with the smooth sensitivity mechanism, we significantly improve the privacy analysis of private prediction compared to global sensitivity-based approaches. Experimental results across real-world datasets in medical image classification and natural language processing demonstrate that our sensitivity bounds can be orders of magnitude tighter than global sensitivity. Our approach provides a strong basis for the development of novel privacy-preserving technologies.
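For context, the global-sensitivity baseline that this line of work improves upon is the classic output-perturbation mechanism: release the score plus Laplace noise of scale sensitivity/ε. A minimal sketch (names ours; the paper's smooth-sensitivity and bound-propagation machinery is not shown) makes the payoff of tighter bounds concrete, since the injected noise shrinks in direct proportion to the sensitivity bound:

```python
import numpy as np

def private_prediction(score, sensitivity, epsilon, rng):
    """Release score + Laplace(sensitivity / epsilon) noise.
    A tighter sensitivity bound directly shrinks the injected noise."""
    return score + rng.laplace(scale=sensitivity / epsilon)
```

At the same privacy budget ε, a bound that is 100x tighter yields noise with 100x smaller standard deviation.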
Paperid:311
Authors:David K Park · Xihaier Luo · Guang Zhao · Seungjun Lee · Miruna Oprescu · Shinjae Yoo
Abstract:
Spatiotemporal learning is challenging due to the intricate interplay between spatial and temporal dependencies, the high dimensionality of the data, and scalability constraints. These challenges are further amplified in scientific domains, where data is often irregularly distributed (e.g., missing values from sensor failures) and high-volume (e.g., high-fidelity simulations), posing additional computational and modeling difficulties. In this paper, we present SCENT, a novel framework for scalable and continuity-informed spatiotemporal representation learning. SCENT unifies interpolation, reconstruction, and forecasting within a single architecture. Built on a transformer-based encoder-processor-decoder backbone, SCENT introduces learnable queries to enhance generalization and a query-wise cross-attention mechanism to effectively capture multi-scale dependencies. To ensure scalability in both data size and model complexity, we incorporate a sparse attention mechanism, enabling flexible output representations and efficient evaluation at arbitrary resolutions. We validate SCENT through extensive simulations and real-world experiments, demonstrating state-of-the-art performance across multiple challenging tasks while achieving superior scalability.
Paperid:312
Authors:Piotr Kubaty · Bartosz Wójcik · Bartłomiej Krzepkowski · Monika Michaluk · Tomasz Trzcinski · Jary Pomponi · Kamil Adamczewski
Abstract:
Early exits enable the network's forward pass to terminate early by attaching trainable internal classifiers to the backbone network. Existing early-exit methods typically adopt either a joint training approach, where the backbone and exit heads are trained simultaneously, or a disjoint approach, where the heads are trained separately. However, the implications of this choice are often overlooked, with studies typically adopting one approach without adequate justification. This choice influences training dynamics, and its impact remains largely unexplored. In this paper, we introduce a set of metrics to analyze early-exit training dynamics and guide the choice of training strategy. We demonstrate that the conventionally used joint and disjoint regimes yield suboptimal performance. To address these limitations, we propose a mixed training strategy: the backbone is trained first, followed by the training of the entire multi-exit network. Through comprehensive evaluations of training strategies across various architectures, datasets, and early-exit methods, we present the strengths and weaknesses of early-exit training strategies. In particular, we show consistent improvements in performance and efficiency using the proposed mixed strategy.
Paperid:313
Authors:Amber Xie · Oleh Rybkin · Dorsa Sadigh · Chelsea Finn
Abstract:
Recent progress in imitation learning has been enabled by policy architectures that scale to complex visuomotor tasks, multimodal distributions, and large datasets. However, these methods often rely on learning from large amounts of expert demonstrations. To address these shortcomings, we propose Latent Diffusion Planning (LDP), a modular approach consisting of a planner which can leverage action-free demonstrations, and an inverse dynamics model which can leverage suboptimal data, both operating over a learned latent space. First, we learn a compact latent space through a variational autoencoder, enabling effective forecasting of future states in image-based domains. Then, we train a planner and an inverse dynamics model with diffusion objectives. By separating planning from action prediction, LDP can benefit from the denser supervision signals of suboptimal and action-free data. On simulated visual robotic manipulation tasks, LDP outperforms state-of-the-art imitation learning approaches, as they cannot leverage such additional data.
Paperid:314
Authors:Shuqing Luo · Pingzhi Li · Jie Peng · Yang Zhao · Yu Cao · Yu Cheng · Tianlong Chen
Abstract:
Mixture-of-experts (MoE) architectures could achieve impressive computational efficiency with expert parallelism, which relies heavily on all-to-all communication across devices. Unfortunately, such communication overhead typically constitutes a significant portion of the total runtime, hampering the scalability of distributed training and inference for modern MoE models (consuming over 40% runtime in large-scale training). In this paper, we first define $\textit{collaborative communication}$ to illustrate this intrinsic limitation, and then propose system- and algorithm-level innovations to reduce communication costs. Specifically, given a pair of experts co-activated by one token, we call them $\textit{collaborated}$, which comprises two cases, $\textit{intra-}$ and $\textit{inter-collaboration}$, depending on whether they are kept on the same device. Our pilot investigations reveal that augmenting the proportion of intra-collaboration can accelerate expert parallelism at scale. It motivates us to strategically $\underline{\texttt{o}}$ptimize $\underline{\texttt{c}}$ollaborative $\underline{\texttt{c}}$omm$\underline{\texttt{u}}$nication for acce$\underline{\texttt{l}}$era$\underline{\texttt{t}}$ed MoE training and inference, dubbed $\textbf{\texttt{Occult}}$. Our designs are capable of $\underline{either}$ delivering exact results with reduced communication cost, $\underline{or}$ controllably minimizing the cost with collaboration pruning, materialized by modified fine-tuning. Comprehensive experiments on various MoE-LLMs demonstrate that $\texttt{Occult}$ can be faster than popular state-of-the-art inference or training frameworks (over 50% speedup across multiple tasks and models) with comparable or superior quality compared to standard fine-tuning. Code will be available upon acceptance.
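The intra-/inter-collaboration dichotomy defined above can be made concrete with a small counting routine over the router's per-token expert assignments (a sketch with hypothetical names, not the $\texttt{Occult}$ implementation):

```python
from itertools import combinations

def collaboration_stats(routed_experts, expert_device):
    """Count collaborated expert pairs: intra- if the co-activated pair lives
    on the same device, inter- if the pair spans devices (and thus needs
    all-to-all communication)."""
    intra = inter = 0
    for experts in routed_experts:               # experts co-activated by one token
        for a, b in combinations(sorted(set(experts)), 2):
            if expert_device[a] == expert_device[b]:
                intra += 1
            else:
                inter += 1
    return intra, inter
```

Raising the intra count relative to the inter count is exactly what reduces the all-to-all traffic the abstract targets.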
Paperid:315
Authors:Flavio Petruzzellis · Cristina Cornelio · Pietro Lió
Abstract:
Large Language Models (LLMs) have shown promise as robotic planners but often struggle with long-horizon and complex tasks, especially in specialized environments requiring external knowledge. While hierarchical planning and Retrieval-Augmented Generation (RAG) address some of these challenges, they remain insufficient on their own, and a deeper integration is required for achieving more reliable systems. To this end, we propose a neuro-symbolic approach that enhances LLM-based planners with Knowledge Graph-based RAG for hierarchical plan generation. This method decomposes complex tasks into manageable subtasks, further expanded into executable atomic action sequences. To ensure formal correctness and proper decomposition, we integrate a symbolic validator, which also functions as a failure detector by aligning expected and observed world states. Our evaluation against baseline methods demonstrates the consistent, significant advantages of integrating hierarchical planning, symbolic verification, and RAG across tasks of varying complexity and different LLMs. Additionally, our experimental setup and novel metrics not only validate our approach for complex planning but also serve as a tool for assessing LLMs' reasoning and compositional capabilities.
Paperid:316
Authors:Sunil Madhow · Dheeraj Baby · Yu-Xiang Wang
Abstract:
In order to attain optimal rates, state-of-the-art algorithms for non-parametric regression require that a hyperparameter be tuned according to the smoothness of the ground truth (Tibshirani, 2014). This amounts to an assumption of oracle access to certain features of the data-generating process. We present a parameter-free algorithm for offline non-parametric regression over $TV_1$-bounded functions. By feeding offline data into an optimal online denoising algorithm styled after (Baby et al., 2021), we are able to use change-points to adaptively select knots that respect the geometry of the underlying ground truth. We call this procedure AKORN (Adaptive Knots generated Online for RegressioN splines). By combining forward and backward passes over the data, we obtain an estimator whose empirical performance is close to Trend Filtering (Kim et al., 2009; Tibshirani, 2014), even when we provide the latter with oracle knowledge of the ground truth's smoothness.
Paperid:317
Authors:Lokesh Veeramacheneni · Moritz Wolter · Hilde Kuehne · Juergen Gall
Abstract:
Modern methods for fine-tuning a Vision Transformer (ViT), like Low-Rank Adaptation (LoRA) and its variants, demonstrate impressive performance. However, these methods ignore the high-dimensional nature of Multi-Head Attention (MHA) weight tensors. To address this limitation, we propose Canonical Rank Adaptation (CaRA). CaRA leverages tensor mathematics, first by tensorising the transformer into two different tensors: one for projection layers in MHA and the other for feed-forward layers. Second, the tensorised formulation is fine-tuned using low-rank adaptation in Canonical-Polyadic Decomposition (CPD) form. Employing CaRA efficiently minimizes the number of trainable parameters. Experimentally, CaRA outperforms existing Parameter-Efficient Fine-Tuning (PEFT) methods on visual classification benchmarks such as the Visual Task Adaptation Benchmark (VTAB)-1k and Fine-Grained Visual Categorization (FGVC).
Paperid:318
Authors:Suryanarayana Sankagiri · Jalal Etesami · Matthias Grossglauser
Abstract:
This paper provides a theoretical analysis of a new learning problem for recommender systems where users provide feedback by comparing pairs of items instead of rating them individually. We assume that comparisons stem from latent user and item features, which reduces the task of predicting preferences to learning these features from comparison data. Similar to the classical matrix completion problem, the main challenge in this learning task is that the resulting loss function is nonconvex. Our analysis shows that the loss function exhibits (restricted) strong convexity near the true solution, which ensures gradient-based methods converge exponentially, given a warm start. Importantly, this result holds in a sparse data regime, where each user compares only a few pairs of items. Our main technical contribution is to extend certain concentration inequalities commonly used in matrix completion to our model. Our work demonstrates that learning personalized recommendations from comparison data is computationally and statistically efficient.
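A toy version of this learning problem fits in a few lines: latent user and item vectors, a logistic likelihood on pairwise comparisons, and plain gradient descent. This is our illustrative reconstruction of the setup (model and names assumed, not the paper's exact loss or algorithm):

```python
import numpy as np

def nll(U, V, comps):
    """Mean negative log-likelihood: user u prefers item i to item j
    with probability sigmoid(<U[u], V[i] - V[j]>)."""
    s = np.array([U[u] @ (V[i] - V[j]) for u, i, j in comps])
    return np.mean(np.log1p(np.exp(-s)))

def gd_step(U, V, comps, lr=0.1):
    """One gradient-descent step on the comparison loss."""
    gU, gV = np.zeros_like(U), np.zeros_like(V)
    for u, i, j in comps:
        s = U[u] @ (V[i] - V[j])
        w = -1.0 / (1.0 + np.exp(s))   # d/ds log(1 + e^{-s}) = -sigmoid(-s)
        gU[u] += w * (V[i] - V[j])
        gV[i] += w * U[u]
        gV[j] -= w * U[u]
    n = len(comps)
    return U - lr * gU / n, V - lr * gV / n
```

Although this loss is jointly nonconvex in (U, V), gradient descent from a reasonable initialization steadily decreases it, which is the behavior the paper's restricted-strong-convexity analysis explains near the true solution.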
Paperid:319
Authors:Ariel Procaccia · Benjamin Schiffer · Shirley Zhang
Abstract:
A key challenge in training Large Language Models (LLMs) is properly aligning them with human preferences. Reinforcement Learning with Human Feedback (RLHF) uses pairwise comparisons from human annotators to train reward functions and has emerged as a popular alignment method. However, input datasets in RLHF can be unbalanced due to adversarial manipulation or inadvertent repetition. Therefore, we want RLHF algorithms to perform well even when the set of alternatives is not uniformly distributed. Drawing on insights from social choice theory, we introduce robustness to approximate clones, a desirable property of RLHF algorithms which requires that adding near-duplicate alternatives does not significantly change the learned reward function. We first demonstrate that the standard RLHF algorithm based on regularized maximum likelihood estimation (MLE) fails to satisfy this property. We then propose the weighted MLE, a new RLHF algorithm that modifies the standard regularized MLE by weighting alternatives based on their similarity to other alternatives. This new algorithm guarantees robustness to approximate clones while preserving desirable theoretical properties.
Paperid:320
Authors:Laure Ciernik · Lorenz Linhardt · Marco Morik · Jonas Dippel · Simon Kornblith · Lukas Muttenthaler
Abstract:
The Platonic Representation Hypothesis claims that recent foundation models are converging to a shared representation space as a function of their downstream task performance, irrespective of the objectives and data modalities used to train these models. Representational similarity is generally measured for individual datasets and is not necessarily consistent across datasets. Thus, one may wonder whether this convergence of model representations is confounded by the datasets commonly used in machine learning. Here, we propose a systematic way to measure how representational similarity between models varies with the set of stimuli used to construct the representations. We find that the objective function is a crucial factor in determining the consistency of representational similarities across datasets. Specifically, self-supervised vision models learn representations whose relative pairwise similarities generalize better from one dataset to another compared to those of image classification or image-text models. Moreover, the correspondence between representational similarities and the models' task behavior is dataset-dependent, being most strongly pronounced for single-domain datasets. Our work provides a framework for analyzing similarities of model representations across datasets and linking those similarities to differences in task behavior.
Paperid:321
Authors:Erica Zhang · Fangzhao Zhang · Mert Pilanci
Abstract:
Active learning methods aim to improve sample complexity in machine learning. In this work, we investigate an active learning scheme via a novel gradient-free cutting-plane training method for ReLU networks of arbitrary depth and develop a convergence theory. We demonstrate, for the first time, that cutting-plane algorithms, traditionally used in linear models, can be extended to deep neural networks despite their nonconvexity and nonlinear decision boundaries. Moreover, this training method induces the first deep active learning scheme known to achieve convergence guarantees, revealing a geometric contraction rate of the feasible set. We demonstrate the effectiveness of our proposed active learning method against popular deep active learning baselines via both synthetic data experiments and sentiment classification tasks on real datasets.
Paperid:322
Authors:Sidak Pal Singh · Hossein Mobahi · Atish Agarwala · Yann Nicolas Dauphin
Abstract:
Curvature regularization techniques like Sharpness Aware Minimization (SAM) have shown great promise in improving generalization on vision tasks. However, we find that SAM performs poorly in domains like natural language processing (NLP), often degrading performance even with twice the compute budget. We investigate the discrepancy across domains and find that in the NLP setting, SAM is dominated by regularization of the logit statistics --- instead of improving the geometry of the function itself. We use this observation to develop an alternative algorithm we call Functional SAM, which regularizes curvature only through modification of the statistics of the overall function implemented by the neural network, and avoids spurious minimization through logit manipulation. Furthermore, we argue that preconditioning the SAM perturbation also prevents spurious minimization, and when combined with Functional SAM, it gives further improvements. Our proposed algorithms show improved performance over AdamW and SAM baselines when trained for an equal number of steps, in both fixed-length and Chinchilla-style training settings, at various model scales (including billion-parameter scale). On the whole, our work highlights the importance of more precise characterizations of sharpness in broadening the applicability of curvature regularization to large language models (LLMs).
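For reference, the base SAM update that the abstract's Functional SAM modifies takes an ascent step of radius ρ along the normalized gradient, then descends using the gradient computed at that perturbed point. A minimal sketch of standard SAM (not the proposed Functional SAM):

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM update: perturb toward the (first-order) worst nearby point,
    then apply the gradient computed there."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent step of radius rho
    return w - lr * grad_fn(w + eps)             # descend with the perturbed gradient
```

On a simple quadratic, iterating this update drives the parameters toward the minimum, with the extra gradient evaluation per step being the source of SAM's doubled compute cost.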
Paperid:323
Authors:Yunfeng Liao · Jiawen Guan · Xiucheng Li
Abstract:
Deep models have recently achieved remarkable performance in solving partial differential equations (PDEs). Previous methods mostly focus on PDEs arising in Euclidean spaces, with less emphasis on general manifolds with rich geometry. Several proposals attempt to account for the geometry by exploiting the spatial coordinates but overlook the underlying intrinsic geometry of manifolds. In this paper, we propose a Curvature-aware Graph Attention for PDEs on manifolds by exploring important intrinsic geometric quantities such as curvature and the discrete gradient operator. It is realized via parallel transport and tensor fields on manifolds. To accelerate computation, we present three curvature-oriented graph embedding approaches and derive closed-form parallel transport equations, and a sub-tree partition method is also developed to promote parameter sharing. Our proposed curvature-aware attention can be used as a replacement for vanilla attention, and experiments show that it significantly improves the performance of existing methods for solving PDEs on manifolds. Our code is available at https://anonymous.4open.science/r/icml2025-5376/.
Paperid:324
Authors:Ayça Takmaz · Cristiano Saltori · Neehar Peri · Tim Meinhardt · Riccardo de Lutio · Laura Leal-Taixé · Aljosa Osep
Abstract:
Panoptic scene completion methods estimate dense object and scene geometry from sparse Lidar data. However, contemporary methods can only complete and recognize objects from a closed vocabulary, labeled in existing Lidar datasets. To overcome this limitation, we introduce CAL (Complete Anything in Lidar). Our zero-shot approach uses the temporal context from multi-modal sensor sequences to mine object shapes and semantic features that are then distilled into a Lidar-only instance-level completion and recognition model. Although we only mine partial shape completions, we find that our distilled model learns to infer full object shapes from multiple such partial observations across the dataset. We show that our model can be prompted on standard benchmarks for Semantic and Panoptic Scene Completion, localize objects as (amodal) 3D bounding boxes, and recognize objects beyond fixed class vocabularies.
Paperid:325
Authors:Louis Béthune · David Grangier · Dan Busbridge · Eleonora Gualdoni · Marco Cuturi · Pierre Ablin
Abstract:
A widespread strategy to obtain a language model that performs well on a target domain is to fine-tune a pretrained model to perform unsupervised next-token prediction on data from that target domain. Fine-tuning presents two challenges: (i) if the amount of target data is limited, as in most practical applications, the model will quickly overfit, and (ii) the model will drift away from the original model, forgetting the pretraining data and the generic knowledge that comes with it. Our goal is to derive scaling laws that quantify these two phenomena for various target domains, amounts of available target data, and model scales. We measure the efficiency of injecting pretraining data into the fine-tuning data mixture to avoid forgetting and mitigate overfitting. A key practical takeaway from our study is that injecting as little as 1% of pretraining data into the fine-tuning data mixture prevents the model from forgetting the pretraining set.
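The 1%-injection takeaway corresponds to a trivially simple data-loading change: with small probability, draw each training example from the pretraining pool instead of the fine-tuning pool. An illustrative sketch (names ours):

```python
import random

def mixed_stream(finetune_pool, pretrain_pool, pretrain_frac=0.01, n=10000, seed=0):
    """Sample a fine-tuning stream in which roughly `pretrain_frac` of the
    examples come from the pretraining pool, to mitigate forgetting."""
    rng = random.Random(seed)
    def pick():
        pool = pretrain_pool if rng.random() < pretrain_frac else finetune_pool
        return rng.choice(pool)
    return [pick() for _ in range(n)]
```

Over a long stream, the realized pretraining fraction concentrates around the requested 1%.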
Paperid:326
Authors:Ekaterina Borodich · Alexander Gasnikov · Dmitry Kovalev
Abstract:
We revisit the smooth convex-concave bilinearly-coupled saddle-point problem of the form $\min_x\max_y f(x) + \langle y,\mathbf{B} x\rangle - g(y)$. In the highly specific case where function $f(x)$ is strongly convex and function $g(y)$ is affine, or both functions are affine, there exist lower bounds on the number of gradient evaluations and matrix-vector multiplications required to solve the problem, as well as matching optimal algorithms. A notable aspect of these algorithms is that they are able to attain linear convergence, i.e., the number of iterations required to solve the problem is proportional to $\log(1/\epsilon)$. However, the class of bilinearly-coupled saddle-point problems for which linear convergence is possible is much wider and can involve general smooth non-strongly convex functions $f(x)$ and $g(y)$. Therefore, *we develop the first lower complexity bounds and matching optimal linearly converging algorithms for this problem class*. Our lower complexity bounds are much more general, but they cover and unify the existing results in the literature. On the other hand, our algorithm implements the separation of complexities, which, for the first time, enables the simultaneous achievement of both optimal gradient evaluation and matrix-vector multiplication complexities, resulting in the best theoretical performance to date.
Paperid:327
Authors:Zecheng Tang · Zechen Sun · Juntao Li · Zhu Qiaoming · Min Zhang
Abstract:
Long-context models (LCMs) have shown great potential in processing long input sequences (even more than 100M tokens) conveniently and effectively. With significant progress, recent research has pointed out that LCMs can accurately locate token-level salient information within the context. Yet, the generation performance of these LCMs is far from satisfactory and can result in misaligned responses, such as hallucinations. To enhance the generation capability of LCMs, existing works have investigated the effects of data size and quality for both pre-training and instruction tuning. Though achieving meaningful improvements, previous methods fall short in either effectiveness or efficiency. In this paper, we introduce LOGO (Long cOntext aliGnment via efficient preference Optimization), the first training strategy to introduce preference optimization for long-context alignment. To overcome the GPU memory-bound issue caused by long sequences, LOGO employs a reference-free preference optimization strategy and adopts a position synthesis method to construct the training data. By training with only 0.3B data on a single 8 x A800 GPU machine for 16 hours, LOGO allows the Llama-3-8B-Instruct-80K model to achieve performance comparable to GPT-4 in real-world long-context tasks while preserving the model's original capabilities on other tasks, e.g., language modeling and MMLU. Moreover, LOGO can extend the model's context window size while enhancing its generation performance.
Paperid:328
Authors:Zhiwei XU · Kun Hu · Xin Xin · Weiliang Meng · Yiwei Shi · Hangyu Mao · Bin Zhang · dapeng Li · Jiangjin Yin
Abstract:
Generalizing multi-agent reinforcement learning (MARL) to accommodate variations in problem configurations remains a critical challenge in real-world applications, where even subtle differences in task setups can cause pre-trained policies to fail. To address this, we propose Context-Aware Identity Generation (CAID), a novel framework to enhance MARL performance under the Contextual MARL (CMARL) setting. CAID dynamically generates unique agent identities through an agent identity decoder built on a causal Transformer architecture. These identities provide contextualized representations that align corresponding agents across similar problem variants, facilitating policy reuse and improving sample efficiency. Furthermore, the action regulator in CAID incorporates these agent identities into the action-value space, enabling seamless adaptation to varying contexts. Extensive experiments on CMARL benchmarks demonstrate that CAID significantly outperforms existing approaches by enhancing both sample efficiency and generalization across diverse context variants.
Paperid:329
Authors:Tianjie Ju · Yi Hua · Hao Fei · Zhenyu Shao · Yubin Zheng · Haodong Zhao · Mong-Li Lee · Wynne Hsu · Zhuosheng Zhang · Gongshen Liu
Abstract:
Multimodal Large Language Models (MLLMs) have exhibited remarkable performance on various vision-language tasks such as Visual Question Answering (VQA). Despite accumulating evidence of privacy concerns associated with task-relevant content, it remains unclear whether MLLMs inadvertently memorize private content that is entirely irrelevant to the training tasks. In this paper, we investigate how randomly generated task-irrelevant private content can become spuriously correlated with downstream objectives due to partial mini-batch training dynamics, thus causing inadvertent memorization. Concretely, we randomly insert task-irrelevant watermarks into VQA fine-tuning images at varying probabilities and propose a novel probing framework to determine whether MLLMs have inadvertently encoded such content. Our experiments reveal that MLLMs exhibit notably different training behaviors in partial mini-batch settings with task-irrelevant watermarks embedded. Furthermore, through layer-wise probing, we demonstrate that MLLMs trigger distinct representational patterns when encountering previously seen task-irrelevant knowledge, even if this knowledge does not influence their output during prompting. Our code is available at anonymity.
Paperid:330
Authors:Nadav Timor · Jonathan Mamou · Daniel Korat · Moshe Berchansky · Oren Pereg · Gaurav Jain · Moshe Wasserblat · David Harel
Abstract:
Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass. However, existing SD approaches require the drafter and target models to share the same vocabulary, thus limiting the pool of possible drafters and often necessitating the training of a drafter from scratch. We present three new SD methods that remove this shared-vocabulary constraint. All three methods preserve the target distribution (i.e., they are lossless) and work with off-the-shelf models without requiring additional training or modifications. Empirically, on summarization and long-context tasks, our algorithms achieve significant speedups over standard autoregressive decoding. By enabling any off-the-shelf model to serve as a drafter and requiring no retraining, this work substantially broadens the applicability of the SD framework in practice.
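For context, the standard shared-vocabulary verification step that makes speculative decoding lossless can be sketched as follows. This is the classical accept/resample rule the new methods generalize beyond, not the paper's vocabulary-mismatch algorithms themselves.

```python
import random

def verify_draft_token(p, q, x, rng=None):
    """Classical lossless speculative-decoding verification step.

    p, q: target and drafter distributions over the same token ids (dicts);
    x: token proposed by the drafter. Accept x with probability
    min(1, p[x]/q[x]); otherwise resample from the residual distribution
    max(p - q, 0), renormalized. This preserves the target distribution
    exactly, which is what "lossless" means above. Sketch only."""
    rng = rng or random.Random(0)
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    residual = {t: max(p[t] - q[t], 0.0) for t in p}
    z = sum(residual.values())
    r = rng.random() * z
    for t, w in residual.items():
        r -= w
        if r <= 0:
            return t
    return t  # fall through on floating-point rounding
```

When the drafter already matches the target (p == q), every draft token is accepted, so the speedup is largest for well-aligned drafters.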
Paperid:331
Authors:Chenguang Wang · Kaiyuan Cui · Weichen Zhao · Tianshu Yu
Abstract:
Sampling from binary quadratic distributions (BQDs) is a fundamental but challenging problem in discrete optimization and probabilistic inference. Previous work established theoretical guarantees for stochastic localization (SL) in continuous domains, where MCMC methods efficiently estimate the required posterior expectations during SL iterations. However, achieving similar convergence guarantees for discrete MCMC samplers in posterior estimation presents unique theoretical challenges. In this work, we present the first application of SL to general BQDs, proving that after a certain number of iterations, the external field of the posterior distributions constructed by SL tends to infinity almost everywhere; the posteriors therefore satisfy Poincaré inequalities with probability close to 1, leading to polynomial-time mixing. This theoretical breakthrough enables efficient sampling from general BQDs, even those that may not originally possess fast mixing properties. Furthermore, our analysis, covering a broad range of discrete MCMC samplers based on Glauber dynamics and Metropolis-Hastings algorithms, demonstrates the wide applicability of our theoretical framework. Experiments on instances with quadratic unconstrained binary objectives, including maximum independent set, maximum cut, and maximum clique problems, demonstrate consistent improvements in sampling efficiency across different discrete MCMC samplers.
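For readers unfamiliar with the samplers mentioned above, a single Glauber-dynamics update for a binary quadratic distribution looks like this. It is a generic textbook sketch of the inner sampler, not the paper's stochastic-localization procedure; the parameterization (spins in {-1, +1}, symmetric coupling matrix) is an assumption.

```python
import math
import random

def glauber_step(x, J, h, rng):
    """One Glauber-dynamics update for the binary quadratic distribution
        pi(x) ∝ exp( x^T J x / 2 + h^T x ),  x in {-1, +1}^n.
    Picks a random site and resamples it from its exact conditional.
    J is assumed symmetric with zero diagonal. Generic sketch only."""
    n = len(x)
    i = rng.randrange(n)
    # local field at site i; a large field pins x_i, which is why a
    # divergent external field (as in the abstract) yields fast mixing
    field = h[i] + sum(J[i][j] * x[j] for j in range(n) if j != i)
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * field))  # P(x_i = +1 | rest)
    x[i] = 1 if rng.random() < p_plus else -1
    return x
```

With a strong positive external field and no coupling, the chain pins every spin to +1 after a few sweeps, illustrating the mechanism behind the mixing guarantee.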
Paperid:332
Authors:Yaolun Zhang · Xiaogeng Liu · Chaowei Xiao
Abstract:
Large Language Models (LLMs) can solve various practical tasks via a multi-agent system. However, existing human-designed multi-agent systems can only adapt to a limited number of pre-defined scenarios. Current auto-designed methods also have several drawbacks, including no tool support, reliance on external training data, and inflexible communication structures. Therefore, we propose \textbf{MetaAgent}, a novel framework to automatically generate a multi-agent system based on a \textbf{finite state machine}. Given a task description, MetaAgent will design a multi-agent system and polish it through an optimization algorithm that merges redundant states. When the multi-agent system is deployed, the finite state machine controls the agents' actions and the state transitions. To evaluate our framework, we conduct experiments on both text-based tasks and practical tasks; the results indicate that the generated multi-agent system surpasses other auto-designed methods and achieves performance comparable to human-designed multi-agent systems polished for those specific tasks.
Paperid:333
Authors:Tianci Liu · Ruirui Li · Zihan Dong · Hui Liu · Xianfeng Tang · Qingyu Yin · Linjun Zhang · Haoyu Wang · Jing Gao
Abstract:
Large language models (LLMs) have achieved remarkable performance on various natural language tasks. However, they are trained on static corpora, and their knowledge can become outdated quickly in the fast-changing world. This motivates the development of knowledge editing (KE) to update specific knowledge in LLMs without changing unrelated knowledge or compromising their pre-trained capabilities. Previous efforts sought to update a small number of parameters of an LLM and proved effective for making selective updates. Nonetheless, the edited LLM often exhibits a degraded ability to reason about the new knowledge. In this work, we identify a key issue: \textit{heterogeneous token overfitting} (HTO), where the LLM overfits different tokens in the provided knowledge at varying rates. To tackle this, we propose OVERTONE, a token-level smoothing method that mitigates HTO by adaptively refining the target distribution. Theoretically, OVERTONE offers better parameter updates with negligible computation overhead. It also induces an implicit DPO but does not require preference data pairs. Extensive experiments across four editing methods, two LLMs, and diverse scenarios demonstrate the effectiveness and versatility of our method.
Paperid:334
Authors:Yizi Zhang · Yanchen Wang · Mehdi Azabou · Alexandre Andre · Zixuan Wang · Hanrui Lyu · International Brain Laboratory · Eva Dyer · Liam Paninski · Cole Hurwitz
Abstract:
Recent work has demonstrated that large-scale, multi-animal models are powerful tools for characterizing the relationship between neural activity and behavior. Current large-scale approaches, however, focus exclusively on either predicting neural activity from behavior (encoding) or predicting behavior from neural activity (decoding), limiting their ability to capture the bidirectional relationship between neural activity and behavior. To bridge this gap, we introduce a multimodal, multi-task model that enables simultaneous Neural Encoding and Decoding at Scale (NEDS). Central to our approach is a novel multi-task-masking strategy, which alternates between neural, behavioral, within-modality, and cross-modality masking. We pretrain our method on the International Brain Laboratory (IBL) repeated site dataset, which includes recordings from 83 animals performing the visual decision-making task. In comparison to other large-scale modeling approaches, we demonstrate that NEDS achieves state-of-the-art performance for both encoding and decoding when pretrained on multi-animal data and then fine-tuned on new animals. Surprisingly, NEDS's learned embeddings exhibit emergent properties: even without explicit training, they are highly predictive of the brain regions in each recording. Altogether, our approach is a step towards a foundation model of the brain that enables seamless translation between neural activity and behavior.
Paperid:335
Authors:Ni Mu · Hao Hu · Xiao Hu · Yiqin Yang · Bo XU · Qing-Shan Jia
Abstract:
Preference-based reinforcement learning (PbRL) bypasses explicit reward engineering by inferring reward functions from human preference comparisons, enabling better alignment with human intentions. However, humans often struggle to label a clear preference between similar segments, reducing label efficiency and limiting PbRL's real-world applicability. To address this, we propose an offline PbRL method: Contrastive LeArning for ResolvIng Ambiguous Feedback (CLARIFY), which learns a trajectory embedding space that incorporates preference information, ensuring clearly distinguished segments are spaced apart, thus facilitating the selection of more unambiguous queries. Extensive experiments demonstrate that CLARIFY outperforms baselines in settings with both non-ideal teachers and real human feedback. Our approach not only selects more distinguishable queries but also learns meaningful trajectory embeddings.
Paperid:336
Authors:Chenyin Gao · Shu Yang · Mingyang Shan · Wenyu Ye · Ilya Lipkovich · Douglas Faries
Abstract:
Censored survival data are common in clinical trials, but small control groups can pose challenges, particularly in rare diseases or where balanced randomization is impractical. Recent approaches leverage external controls from historical studies or real-world data to strengthen treatment evaluation for survival outcomes. However, using external controls directly may introduce biases due to data heterogeneity. We propose a doubly protected estimator for the treatment-specific restricted mean survival time difference that is more efficient than trial-only estimators and mitigates biases from external data. Our method adjusts for covariate shifts via doubly robust estimation and addresses outcome drift using the DR-Learner for selective borrowing. The approach incorporates machine learning to approximate survival curves and detect outcome drifts without strict parametric assumptions, borrowing only comparable external controls. Extensive simulation studies and a real-data application evaluating the efficacy of Galcanezumab in mitigating migraine headaches have been conducted to illustrate the effectiveness of our proposed framework.
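As background for the estimand above, the restricted mean survival time (RMST) up to a horizon tau is the area under the survival curve on [0, tau]. A minimal Kaplan-Meier-based sketch follows; it illustrates the estimand only, not the proposed doubly protected estimator, and the handling of ties is simplified.

```python
def km_rmst(times, events, tau):
    """Kaplan-Meier survival estimate and RMST(tau) = integral_0^tau S(t) dt.

    times: observed follow-up times; events: 1 if the event occurred,
    0 if the observation was censored. Simplified sketch (no tie handling)."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    s, rmst, last_t = 1.0, 0.0, 0.0
    at_risk = len(times)
    for i in order:
        t = min(times[i], tau)
        rmst += s * (t - last_t)      # accumulate area under the step curve
        last_t = t
        if times[i] > tau:
            break
        if events[i]:                 # event: Kaplan-Meier drop
            s *= 1.0 - 1.0 / at_risk
        at_risk -= 1                  # censored or event: leaves risk set
    rmst += s * (tau - last_t) if last_t < tau else 0.0
    return s, rmst
```

The treatment-specific RMST difference the abstract targets is simply RMST(tau) under treatment minus RMST(tau) under control.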
Paperid:337
Authors:Yanxiang Ma · Zixuan Huang · Minjing Dong · Shan You · Chang Xu
Abstract:
Random defense represents a promising strategy to protect neural networks from adversarial attacks. Most of these methods enhance robustness by injecting randomness into the data, increasing uncertainty for attackers. However, this randomness can reduce the generalization capacity of the defense, as defense performance can be sensitive to the hyperparameters of the noise added to the data, making it difficult to generalize across different datasets. Additionally, injecting randomness typically reduces natural accuracy, creating a delicate trade-off between robustness and natural accuracy that is seldom studied in random defense. In this work, we propose incorporating randomness into the network structure instead of the data input by designing a stochastic deformable convolution, where a random mask replaces the convolutional offset. This process promotes data independence, enhancing generalization across datasets. To study the trade-off, we conduct a theoretical analysis of both robust and clean accuracy from the perspective of gradient cosine similarity and natural inference. Based on the analysis, we reformulate adversarial training in our random defense framework. Extensive experiments show that our method achieves SOTA adversarial robustness and clean accuracy compared with other random defense methods.
Paperid:338
Authors:Tao Tang · Lijun Zhou · Pengkun Hao · Zihang He · Kalok Ho · Shuo Gu · Zhihui Hao · Haiyang Sun · Kun Zhan · Peng Jia · XianPeng Lang · Xiaodan Liang
Abstract:
3D multiple object tracking (MOT) plays a crucial role in autonomous driving perception. Recent end-to-end query-based trackers simultaneously detect and track objects, which have shown promising potential for the 3D MOT task. However, existing methods are still in the early stages of development and lack systematic improvements, failing to track objects in certain complex scenarios, such as occlusions and small target objects. In this paper, we first summarize the current end-to-end 3D MOT framework by decomposing it into three constituent parts: query initialization, query propagation, and query matching. Then we propose corresponding improvements, which lead to a strong yet simple tracker: S2-Track. Specifically, for query initialization, we present 2D-Prompted Query Initialization, which leverages predicted 2D object and depth information to prompt an initial estimate of the object's 3D location. For query propagation, we introduce an Uncertainty-aware Probabilistic Decoder to capture the uncertainty of complex environments in object prediction with probabilistic attention. For query matching, we propose a Hierarchical Query Denoising strategy to enhance training robustness and convergence. As a result, our S2-Track achieves state-of-the-art performance on the nuScenes benchmark, i.e., 66.3% AMOTA on the test split, surpassing the previous best end-to-end solution by a significant margin of 8.9% AMOTA.
Paperid:339
Authors:Yafei YANG · Zihui Zhang · Bo Yang
Abstract:
We study the challenging problem of unsupervised multi-object segmentation on single images. Existing methods, which rely on image reconstruction objectives to learn objectness or leverage pretrained image features to group similar pixels, often succeed only in segmenting simple synthetic objects or discovering a limited number of real-world objects. In this paper, we introduce OCN, a novel two-stage pipeline designed to identify many complex objects in real-world images. The key to our approach involves explicitly learning three levels of carefully defined object-centric representations in the first stage. Subsequently, our multi-object reasoning module utilizes these learned object priors to discover multiple objects in the second stage. Notably, this reasoning module is entirely network-free and does not require human labels. Extensive experiments demonstrate that OCN significantly outperforms all existing unsupervised methods across 6 real-world benchmark datasets, including the challenging COCO dataset, achieving state-of-the-art object segmentation results. Remarkably, our method excels in crowded images where all baselines collapse.
Paperid:340
Authors:Zhengyu Zhou · Weiwei Liu
Abstract:
Continuous Normalizing Flows (CNFs) have proven to be a highly efficient technique for generative modeling of complex data since the introduction of Flow Matching (FM). The core of FM is to learn the constructed velocity fields of CNFs through deep least squares regression. Despite its empirical effectiveness, theoretical investigations of FM remain limited. In this paper, we present the first end-to-end error analysis of CNFs built upon FM. Our analysis shows that for general target distributions with bounded support, the generated distribution of FM is guaranteed to converge to the target distribution in the sense of the Wasserstein-2 distance. Furthermore, the convergence rate is significantly improved under an additional mild Lipschitz condition of the target score function.
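The "constructed velocity fields" and "deep least squares regression" mentioned above can be made concrete with the standard conditional flow-matching construction on a linear path. This is a generic FM sketch under common conventions (Gaussian source, linear interpolation path), not the specific setup analyzed in the paper.

```python
import numpy as np

def cfm_batch(x1, rng):
    """Build one conditional flow-matching regression batch on the linear path
        x_t = (1 - t) x_0 + t x_1,  x_0 ~ N(0, I),  t ~ U[0, 1].
    The regression target is the constructed velocity u = x_1 - x_0; a model
    v_theta(x_t, t) is then fit to u by least squares. Generic sketch."""
    x0 = rng.standard_normal(x1.shape)
    t = rng.uniform(0.0, 1.0, size=(x1.shape[0], 1))
    xt = (1.0 - t) * x0 + t * x1
    u = x1 - x0
    return xt, t, u

def fm_loss(v_pred, u):
    # the deep least-squares regression objective E ||v_theta(x_t, t) - u||^2
    return float(np.mean((v_pred - u) ** 2))
```

A useful sanity check: following the target velocity from x_t for the remaining time 1 - t recovers the data point exactly, i.e. x_t + (1 - t) u = x_1.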
Paperid:341
Authors:Mang Ning · Mingxiao Li · Jianlin Su · Jia Haozhe · Lanmiao Liu · Martin Benes · Wenshuo Chen · Albert Ali Salah · Itir Onal Ertugrul
Abstract:
This paper explores image modeling from the frequency space and introduces DCTdiff, an end-to-end diffusion generative paradigm that efficiently models images in the discrete cosine transform (DCT) space. We investigate the design space of DCTdiff and reveal the key design factors. Experiments on different frameworks (UViT, DiT), generation tasks, and various diffusion samplers demonstrate that DCTdiff outperforms pixel-based diffusion models regarding generative quality and training efficiency. Remarkably, DCTdiff can seamlessly scale up to 512$\times$512 resolution without using the latent diffusion paradigm and beats latent diffusion with only 1/4 of the training cost. Finally, we illustrate several intriguing properties of DCT image modeling. For example, we provide a theoretical proof of why `image diffusion can be seen as spectral autoregression', bridging the gap between diffusion and autoregressive models. The effectiveness of DCTdiff and the introduced properties suggest a promising direction for image modeling in the frequency space. The code is available.
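For intuition about the DCT space the method operates in, the transform is an orthonormal, exactly invertible change of basis. The sketch below builds the 2-D DCT from the orthonormal DCT-II matrix; DCTdiff's actual block layout and frequency weighting are design choices described in the paper and are not reproduced here.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C (so C @ C.T = I)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0] /= np.sqrt(2.0)  # DC row rescaled for orthonormality
    return C

def dct2(img):
    """2-D DCT of a square image: transform rows and columns."""
    C = dct_matrix(img.shape[0])
    return C @ img @ C.T

def idct2(coef):
    """Exact inverse: orthonormality makes the transpose the inverse."""
    C = dct_matrix(coef.shape[0])
    return C.T @ coef @ C
```

Because the transform is lossless, modeling images in DCT coefficients loses no information; the appeal is that natural-image energy concentrates in low frequencies.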
Paperid:342
Authors:Peipeng Yu · Hui Gao · Xuan Feng · Jianwei Fei · Zhihua Xia · Chip Hong Chang
Abstract:
Current vision-language models (VLMs) have demonstrated remarkable capabilities in understanding multimodal data, but their potential remains underexplored for deepfake detection due to the misalignment between their knowledge and forensic patterns. To this end, we present a novel paradigm that unlocks VLMs' potential capabilities through three components: (1) a knowledge-guided forgery adaptation module that aligns the VLM's semantic space with forensic features through contrastive learning with external manipulation knowledge; (2) a multi-modal prompt tuning framework that jointly optimizes visual-textual embeddings for both localization and explainability; (3) an iterative refinement strategy enabling multi-turn dialog for evidence-based reasoning. Our framework includes a VLM-based Knowledge-guided Forgery Detector (KFD), a VLM image encoder, and a Large Language Model (LLM). The VLM image encoder extracts visual prompt embeddings from images, while the LLM receives visual and question prompt embeddings for inference. The KFD is used to calculate correlations between image features and pristine/deepfake class embeddings, enabling forgery classification and localization. The outputs from these components are used to construct forgery prompt embeddings. Finally, we feed these prompt embeddings into the LLM to generate textual detection responses to assist judgment. Extensive experiments on multiple benchmarks, including FF++, CDF2, DFD, DFDCP, and DFDC, demonstrate that our scheme surpasses state-of-the-art methods in generalization performance, while also supporting multi-turn dialogue capabilities.
Paperid:343
Authors:yu chen · Jing Lian · Zhaofei Yu · Jizhao Liu · Jisheng Dang · Gang Wang
Abstract:
Event cameras are bio-inspired vision sensors that encode visual information with high dynamic range, high temporal resolution, and low latency. Current state-of-the-art event stream processing methods rely on end-to-end deep learning techniques. However, these models are heavily dependent on data structures, limiting their stability and generalization capabilities across tasks, thereby hindering their deployment in real-world scenarios. To address this issue, we propose a chaotic dynamics event signal processing framework inspired by the dorsal visual pathway of the brain. Specifically, we utilize a Continuous-coupled Neural Network (CCNN) to encode the event stream. The CCNN encodes polarity-invariant event sequences as periodic signals and polarity-changing event sequences as chaotic signals. We then use continuous wavelet transforms to analyze the dynamical states of CCNN neurons and establish the high-order mappings of the event stream. The effectiveness of our method is validated through integration with conventional classification networks, achieving state-of-the-art classification accuracy on the N-Caltech101 and N-CARS datasets, with results of 84.3% and 99.9%, respectively. Our method improves the accuracy of event camera-based object classification while significantly enhancing the generalization and stability of event representation.
Paperid:344
Authors:Bing Liu · wenjun Miao · Boyu Zhang · Qiankun Zhang · Bin Yuan · Wang · Shenghao Liu · Xianjun Deng
Abstract:
Network quantization, one of the most widely studied model compression methods, effectively quantizes a floating-point model to obtain a fixed-point one with negligible accuracy loss. Although great success has been achieved in reducing model size, quantization may exacerbate unfairness in model accuracy across different groups of a dataset. This paper considers two widely used algorithms: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT), in an attempt to understand how they cause this critical issue. Theoretical analysis with empirical verification reveals two responsible factors, as well as how they influence a metric of fairness in depth. A comparison between PTQ and QAT is then made, explaining the observation that QAT behaves even worse than PTQ in fairness, although it often preserves higher accuracy at lower bit-widths. Finally, the paper finds that several simple data augmentation methods can be adopted to alleviate the disparate impacts of quantization, based on the further observation that class imbalance produces distinct values of the aforementioned factors among different attribute classes. We experiment on both imbalanced (UTK-Face and FER2013) and balanced (CIFAR-10 and MNIST) datasets using ResNet and VGG models for empirical evaluation.
Paperid:345
Authors:Zhenyu Sun · Ermin Wei
Abstract:
Unlike its vanilla counterpart with i.i.d. samples, stochastic optimization with Markovian sampling allows the sampling scheme to follow a Markov chain. This problem encompasses various applications, ranging from asynchronous distributed optimization to reinforcement learning. In this work, we lower bound the sample complexity of finding $\epsilon$-approximate critical solutions for any first-order method when sampling is Markovian. We show that for samples drawn from stationary Markov processes with countable state space, any algorithm that accesses smooth, non-convex functions through queries to a stochastic gradient oracle requires at least $\Omega(\epsilon^{-4})$ samples. Moreover, for finite Markov chains, we show an $\Omega(\epsilon^{-2})$ lower bound and propose a new algorithm, called MaC-SAGE, that is proven to (nearly) match our lower bound.
Paperid:346
Authors:Thomas Y.L. Lin · Jerry Yao-Chieh Hu · Wan-Jiun Paul Chiou · Peter Lin
Abstract:
We propose a Bayesian reformulation of the Black-Litterman model for portfolio optimization without the need for subjective investor views and corresponding uncertainties $(q, \Omega)$ from humans. Classical Black-Litterman models treat $(q, \Omega)$ as user-provided factors but offer no rigorous mechanism for deriving these views, limiting their practicality. By recasting the Black-Litterman model as a Bayesian network, we incorporate asset-related features, treat $(q, \Omega)$ as latent variables, and estimate posterior distributions over both asset returns $r$ and their parameters $\theta$ directly from data. This approach unifies feature integration and parameter inference into a single framework, mitigating error propagation and ensuring coherent estimation. We explore two modes of feature integration: (1) features sharing latent parameters with returns and (2) features influencing investor views. Importantly, our formulation subsumes the standard Black-Litterman formulas as special cases when $(q, \Omega)$ are given. Numerically, empirical tests on real-world datasets consistently outperform both the Markowitz and market-index baselines without heuristic views.
Paperid:347
Authors:Harry Zhang · Luca Carlone
Abstract:
We introduce CUPS, a novel method for learning sequence-to-sequence 3D human shapes and poses from RGB videos with uncertainty quantification. To improve on prior work, we develop a method to generate and score multiple hypotheses during training, effectively integrating uncertainty quantification into the learning process. This process results in a deep uncertainty function that is trained end-to-end with the 3D pose estimator. Post-training, the learned deep uncertainty model is used as the conformity score, which can be used to calibrate a conformal predictor in order to assess the quality of the output prediction. Since the data in human pose-shape learning is not fully exchangeable, we also present two practical bounds for the coverage gap in conformal prediction, developing theoretical backing for the uncertainty bound of our model. Our results indicate that by taking advantage of deep uncertainty with conformal prediction, our method achieves state-of-the-art performance across various metrics and datasets while inheriting the probabilistic guarantees of conformal prediction. Interactive 3D visualization, code, and data will be available at https://sites.google.com/view/champpp.
Paperid:348
Authors:Anshuman Chhabra · Bo Li · Jian Chen · Prasant Mohapatra · Hongfu Liu
Abstract:
A core data-centric learning challenge is the identification of training samples that are detrimental to model performance. Influence functions serve as a prominent tool for this task and offer a robust framework for assessing training data influence on model predictions. Despite their widespread use, the high computational cost associated with calculating the inverse of the Hessian matrix poses constraints, particularly when analyzing large deep models. In this paper, we establish a bridge between identifying detrimental training samples via influence functions and outlier gradient detection. This transformation not only presents a straightforward and Hessian-free formulation but also provides insights into the role of the gradient in sample impact. Through systematic empirical evaluations, we first validate the hypothesis of our proposed outlier gradient analysis approach on synthetic datasets. We then demonstrate its effectiveness in detecting mislabeled samples in vision models and in selecting data samples that improve the performance of natural language processing transformer models. We also extend its use to influential sample identification for fine-tuning Large Language Models.
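To make the Hessian-free idea concrete, one simple instantiation scores each sample by how much its gradient deviates from the rest of the batch. This is an illustrative proxy under an assumed z-score rule, not necessarily the paper's exact scoring function.

```python
import numpy as np

def outlier_gradient_scores(grads, k=2.0):
    """Hessian-free proxy for sample influence: flag samples whose
    per-sample gradient norm is an outlier relative to the batch.

    grads: (n_samples, dim) array of per-sample loss gradients.
    Returns z-scores of the gradient norms and a boolean outlier mask
    (z > k). Simplified sketch of the outlier-gradient idea."""
    norms = np.linalg.norm(grads, axis=1)
    mu, sigma = norms.mean(), norms.std()
    z = (norms - mu) / (sigma + 1e-12)  # guard against zero spread
    return z, z > k
```

A mislabeled example typically produces an unusually large loss gradient, so it stands out under this test without ever touching the Hessian.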
Paperid:349
Authors:Mingyi Li · Michael Metel · Akiko Takeda
Abstract:
The K-means algorithm is one of the most widely studied clustering algorithms in machine learning. While extensive research has focused on trying to achieve global optima, no algorithm has been developed with even a rigorous local optimality guarantee. In this paper, we first present conditions under which the K-means algorithm converges to a locally optimal solution. Based on this, we propose simple modifications to the K-means algorithm which ensure local optimality in both the continuous and discrete senses, with the same computational complexity as the ordinary K-means algorithm. As the dissimilarity measure, we consider a general Bregman divergence, which is an extension of the squared Euclidean distance often used in the K-means algorithm. Numerical experiments on synthetic data confirm that the ordinary K-means algorithm does not always find local minima in practice, while our proposed method provides improved locally optimal solutions with smaller objective function values.
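For reference, the ordinary (Lloyd's) K-means iteration the paper modifies looks like this, using the squared Euclidean distance (the Bregman divergence mentioned above as the usual special case). This baseline sketch only converges to a fixed point; it carries none of the local-optimality certificates the paper adds.

```python
import numpy as np

def kmeans(X, centers, iters=100):
    """Plain Lloyd's K-means with squared Euclidean distance.

    X: (n, d) data; centers: (k, d) initial centers. Alternates assignment
    and centroid updates until the centers stop moving. Baseline sketch."""
    for _ in range(iters):
        # squared distances (n, k), then nearest-center assignment
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        new = np.array([X[labels == k].mean(0) if (labels == k).any()
                        else centers[k] for k in range(len(centers))])
        if np.allclose(new, centers):  # fixed point reached
            break
        centers = new
    return centers, labels
```

A fixed point of this loop need not be a local minimum of the clustering objective, which is precisely the gap the paper's modifications close.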
Paperid:350
Authors:Ze'ev Zukerman · Bassel Hamoud · Kfir Levy
Abstract:
Distributed learning methods have gained substantial momentum in recent years, with communication overhead often emerging as a critical bottleneck. Gradient compression techniques alleviate communication costs but involve an inherent trade-off between the empirical efficiency of biased compressors and the theoretical guarantees of unbiased compressors. In this work, we introduce a novel Multilevel Monte Carlo (MLMC) compression scheme that leverages biased compressors to construct statistically unbiased estimates. This approach effectively bridges the gap between biased and unbiased methods, combining the strengths of both. To showcase the versatility of our method, we apply it to top-$k$ and bit-wise compressors, resulting in enhanced variants. Furthermore, we derive an adaptive version of our approach to further improve its performance. We validate our method empirically on distributed learning tasks.
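The core MLMC trick of turning a biased compressor into an unbiased estimator can be sketched with top-k levels and a randomized telescoping sum. The level probabilities and the choice of an exact top level are assumptions of this illustration, not necessarily the paper's scheme.

```python
import random

def topk(x, k):
    """Biased top-k compressor: keep the k largest-magnitude entries."""
    idx = sorted(range(len(x)), key=lambda i: -abs(x[i]))[:k]
    out = [0.0] * len(x)
    for i in idx:
        out[i] = x[i]
    return out

def mlmc_topk(x, ks, rng):
    """MLMC-style unbiased estimate built from biased top-k levels.

    ks: increasing sparsity levels with ks[-1] == len(x) (exact top level).
    Returns C_{ks[0]}(x) + (C_{ks[L]}(x) - C_{ks[L-1]}(x)) / p for a
    uniformly random level L; the telescoping sum over levels makes the
    estimator unbiased for x. Sketch of the idea, not the paper's scheme."""
    L = rng.randrange(1, len(ks))
    p = 1.0 / (len(ks) - 1)  # uniform level probability
    base = topk(x, ks[0])
    hi, lo = topk(x, ks[L]), topk(x, ks[L - 1])
    return [b + (h - l) / p for b, h, l in zip(base, hi, lo)]
```

Averaging the estimator over all levels telescopes to the exact top level, i.e. to x itself, which is exactly the unbiasedness property that biased top-k alone lacks.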
Paperid:351
Authors:Zhangyi Liu · Feng Liu · Rui Gao · Shuang Li
Abstract:
We study convergence properties of the discrete-time Mean-Field Langevin Stochastic Descent-Ascent (MFL-SDA) algorithm for solving distributional minimax optimization. These problems arise in various applications, such as zero-sum games, generative adversarial networks, and distributionally robust learning. Despite the significance of MFL-SDA in these contexts, its discrete-time convergence rate remains underexplored. To address this gap, we establish a last-iterate convergence rate of $O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$ for MFL-SDA. This rate is nearly optimal when compared to the complexity lower bound of its Euclidean counterpart. It also matches the complexity of mean-field Langevin stochastic gradient descent for distributional minimization and the outer-loop iteration complexity of an existing double-loop algorithm for distributional minimax problems. By leveraging an elementary analysis framework that avoids PDE-based techniques, we overcome previous limitations and achieve a faster convergence rate.
Paperid:352
Authors:Zitao Wang · Ziyuan Wang · Molei Liu · Nian Si
Abstract:
Transfer learning is a popular strategy to leverage external knowledge and improve statistical efficiency, particularly with a limited target sample. We propose a novel knowledge-guided Wasserstein Distributionally Robust Optimization (KG-WDRO) framework that incorporates multiple sources of external knowledge to overcome the conservativeness of traditional WDRO, which often results in overly pessimistic shrinkage toward zero. Our method constructs smaller Wasserstein ambiguity sets by controlling the transportation along directions informed by the source knowledge. This strategy can alleviate perturbations on the predictive projection of the covariates and protect against information loss. Theoretically, we establish the equivalence between our WDRO formulation and the knowledge-guided shrinkage estimation based on collinear similarity, ensuring tractability and geometrizing the feasible set. This also reveals a novel and general interpretation for recent shrinkage-based transfer learning approaches from the perspective of distributional robustness. In addition, our framework can adjust for scaling differences in the regression models between the source and target and accommodates general types of regularization such as lasso and ridge. Extensive simulations demonstrate the superior performance and adaptivity of KG-WDRO in enhancing small-sample transfer learning.
Paperid:353
Authors:Min Zhao · Guande He · Yixiao Chen · Hongzhou Zhu · Chongxuan Li · Jun Zhu
Abstract:
Recent advancements in video generation have enabled models to synthesize high-quality, minute-long videos. However, generating even longer videos with temporal coherence remains a major challenge, and existing length extrapolation methods lead to \textit{temporal repetition} or \textit{motion deceleration}. In this work, we systematically analyze the role of frequency components in positional embeddings and identify an \textit{intrinsic frequency} that primarily governs extrapolation behavior. Based on this insight, we propose \textbf{RIFLEx}, a \textit{minimal yet effective} approach that reduces the intrinsic frequency to suppress repetition while preserving motion consistency, without requiring any additional modifications. RIFLEx offers a true \textit{free lunch}, achieving high-quality $2\times$ extrapolation on state-of-the-art video diffusion transformers in a completely \textit{training-free} manner. Moreover, it enhances quality and enables $3\times$ extrapolation by minimal fine-tuning without long videos.
Paperid:354
Authors:Hui Dai · Ryan Teehan · Mengye Ren
Abstract:
Many existing evaluation benchmarks for Large Language Models (LLMs) quickly become outdated due to the emergence of new models and training data. These benchmarks also fall short in assessing how LLM performance changes over time, as they consist of a static set of questions without a temporal dimension. To address these limitations, we propose using future event prediction as a continuous evaluation method to assess LLMs' temporal generalization and forecasting abilities. Our benchmark, Daily Oracle, automatically generates question-answer (QA) pairs from daily news, challenging LLMs to predict "future" event outcomes. Our findings reveal that as pre-training data becomes outdated, LLM performance degrades over time. While Retrieval Augmented Generation (RAG) has the potential to enhance prediction accuracy, the performance degradation pattern persists, highlighting the need for continuous model updates.
Paperid:355
Authors:Andrea Testa · Søren Hauberg · Tamim Asfour · Leonel Rozo
Abstract:
Accurately modeling and predicting complex dynamical systems, particularly those involving contact and interaction, is crucial for applications ranging from fluid dynamics to robotics, but presents significant challenges due to the intricate interplay of geometric constraints and energy exchange. This paper introduces Geometric Contact Flows (GCF), a novel framework leveraging Riemannian and contact geometry as inductive biases to learn such systems. GCF constructs a latent contact Hamiltonian model encoding desirable properties like stability or energy conservation. An ensemble of contactomorphisms then adapts this model to the target dynamics, preserving these properties while enabling robust generalization and adaptation to unseen scenarios. This ensemble allows for uncertainty-aware geodesics that attract the system's behavior toward the data support. Experiments on handwriting dynamics tasks and a robotic interaction task demonstrate the effectiveness of our approach.
Paperid:356
Authors:Daoyu Wang · Mingyue Cheng · Zhiding Liu · Qi Liu
Abstract:
Self-supervised learning has garnered increasing attention in time series analysis for benefiting various downstream tasks and reducing reliance on labeled data. Despite its effectiveness, existing methods often struggle to comprehensively capture both long-term dynamic evolution and subtle local patterns in a unified manner. In this work, we propose \textbf{TimeDART}, a novel self-supervised time series pre-training framework that unifies two powerful generative paradigms to learn more transferable representations. Specifically, we first employ a \textit{causal} Transformer encoder, accompanied by a patch-based embedding strategy, to model the evolving trends from left to right. Building on this global modeling, we further introduce a denoising diffusion process to capture fine-grained local patterns through forward diffusion and reverse denoising. Finally, we optimize the model in an autoregressive manner. As a result, TimeDART effectively accounts for both global and local sequence features in a coherent way. We conduct extensive experiments on public datasets for time series forecasting and classification. The experimental results demonstrate that TimeDART consistently outperforms previous compared methods, validating the effectiveness of our approach. Our code is available at \url{https://anonymous.4open.science/r/TimeDART-2025/}.
Paperid:357
Authors:Yunzhuo Hao · Jiawei Gu · Huichen Wang · Linjie Li · Zhengyuan Yang · Lijuan Wang · Yu Cheng
Abstract:
The ability to organically reason over and with both text and images is a pillar of human intelligence, yet the ability of Multimodal Large Language Models (MLLMs) to perform such multimodal reasoning remains underexplored. Existing benchmarks often emphasize text-dominant reasoning or rely on shallow visual cues, failing to adequately assess integrated visual and textual reasoning. We introduce EMMA (Enhanced MultiModal reAsoning), a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding. EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality, offering an enhanced test suite for MLLMs' reasoning capabilities. Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks; even advanced techniques like Chain-of-Thought prompting and test-time compute scaling underperform. These findings underscore the need for improved multimodal architectures and training paradigms to close the gap between human and model reasoning in multimodality.
Paperid:358
Authors:Jiaqi Leng · Bin Shi
Abstract:
With rapid advancements in machine learning, first-order algorithms have emerged as the backbone of modern optimization techniques, owing to their computational efficiency and low memory requirements. Recently, the connection between accelerated gradient methods and damped heavy-ball motion, particularly within the framework of Hamiltonian dynamics, has inspired the development of innovative quantum algorithms for continuous optimization. One such algorithm, Quantum Hamiltonian Descent (QHD), leverages quantum tunneling to escape saddle points and local minima, facilitating the discovery of global solutions in complex optimization landscapes. However, QHD faces several challenges, including slower convergence rates compared to classical gradient methods and limited robustness in highly non-convex problems due to the non-local nature of quantum states. Furthermore, the original QHD formulation primarily relies on function value information, which limits its effectiveness. Inspired by insights from high-resolution differential equations that have elucidated the acceleration mechanisms in classical methods, we propose an enhancement to QHD by incorporating gradient information, leading to what we call gradient-based QHD. This gradient-based QHD achieves faster convergence and significantly increases the likelihood of identifying global solutions. Numerical simulations on challenging problem instances demonstrate that this gradient-based QHD outperforms existing quantum and classical methods by at least an order of magnitude.
Paperid:359
Authors:Yuqin Cao · Xiongkuo Min · Yixuan Gao · Wei Sun · Guangtao Zhai
Abstract:
Many video-to-audio (VTA) methods have been proposed for dubbing silent AI-generated videos. An efficient quality assessment method for AI-generated audio-visual content (AGAV) is crucial for ensuring audio-visual quality. Existing audio-visual quality assessment methods struggle with unique distortions in AGAVs, such as unrealistic and inconsistent elements. To address this, we introduce AGAVQA, the first large-scale AGAV quality assessment dataset, comprising 3,382 AGAVs from 16 VTA methods. AGAVQA includes two subsets: AGAVQA-MOS, which provides multi-dimensional scores for audio quality, content consistency, and overall quality, and AGAVQA-Pair, designed for optimal AGAV pair selection. We further propose AGAV-Rater, an LMM-based model that can score AGAVs, as well as audio and music generated from text, across multiple dimensions, and select the best AGAV generated by VTA methods to present to the user. AGAV-Rater achieves state-of-the-art performance on AGAVQA, Text-to-Audio, and Text-to-Music datasets. Subjective tests also confirm that AGAV-Rater enhances VTA performance and user experience.
Paperid:360
Authors:Yudong Hu · Haoran Li · Congying Han · Tiande Guo · Bonan Li · Mingqiang Li
Abstract:
Solving the Nash equilibrium in normal-form games with large-scale strategy spaces presents significant challenges. Open-ended learning frameworks, such as PSRO and its variants, have emerged as effective solutions. However, these methods often lack an efficient metric for evaluating strategy improvement, which limits their effectiveness in approximating equilibria. In this paper, we introduce a novel evaluative metric called Advantage, which possesses desirable properties inherently connected to the Nash equilibrium, ensuring that each strategy update approaches equilibrium. Building upon this, we propose the Advantage Policy Space Response Oracle (A-PSRO), an innovative unified open-ended learning framework applicable to both zero-sum and general-sum games. A-PSRO leverages the Advantage as a refined evaluation metric, leading to a consistent learning objective for agents in normal-form games. Experiments showcase that A-PSRO significantly reduces exploitability in zero-sum games and improves rewards in general-sum games, outperforming existing algorithms and validating its practical effectiveness.
Paperid:361
Authors:Honghua Zhang · Meihua Dang · Benjie Wang · Stefano Ermon · Nanyun Peng · Guy Van den Broeck
Abstract:
Probabilistic Circuits (PCs) are tractable representations of probability distributions allowing for exact and efficient computation of likelihoods and marginals. Recent advancements have improved the scalability of PCs by leveraging their sparse properties or tensorized operations. However, no existing method fully exploits both aspects simultaneously. In this paper, we propose a novel sparse and structured parameterization for the sum blocks in PCs. By replacing dense matrices with sparse Monarch matrices, we significantly reduce the memory and computation costs, scaling PCs to the next level. From a theoretical perspective, our construction arises naturally from circuit multiplication; from a practical perspective, compared to previous efforts on scaling up tractable probabilistic models, our approach not only achieves state-of-the-art generative modeling performance on challenging benchmarks like Text8, LM1B and ImageNet, but also demonstrates superior scaling behavior, achieving the same performance with substantially less compute as measured by the number of floating-point operations (FLOPs) during training.
Paperid:362
Authors:Matthew Faw · Rajat Sen · Yichen Zhou · Abhimanyu Das
Abstract:
Motivated by the recent success of time-series foundation models for zero-shot forecasting, we present a methodology for in-context fine-tuning of a time-series foundation model. In particular, we design a pretrained foundation model that can be prompted (at inference time) with multiple time-series examples, in order to forecast a target time-series into the future. Our foundation model is specifically trained to utilize examples from multiple related time-series in its context window (in addition to the history of the target time-series) to help it adapt to the specific distribution of the target domain at inference time. We show that such a foundation model that uses in-context examples at inference time can obtain much better performance on popular forecasting benchmarks compared to supervised deep learning methods, statistical models, and other time-series foundation models. Interestingly, our in-context fine-tuning approach even matches the performance of a foundation model that is explicitly fine-tuned on the target domain.
Paperid:363
Authors:Suchith Prabhu · Bhavyajeet Singh · Anshul Mittal · Siddarth Asokan · Shikhar Mohan · Deepak Saini · Yashoteja Prabhu · Lakshya Kumar · Jian Jiao · Amit Singh · Niket Tandon · Manish Gupta · Sumeet Agarwal · Manik Varma
Abstract:
Retrieval-augmented classification and generation models significantly benefit from the early-stage fusion of high-quality text-based auxiliary metadata, often called memory, but they suffer from high inference latency and poor robustness to noise. In classification tasks, particularly the extreme classification (XC) setting, where low latency is critical, existing methods incorporate metadata for context enrichment via an XC-based retriever and obtain the representations of the relevant memory items to perform late-stage fusion, achieving low latency and improved resilience to noisy metadata. Aiming for higher accuracy while meeting low latency constraints, in this paper we propose MOGIC, an approach for metadata-infused Oracle guidance for XC tasks. In particular, we train an early-fusion Oracle classifier with access to both query-side and label-side ground-truth metadata in textual form. The Oracle is subsequently used to guide the training of any existing memory-based XC disciple model via regularization. The MOGIC algorithm, when applied to memory-based XC disciple models such as OAK, improves precision@1 and propensity-scored precision@1 by ∼2% on six standard datasets, at no additional inference-time cost to the disciple model. We also show the feasibility of applying the MOGIC algorithm to improve the performance of state-of-the-art memory-free XC approaches such as NGAME or DEXA, demonstrating that the MOGIC algorithm can be used atop any existing XC-based approach in a plug-and-play manner. Finally, we also show the robustness of the MOGIC method to missing and noisy metadata settings. We will release code on acceptance.
Paperid:364
Authors:Shiyu Wu · Jing Liu · Jing Li · Yequan Wang
Abstract:
Current fake image detectors trained on large synthetic image datasets perform satisfactorily on limited studied generative models. However, these detectors suffer a notable performance decline over unseen models. Besides, collecting adequate training data from online generative models is often expensive or infeasible. To overcome these issues, we propose Few-Shot Detector (FSD), a novel AI-generated image detector which learns a specialized metric space to effectively distinguish unseen fake images by utilizing very few samples. Experiments show that FSD achieves state-of-the-art performance by $+7.4\%$ average ACC on the GenImage dataset. More importantly, our method is better capable of capturing the intra-category common features in unseen images without further training.
Paperid:365
Authors:Yingying Feng · Jie Li · Chi Xie · Lei Tan · Jiayi Ji
Abstract:
We present MFRNet, a novel network for multimodal object re-identification that integrates multi-spectral data features to effectively retrieve specific people across different modalities. Current methods suffer from two principal limitations: (1) insufficient interaction between pixel-level semantic features across modalities, and (2) difficulty in balancing modality-shared and modality-specific features within a unified architecture. To address these challenges, our network introduces two core components. First, the Feature Fusion Module (FFM) enables fine-grained pixel-level feature generation and flexible cross-modal interaction. Second, the Feature Representation Module (FRM) efficiently extracts and combines modality-specific and modality-shared features, achieving strong discriminative ability with minimal parameter overhead. Extensive experiments on three challenging public datasets (RGBNT201, RGBNT100, and MSVR310) demonstrate the superiority of our approach in terms of both accuracy and efficiency, with improvements of 8.4% mAP and 6.9% accuracy on RGBNT201 at nearly no extra computation.
Paperid:366
Authors:Arhit Chakrabarti · Yang Ni · Debdeep Pati · Bani Mallick
Abstract:
We consider the problem of clustering grouped data for which the observations may include group-specific variables in addition to the variables that are shared across groups. This type of data is quite common; for example, in cancer genomic studies, molecular information is available for all cancers whereas cancer-specific clinical information may only be available for certain cancers. Existing grouped clustering methods only consider the shared variables but ignore valuable information from the group-specific variables. To allow for these group-specific variables to aid in the clustering, we propose a novel Bayesian nonparametric approach, termed global-local (GLocal) Dirichlet process, that models the ``global-local" structure of the observations across groups. We characterize the GLocal Dirichlet process using the stick-breaking representation and the representation as a limit of a finite mixture model. We theoretically quantify the approximation errors of the truncated prior, the corresponding finite mixture model, and the associated posterior distribution. We develop a fast variational Bayes algorithm for scalable posterior inference, which we illustrate with extensive simulations and a TCGA pan-gastrointestinal cancer dataset.
Paperid:367
Authors:Jason Chan · Robert Gaizauskas · Zhixue Zhao
Abstract:
Formal logic enables computers to reason in natural language by representing sentences in symbolic forms and applying rules to derive conclusions. However, in what our study characterizes as "rulebreaker" scenarios, this method can lead to conclusions that are typically not inferred or accepted by humans given their common sense and factual knowledge. Inspired by works in cognitive science, we create RULEBREAKERS, the first dataset for rigorously evaluating the ability of large language models (LLMs) to recognize and respond to rulebreakers (versus non-rulebreakers) in a human-like manner. Evaluating seven LLMs, we find that most models, including GPT-4o, achieve mediocre accuracy on RULEBREAKERS and exhibit some tendency to over-rigidly apply logical rules, unlike what is expected from typical human reasoners. Further analysis suggests that this apparent failure is potentially associated with the models' poor utilization of their world knowledge and their attention distribution patterns. Whilst revealing a limitation of current LLMs, our study also provides a timely counterbalance to a growing body of recent works that propose methods relying on formal logic to improve LLMs' general reasoning capabilities, highlighting their risk of further increasing divergence between LLMs and human-like reasoning.
Paperid:368
Authors:Kaizhen Zhu · Mokai Pan · Yuexin Ma · Yanwei Fu · Jingyi Yu · Jingya Wang · Ye Shi
Abstract:
Recent advances in diffusion bridge models leverage Doob’s $h$-transform to establish fixed endpoints between distributions, demonstrating promising results in image translation and restoration tasks. However, these approaches frequently produce blurred or excessively smoothed image details and lack a comprehensive theoretical foundation to explain these shortcomings. To address these limitations, we propose UniDB, a unified framework for diffusion bridges based on Stochastic Optimal Control (SOC). UniDB formulates the problem through an SOC-based optimization and derives a closed-form solution for the optimal controller, thereby unifying and generalizing existing diffusion bridge models. We demonstrate that existing diffusion bridges employing Doob’s $h$-transform constitute a special case of our framework, emerging when the terminal penalty coefficient in the SOC cost function tends to infinity. By incorporating a tunable terminal penalty coefficient, UniDB achieves an optimal balance between control costs and terminal penalties, substantially improving detail preservation and output quality. Notably, UniDB seamlessly integrates with existing diffusion bridge models, requiring only minimal code modifications. Extensive experiments across diverse image restoration tasks validate the superiority and adaptability of the proposed framework. Our code is available at https://github.com/UniDB-SOC/UniDB/.
Paperid:369
Authors:Jiayu Liu · Zhenya Huang · Chaokun Wang · Xunpeng Huang · Chengxiang Zhai · Enhong Chen
Abstract:
Owing to the capability of in-context learning, large language models (LLMs) have shown impressive performance across diverse mathematical reasoning benchmarks. However, we find that few-shot demonstrations can sometimes bring negative performance and their effectiveness on LLMs' reasoning abilities remains unreliable. To this end, in this paper, we aim to theoretically analyze the impact of in-context demonstrations on LLMs' reasoning performance. We prove that the reasoning efficacy (measured by empirical prediction loss) can be bounded by a \emph{LLM-oriented semantic similarity} and an \emph{inference stability of demonstrations}, which is general for both one-shot and few-shot scenarios. Based on this finding, we propose a straightforward, generalizable, and low-complexity demonstration selection method named LMS3. It facilitates selecting the most pertinent samples for different LLMs and includes a novel demonstration rejection mechanism to automatically filter out samples that are unsuitable for few-shot learning. Through experiments on three representative benchmarks, two LLM backbones, and multiple few-shot settings, we verify that our LMS3 has superiority and achieves consistent improvements on all datasets, which existing methods have been unable to accomplish.
Paperid:370
Authors:Gramoz Goranci · Peter Kiss · Neel Patel · Martin Seybold · Eva Szilagyi · Da Wei Zheng
Abstract:
We consider the Euclidean bichromatic matching problem in the dynamic setting, where the goal is to efficiently process point insertions and deletions while maintaining a high-quality solution. Computing the minimum-cost bichromatic matching is one of the core problems in geometric optimization and has found many applications, most notably in estimating the Wasserstein distance between two distributions. In this work, we present the first fully dynamic algorithm for Euclidean bichromatic matching with sublinear update time. For any fixed $\varepsilon > 0$, our algorithm achieves $O(1/\varepsilon)$-approximation and handles updates in $O(n^{\varepsilon})$ time. Our experiments show that our algorithm enables effective monitoring of the distributional drift in the Wasserstein distance on real and synthetic data sets, while outperforming the runtime of baseline approximations by orders of magnitude.
Paperid:371
Authors:Wenxu Wang · Rui Zhou · Jing Wang · Bo Han · Yun Zhou · Cheng Zhu · Ruichun Tang · Nevin Zhang
Abstract:
Open-Set Domain Adaptation (OSDA) aims to transfer knowledge from the labeled source domain to the unlabeled target domain that contains unknown categories, thus facing the challenges of domain shift and unknown category recognition. While recent works have demonstrated the potential of causality for domain alignment, little exploration has been conducted on causal-inspired theoretical frameworks for OSDA. To fill this gap, we introduce the concept of Susceptibility and propose a novel Counterfactual-based susceptibility risk framework for OSDA, termed COSDA. Specifically, COSDA consists of three novel components: (i) a Susceptibility Risk Estimator (SRE) for capturing causal information, along with comprehensive derivations of the computable theoretical upper bound, forming a risk minimization framework under the OSDA paradigm; (ii) a Contrastive Feature Alignment (CFA) module, which is theoretically proven based on mutual information to satisfy the Exogeneity assumption and facilitate cross-domain feature alignment; (iii) a Virtual Multi-unknown-categories Prototype (VMP) pseudo-labeling strategy, providing label information by measuring how similar samples are to known and multiple virtual unknown category prototypes, thereby assisting in open-set recognition and intra-class discriminative feature learning. Extensive experiments on both benchmark and synthetic datasets demonstrate that our approach achieves state-of-the-art performance.
Paperid:372
Authors:Runyu Lu · Yuanheng Zhu · Dongbin Zhao
Abstract:
This paper proposes Constrained Exploitability Descent (CED), a model-free offline reinforcement learning (RL) algorithm for solving adversarial Markov games. CED combines the game-theoretical approach of Exploitability Descent (ED) with policy constraint methods from offline RL. While policy constraints can perturb the optimal pure-strategy solutions in single-agent scenarios, we find the side effect less detrimental in adversarial games, where the optimal policy can be a mixed-strategy Nash equilibrium. We theoretically prove that, under the uniform coverage assumption on the dataset, CED converges to a stationary point in deterministic two-player zero-sum Markov games (MGs). We further prove that the min-player policy at the stationary point follows the property of mixed-strategy Nash equilibrium in MGs. Compared to the model-based ED method that optimizes the max-player policy, our CED method no longer relies on a generalized gradient. Experiments in matrix games, a tree-form game, and an infinite-horizon soccer game verify that CED can find an equilibrium policy for the min-player as long as the offline dataset guarantees uniform coverage. Besides, CED achieves a significantly lower NashConv compared to an existing pessimism-based method and can gradually improve the behavior policy even under non-uniform data coverages.
Paperid:373
Authors:Xiaojie Wang · Bin Yang
Abstract:
Generating samples from a high-dimensional probability distribution is a fundamental task with wide-ranging applications in scientific computing, statistics and machine learning. This article revisits the popular Langevin Monte Carlo (LMC) sampling algorithms and provides an optimal non-asymptotic error analysis in $\mathcal{W}_2$-distance in a non-convex setting. In particular, we prove an optimal error bound $O(\sqrt{d} h)$, which guarantees a mixing time $\tilde{O}(\sqrt{d} \epsilon^{-1})$ to achieve the accuracy tolerance $\epsilon$, under certain log-smooth conditions and the assumption that the target distribution satisfies a log-Sobolev inequality, as opposed to the strongly log-concave condition used in (Li et al., 2019; 2022). This bound matches the best one in the strongly log-concave case and improves upon the best-known convergence rates in non-convex settings. To prove it, we establish a new framework of uniform-in-time convergence for discretizations of SDEs. Distinct from (Li et al., 2019; 2022), we start from the finite-time mean-square fundamental convergence theorem, which, combined with uniform-in-time moment bounds of LMC and the exponential ergodicity of SDEs in the non-convex setting, gives the desired uniform-in-time convergence. Our framework also applies to the case when the gradient of the potential $U$ is non-globally Lipschitz with superlinear growth, for which modified LMC samplers are proposed and analyzed, with a non-asymptotic error bound in $\mathcal{W}_2$-distance obtained. Numerical experiments corroborate the theoretical analysis.
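As background for the algorithm analyzed in this abstract, the basic unadjusted LMC iteration $x_{k+1} = x_k - h \nabla U(x_k) + \sqrt{2h}\,\xi_k$ can be sketched as follows. This is a generic textbook implementation, not the modified samplers proposed in the paper; the standard-Gaussian target and step size are illustrative assumptions.

```python
import numpy as np

def langevin_monte_carlo(grad_U, x0, h, n_steps, rng):
    """Unadjusted Langevin iterates: x_{k+1} = x_k - h*grad_U(x_k) + sqrt(2h)*xi_k."""
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        x = x - h * grad_U(x) + np.sqrt(2.0 * h) * rng.standard_normal(x.shape)
        samples.append(x.copy())
    return np.array(samples)

# Illustrative target: standard Gaussian in d=2, where U(x) = ||x||^2/2 and grad U(x) = x.
rng = np.random.default_rng(0)
samples = langevin_monte_carlo(lambda x: x, np.zeros(2), h=0.05, n_steps=20000, rng=rng)
tail = samples[5000:]   # discard burn-in; mean should be near 0, marginal variance near 1
```

For this log-concave toy target the chain mixes quickly; the non-convex and superlinear-growth regimes studied in the paper are precisely where this plain scheme needs the modifications the authors analyze.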
Paperid:374
Authors:Tobias Braun · Mark Rothermel · Marcus Rohrbach · Anna Rohrbach
Abstract:
The proliferation of disinformation demands reliable and scalable fact-checking solutions. We present Dynamic Evidence-based FAct-checking with Multimodal Experts (DEFAME), a modular, zero-shot MLLM pipeline for open-domain, text-image claim verification. DEFAME operates in a six-stage process, dynamically selecting the tools and search depth to extract and evaluate textual and visual evidence. Unlike prior approaches that are text-only, lack explainability, or rely solely on parametric knowledge, DEFAME performs end-to-end verification, accounting for images in claims and evidence while generating structured, multimodal reports. Evaluation on the popular benchmarks VERITE, AVerITeC, and MOCHEG shows that DEFAME surpasses all previous methods, establishing itself as the new state-of-the-art system for uni- and multimodal fact-checking. Moreover, we introduce a new benchmark, CLAIMREVIEW24+, featuring claims after the knowledge cutoff of GPT-4o to avoid data leakage. Here, DEFAME drastically outperforms the GPT Chain-of-Thought baseline, demonstrating temporal generalizability and the potential for real-time fact-checking.
Paperid:375
Authors:Amber Yijia Zheng · Cedar Site Bai · Brian Bullins · Raymond A. Yeh
Abstract:
Model immunization aims to pretrain models that are difficult to fine-tune on harmful tasks while retaining their utility on other non-harmful tasks. Though prior work has shown empirical evidence for immunizing text-to-image models, the key understanding of when immunization is possible and a precise definition of an immunized model remains unclear. In this work, we propose a framework, based on the condition number of a Hessian matrix, to analyze model immunization for linear models. Building on this framework, we design an algorithm with regularization terms to control the resulting condition numbers after pre-training. Empirical results on linear models and non-linear deep-nets demonstrate the effectiveness of the proposed algorithm on model immunization.
Paperid:376
Authors:Wenbin Wang · Yu Shi · Ziping Zhao
Abstract:
The emergence of multi-dimensional data presents significant challenges for traditional regression models based on matrices or vectors, particularly in capturing multi-directional correlations. In response, tensor regression has been proposed as a powerful framework for modeling linear relationships among multi-dimensional variables. In this paper, we introduce a high-dimensional tensor-response tensor regression model under low-dimensional structural assumptions, such as sparsity and low-rankness. Assuming the underlying tensor lies within an unknown low-dimensional subspace, we consider a least squares estimation framework with non-convex penalties. Theoretically, we derive general risk bounds for the resulting estimators and demonstrate that they achieve the oracle statistical rates under mild technical conditions. To compute the proposed estimators efficiently, we introduce an accelerated proximal gradient algorithm demonstrating rapid convergence in practice. Extensive experiments on synthetic and real-world datasets validate the effectiveness of the proposed regression model and showcase the practical utility of the theoretical findings.
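The proximal gradient machinery this abstract relies on can be illustrated on the simplest penalized least squares problem. The sketch below is a minimal non-accelerated proximal gradient (ISTA) loop for a vectorized sparse regression, not the paper's accelerated tensor-response estimator; the problem sizes and $\ell_1$ penalty are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1: shrink each entry toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_grad_lasso(X, y, lam, step, n_iters):
    """Minimize 0.5*||X b - y||^2 + lam*||b||_1 by proximal gradient:
    a gradient step on the smooth loss, then the soft-thresholding prox."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ b - y)
        b = soft_threshold(b - step * grad, step * lam)
    return b

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 10))
b_true = np.zeros(10)
b_true[:3] = [2.0, -1.5, 1.0]                  # sparse ground truth
y = X @ b_true + 0.01 * rng.standard_normal(200)
step = 1.0 / np.linalg.norm(X, 2) ** 2          # 1/L, L = largest squared singular value
b_hat = prox_grad_lasso(X, y, lam=0.5, step=step, n_iters=500)
# b_hat recovers the sparse support, with the remaining coefficients driven to zero
```

An accelerated variant, as in the paper, would add a momentum/extrapolation step between iterations; the prox step itself is unchanged.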
Paperid:377
Authors:Tianjian Li · Daniel Khashabi
Abstract:
Aligning language models with human preferences relies on pairwise preference datasets. While some studies suggest that on-policy data consistently outperforms off-policy data for preference learning, others indicate that the advantages of on-policy data are task-dependent, highlighting the need for a systematic exploration of their interplay. In this work, we show that on-policy and off-policy data offer complementary strengths: on-policy data is particularly effective for reasoning tasks like math and coding, while off-policy data performs better on subjective tasks such as creative writing and making personal recommendations. Guided by these findings, we introduce SimpleMix, an approach that combines the complementary strengths of on-policy and off-policy preference learning by simply mixing the two data sources. Our empirical results across diverse tasks and benchmarks demonstrate that SimpleMix substantially improves language model alignment. Specifically, SimpleMix improves upon on-policy DPO and off-policy DPO by an average of 6.03 on Alpaca Eval 2.0. Moreover, it surpasses prior, much more complex approaches to combining on- and off-policy data, such as HyPO and DPO-Mix-P, by an average of 3.05. These findings validate the effectiveness and efficiency of SimpleMix for enhancing preference-based alignment.
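The mixing step itself is simple; a minimal sketch (the 50/50 ratio and the fixed total size are our assumptions for illustration, not values from the paper) might look like:

```python
import random

def simple_mix(on_policy, off_policy, on_ratio=0.5, seed=0):
    # draw a fraction of preference pairs from each source and shuffle
    # them into one training set; `on_ratio` is a hypothetical knob
    rng = random.Random(seed)
    n_on = int(len(on_policy) * on_ratio)
    n_off = min(len(off_policy), len(on_policy) - n_on)  # keep total size comparable
    mixed = rng.sample(on_policy, n_on) + rng.sample(off_policy, n_off)
    rng.shuffle(mixed)
    return mixed
```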
Paperid:378
Authors:Dong Li · Yidi Liu · Xueyang Fu · Jie Huang · Senyan Xu · Qi Zhu · Zheng-Jun Zha
Abstract:
Image deraining aims to remove rain streaks from rainy images and restore clear backgrounds. Recent research employing the Fourier transform has proved effective for image deraining, since the transform acts as an effective frequency prior for capturing rain streaks. However, although low and high frequencies in images are interdependent, these Fourier-based methods rarely exploit the correlation between different frequencies to couple their learning procedures, limiting the full utilization of frequency information for image deraining. Meanwhile, the recently emerged Mamba technique has demonstrated its effectiveness and efficiency for modeling correlations in various domains (e.g., spatial, temporal), and we argue that introducing Mamba into the as-yet unexplored Fourier space to correlate different frequencies would help improve image deraining. This motivates us to propose a new framework, termed FourierMamba, which performs image deraining with Mamba in the Fourier space. Owing to the unique arrangement of frequency orders in Fourier space, the core of FourierMamba lies in the scanning encoding of different frequencies, where the low-to-high frequency ordering differs between the spatial dimension (unarranged along the axis) and the channel dimension (arranged along the axis). Therefore, we design FourierMamba to correlate Fourier-space information in the spatial and channel dimensions with distinct designs. Specifically, in the spatial-dimension Fourier space, we introduce zigzag coding to scan the frequencies and rearrange them from low to high, thereby correlating the connections between frequencies in order; in the channel-dimension Fourier space, where frequencies are already arranged along the axis, we can directly use Mamba to perform frequency correlation and improve the channel information representation. Extensive experiments reveal that our method outperforms state-of-the-art methods both qualitatively and quantitatively.
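A zigzag scan over a 2D frequency grid, as used in the spatial dimension, can be sketched as follows (a generic zigzag for illustration; we assume the spectrum has been shifted so low frequencies sit near index (0, 0)):

```python
def zigzag_indices(h, w):
    # walk the anti-diagonals of an (h, w) grid, alternating direction,
    # so that coefficients near the origin (low frequencies) come first
    order = []
    for s in range(h + w - 1):
        diag = [(i, s - i) for i in range(h) if 0 <= s - i < w]
        order.extend(diag if s % 2 == 0 else diag[::-1])
    return order
```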
Paperid:379
Authors:Rohan Shenoy · Evan Coleman · Hans Gaensbauer · Elsa Olivetti
Abstract:
Quantifying the elemental composition of a material is a general scientific challenge with broad relevance to environmental sustainability. Existing techniques for the measurement of atomic abundances generally require laboratory conditions and expensive equipment. As a result, they cannot be deployed in situ without significant capital investment, limiting their proliferation. Measurement techniques based on nuclear magnetic resonance (NMR) hold promise in this setting due to their applicability across the periodic table, their nondestructive manipulation of samples, and their amenability to in silico optimization. In this work, we learn policies to modulate NMR pulses for rapid atomic abundance quantification. Our approach involves three inter-operating agents which (1) rapidly align nuclear spins for measurement, (2) quickly force relaxation to equilibrium, and (3) toggle control between agents (1) and (2) to minimize overall measurement time. To demonstrate this technique, we consider a specific use case of low-magnetic-field carbon-13 quantification for low-cost, portable analysis of foodstuffs and soils. We find significant performance improvements relative to traditional NMR pulse sequencing, and discuss limitations on the applicability of this approach.
Paperid:380
Authors:Jin Zhu · Jingyi Li · Hongyi Zhou · Yinan Lin · Chengchun Shi · Zhenhua Lin
Abstract:
This paper focuses on the design of spatial experiments to optimize the amount of information derived from the experimental data and enhance the accuracy of the resulting causal effect estimator. We propose a surrogate function for the mean squared error (MSE) of the estimator, which facilitates the use of classical graph cut algorithms to learn the optimal design. Our proposal offers three key advances: (1) it accommodates moderate to large spatial interference effects; (2) it adapts to different spatial covariance functions; (3) it is computationally efficient. Theoretical results and numerical experiments, based on synthetic environments and a dispatch simulator that models a city-scale ridesharing market, further validate the effectiveness of our design.
Paperid:381
Authors:Yuxuan Sun · Ruikang Liu · Haoli Bai · Han Bao · Kang Zhao · Yuening Li · JiaxinHu · Xianzhi Yu · Lu Hou · Chun Yuan · Xin Jiang · Wulong Liu · Jun Yao
Abstract:
Recently, quantization has been widely used for the compression and acceleration of large language models (LLMs). Due to the outliers in LLMs, it is crucial to flatten weights and activations to minimize quantization error with equally spaced quantization points. Prior research explores various pre-quantization transformations to suppress outliers, such as per-channel scaling and Hadamard transformation. However, we observe that these transformed weights and activations can still exhibit steep and dispersed distributions. In this paper, we propose FlatQuant (Fast and Learnable Affine Transformation), a new post-training quantization approach that enhances the flatness of weights and activations. Our approach identifies optimal affine transformations tailored to each linear layer, calibrated in hours via a lightweight objective. To reduce runtime overhead, we apply Kronecker decomposition to the transformation matrices, and fuse all operations in FlatQuant into a single kernel. Extensive experiments demonstrate that FlatQuant establishes a new state-of-the-art benchmark for quantization. For example, the proposed method achieves less than 1\% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5\%. Additionally, it provides up to 2.3x prefill speedup and 1.7x decoding speedup compared to the FP16 model.
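The runtime benefit of a Kronecker-factored transform comes from never materializing the full matrix. A sketch of the standard identity (A ⊗ B) vec(X) = vec(A X B^T), in row-major convention (our illustration of the general trick, not FlatQuant's fused kernel):

```python
import numpy as np

def kron_apply(A, B, x):
    # apply (A kron B) @ x without materializing the Kronecker product;
    # cost drops from O((mp)(nq)) to two small matrix multiplies
    m, n = A.shape
    p, q = B.shape
    X = x.reshape(n, q)            # inverse of row-major vec
    return (A @ X @ B.T).reshape(m * p)
```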
Paperid:382
Authors:Bilgehan Sel · Lifu Huang · Naren Ramakrishnan · Ruoxi Jia · Ming Jin
Abstract:
Large language models (LLMs) are making inroads into classical AI problems such as automated planning, yet key shortcomings continue to hamper their integration. Chain-of-Thought (CoT) struggles in complex multi-step reasoning, and Tree-of-Thoughts requires multiple queries that increase computational overhead. Recently, Algorithm-of-Thoughts (AoT) has shown promise using in-context examples, at the cost of significantly longer solutions compared to CoT. Aimed at bridging the solution length gap between CoT and AoT, this paper introduces AoT-O3, which combines supervised fine-tuning on AoT-style plans with a reinforcement learning (RL) framework designed to reduce solution length. The RL component uses a reward model that favors concise, valid solutions while maintaining planning accuracy. Empirical evaluations indicate that AoT-O3 shortens solution length by up to 80\% compared to baseline AoT while maintaining or surpassing prior performance. These findings suggest a promising pathway for more efficient, scalable LLM-based planning.
Paperid:383
Authors:Hao Li · Yu-Hao Huang · Chang Xu · Viktor Schlegel · Renhe Jiang · Riza Batista-Navarro · Goran Nenadic · Jiang Bian
Abstract:
Time-series Generation (TSG) is a prominent research area with broad applications in simulations, data augmentation, and counterfactual analysis. While existing methods have shown promise in unconditional single-domain TSG, real-world applications demand cross-domain approaches capable of controlled generation tailored to domain-specific constraints and instance-level requirements. In this paper, we argue that text can provide semantic insights, domain information and instance-specific temporal patterns to guide and improve TSG. We introduce ``Text-Controlled TSG'', a task focused on generating realistic time series by incorporating textual descriptions. To address data scarcity in this setting, we propose a novel LLM-based Multi-Agent framework that synthesizes diverse, realistic text-to-TS datasets. Furthermore, we introduce Bridge, a hybrid text-controlled TSG framework that integrates semantic prototypes with textual descriptions to support domain-level guidance. This approach achieves state-of-the-art generation fidelity on 11 of 12 datasets, and improves controllability by 12.52% in MSE and 6.34% in MAE compared to generation without text input, highlighting its potential for generating tailored time-series data.
Paperid:384
Authors:Georgy Noarov · Ramya Ramalingam · Aaron Roth · Stephan Xie
Abstract:
We give an efficient algorithm for producing multidimensional forecasts in an online adversarial environment that have low bias subject to any polynomial number of conditioning events, which can depend both on external context and on our predictions themselves. We demonstrate the use of this algorithm with several applications. We show how to make predictions that can be transparently consumed by any polynomial number of downstream decision makers with different utility functions, guaranteeing them diminishing swap regret at optimal rates. We also give the first efficient algorithms for guaranteeing diminishing conditional regret in online combinatorial optimization problems for an arbitrary polynomial number of conditioning events --- i.e., on an arbitrary number of intersecting subsequences determined both by context and our own predictions. Finally, we give the first efficient algorithm guaranteeing online multicalibration with $T^{2/3}$ rates in the ECE metric.
Paperid:385
Authors:Michael Sander · Vincent Roulet · Tianlin Liu · Mathieu Blondel
Abstract:
Energy-based models (EBMs) offer a flexible framework for parameterizing probability distributions using neural networks. However, learning EBMs by exact maximum likelihood estimation (MLE) is generally intractable, due to the need to compute the partition function. In this paper, we propose a novel min-min formulation for approximately learning probabilistic EBMs in combinatorially-large discrete spaces, such as sets or permutations. Our key idea is to jointly learn both an energy model and its log-partition, parameterized as a neural network. Our approach not only provides a novel tractable objective criterion to learn EBMs by stochastic gradient descent (without relying on MCMC), but also a novel means to estimate the log-partition function on unseen data points. On the theoretical side, we show that our approach recovers the optimal MLE solution when optimizing in the space of continuous functions. Furthermore, we show that our approach naturally extends to the broader family of Fenchel-Young losses, allowing us to obtain the first tractable method for optimizing the sparsemax loss in combinatorially-large spaces. We demonstrate our approach on multi-label classification and label ranking.
Paperid:386
Authors:Zhigaoyuan Wang · Ying Sun · Hengshu Zhu
Abstract:
Spatiotemporal forecasting offers the potential to discover evolutionary patterns in geographical scientific data. However, geographical scientific datasets are often manually collected across studies, resulting in limited time spans and data scales. This hinders existing methods that rely on rich historical data for individual entities. In this paper, we argue that heterogeneous datasets from different studies can provide complementary insights into the same underlying system, helping improve predictions for geographical entities with limited historical data. To this end, we propose a Segment Quadtree Geographical Embedding Framework (SQGEF). SQGEF integrates knowledge from datasets with varied target entities, time spans, and observation variables to learn unified representations for multi-granularity entities—including those absent during training. Specifically, we propose a novel data structure, Segment Quadtree, that flexibly accommodates entities of varying granularities. SQGEF not only captures multi-level interactions from grid data but also extracts nested relationships and human-defined boundaries from diverse entities, enabling a comprehensive understanding of complex geographical structures. Experiments on real-world datasets demonstrate that SQGEF effectively represents unseen geographical entities and enhances performance for various models.
Paperid:387
Authors:Zhengqi Gao · Kaiwen Zha · Tianyuan Zhang · Zihui Xue · Duane Boning
Abstract:
Guidance techniques are simple yet effective for improving conditional generation in diffusion models. Despite their empirical success, the practical implementation of guidance diverges significantly from its theoretical motivation. In this paper, we reconcile this discrepancy by replacing the scaled marginal distribution target, which we prove theoretically invalid, with a valid scaled joint distribution objective. Additionally, we show that the established guidance implementations are approximations to the intractable optimal solution under a no-future-foresight constraint. Building on these theoretical insights, we propose rectified gradient guidance (REG), a versatile enhancement designed to boost the performance of existing guidance methods. Experiments on 1D and 2D examples demonstrate that REG provides a better approximation to the optimal solution than prior guidance techniques, validating the proposed theoretical framework. Extensive experiments on class-conditional ImageNet and text-to-image generation tasks show that incorporating REG consistently improves FID and Inception/CLIP scores across various settings compared to its absence.
Paperid:388
Authors:Ruizhe Wang · Yeyun Gong · Xiao Liu · Guoshuai Zhao · Ziyue Yang · Baining Guo · Zheng-Jun Zha · Peng CHENG
Abstract:
The growing computational demands of training large language models (LLMs) necessitate more efficient methods. Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce these costs. While FP8 precision has demonstrated feasibility, leveraging FP4 remains a challenge due to significant quantization errors and limited representational capacity. This work introduces the first FP4 training framework for LLMs, addressing these challenges with two key innovations: a differentiable quantization estimator for precise weight updates and an outlier clamping and compensation strategy to prevent activation collapse. To ensure stability, the framework integrates a mixed-precision training scheme and vector-wise quantization. Experimental results demonstrate that our FP4 framework achieves accuracy comparable to BF16 and FP8, with minimal degradation, scaling effectively to 13B-parameter LLMs trained on up to 100B tokens. With the emergence of next-generation hardware supporting FP4, our framework sets a foundation for efficient ultra-low precision training.
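For contrast with the paper's FP4 scheme, here is plain symmetric round-to-nearest 4-bit integer quantization, the kind of baseline whose error ultra-low-precision training must contend with (our illustrative baseline, not the paper's format or estimator):

```python
import numpy as np

def quantize_int4(w):
    # symmetric round-to-nearest quantization to 16 integer levels in
    # [-8, 7], scaled to the tensor's absolute maximum (assumes w != 0)
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7)
    return q * scale
```

Values on the 16-level grid round-trip exactly; everything else incurs an error of at most half a quantization step, which is why flattening outliers (as in the methods above) matters so much at 4 bits.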
Paperid:389
Authors:Jueqing Lu · Wray Buntine · Yuanyuan Qi · Joanna Dipnall · Belinda Gabbe · Lan Du
Abstract:
Resolving conflicts is essential to make multi-view classification decisions more reliable. Much research has been conducted on learning consistent and informative representations among different views, often assuming that all views are equally important and perfectly aligned. However, real-world multi-view data may not always conform to these assumptions, as some views may express distinct information. To address this issue, we develop a computational trust-based discounting method to enhance the existing Evidential Multi-view framework in scenarios where conflicts between different views may arise. Its belief fusion process considers the reliability of predictions made by individual views via an instance-wise probability-sensitive trust discounting mechanism. We evaluate our method on six real-world datasets, using Top-1 Accuracy, Fleiss’ Kappa, and a new metric called Multi-View Agreement with Ground Truth that takes the ground truth labels into consideration, to measure the reliability of the predictions. We also evaluate whether uncertainty measures can effectively indicate prediction correctness by calculating the AUROC. The experimental results show that computational trust can effectively resolve conflicts, paving the way for more reliable multi-view classification models in real-world applications.
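Trust discounting in the subjective-logic tradition can be sketched as follows (a generic rule for illustration; the paper's instance-wise probability-sensitive mechanism differs in how the trust score is computed):

```python
def trust_discount(beliefs, uncertainty, trust):
    # scale each belief mass by the trust score and move the discounted
    # mass to uncertainty, so the opinion still sums to one; a view we
    # trust less contributes weaker evidence to the fusion step
    new_beliefs = [trust * b for b in beliefs]
    new_uncertainty = 1.0 - trust * (1.0 - uncertainty)
    return new_beliefs, new_uncertainty
```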
Paperid:390
Authors:Feng Wang · Yaodong Yu · Wei Shao · Yuyin Zhou · Alan Yuille · Cihang Xie
Abstract:
Since the introduction of the Vision Transformer (ViT), patchification has long been regarded as a standard image preprocessing step for plain visual architectures. By compressing the spatial size of images, this approach effectively shortens the token sequence and reduces the computational cost of ViT-like plain architectures. In this work, we aim to thoroughly examine the information loss caused by this patchification-based compressive encoding paradigm and how it affects visual understanding. We conduct extensive patch size scaling experiments and observe an intriguing scaling law in patchification: models consistently benefit from decreased patch sizes and attain improved predictive performance until reaching the minimum patch size of 1×1, i.e., pixel tokenization. This conclusion is broadly applicable across different vision tasks, various input scales, and diverse architectures such as ViT and the recent Mamba models. Moreover, as a by-product, we discover that with smaller patches, task-specific decoder heads become less critical for dense prediction. In the experiments, we successfully scale up the visual sequence to an exceptional length of 50,176 tokens, achieving a competitive test accuracy of 84.6% with a base-sized model on the ImageNet-1k benchmark. We hope this study can provide insights and theoretical foundations for future work on building non-compressive vision models.
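The 50,176-token figure follows directly from patchification arithmetic (assuming, as is standard for ImageNet-scale ViTs, a 224×224 input):

```python
def num_tokens(image_size=224, patch_size=16):
    # ViT-style patchification: sequence length grows quadratically as the
    # patch shrinks; patch_size=1 is the pixel tokenization discussed above
    return (image_size // patch_size) ** 2
```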
Paperid:391
Authors:Martin Van Waerebeke · Kevin Scaman · Marco Lorenzi · Giovanni Neglia
Abstract:
Machine Unlearning (MU) aims at removing the influence of specific data points from a trained model, striving to achieve this at a fraction of the cost of full model retraining. In this paper, we analyze the efficiency of unlearning methods and establish the first upper and lower bounds on minimax computation times for this problem, characterizing the performance of the most efficient algorithm against the most difficult objective function. Specifically, for strongly convex objective functions and under the assumption that the forget data is inaccessible to the unlearning method, we provide a phase diagram for the unlearning complexity ratio, a novel metric that compares the computational cost of the best unlearning method to full model retraining. The phase diagram reveals three distinct regimes: one where unlearning at a reduced cost is infeasible, another where unlearning is trivial because adding noise suffices, and a third where unlearning achieves significant computational advantages over retraining. These findings highlight the critical role of factors such as data dimensionality, the number of samples to forget, and privacy constraints in determining the practical feasibility of unlearning.
Paperid:392
Authors:Jiangshan Wang · Junfu Pu · Zhongang Qi · Jiayi Guo · Yue Ma · Nisha Huang · Yuxin Chen · Xiu Li · Ying Shan
Abstract:
Rectified-flow-based diffusion transformers like FLUX and OpenSora have demonstrated outstanding performance in the field of image and video generation. Despite their robust generative capabilities, these models often struggle with inversion inaccuracies, which could further limit their effectiveness in downstream tasks such as image and video editing. To address this issue, we propose RF-Solver, a novel training-free sampler that effectively enhances inversion precision by mitigating the errors in the ODE-solving process of rectified flow. Specifically, we derive the exact formulation of the rectified flow ODE and apply the high-order Taylor expansion to estimate its nonlinear components, significantly enhancing the precision of ODE solutions at each timestep. Building upon RF-Solver, we further propose RF-Edit, a general feature-sharing-based framework for image and video editing. By incorporating self-attention features from the inversion process into the editing process, RF-Edit effectively preserves the structural information of the source image or video while achieving high-quality editing results. Our approach is compatible with any pre-trained rectified-flow-based models for image and video tasks, requiring no additional training or optimization. Extensive experiments across generation, inversion, and editing tasks in both image and video modalities demonstrate the superiority and versatility of our method. The source code will be released.
Paperid:393
Authors:Minting Pan · Yitao Zheng · Jiajian Li · Yunbo Wang · Xiaokang Yang
Abstract:
Offline reinforcement learning (RL) optimizes control policies using static datasets, reducing the need for costly and potentially unsafe exploration in real-world scenarios. However, it faces key challenges, particularly value overestimation due to the lack of an interactive environment. In this paper, we introduce Video-Enhanced Offline RL (VeoRL), a novel model-based RL approach that builds an interactive world model using a diverse collection of unlabeled, heterogeneous video data—freely accessible and potentially unlimited from the Internet. Through model-based policy optimization, VeoRL effectively transfers commonsense knowledge of control behaviors and physical dynamics, extracted from natural videos, to the RL agent. Our approach demonstrates substantial improvements over existing offline RL methods across a wide range of visuomotor control tasks, including robotic manipulation, autonomous driving, and open-world video games, delivering over a $100\%$ performance boost on multiple tasks.
Paperid:394
Authors:Yifei Zhou · Qianlan Yang · Kaixiang Lin · Min Bai · Xiong Zhou · Yu-Xiong Wang · Sergey Levine · Li Li
Abstract:
A generalist foundation model agent needs a large and diverse skill repertoire, such as finding directions between two travel locations and buying specific items from the Internet. If each skill must be specified manually through a fixed set of human-annotated instructions, the agent’s skill repertoire will necessarily be limited by the scalability of human annotation. In this work, we address this challenge by proposing Proposer-Agent-Evaluator (PAE), an effective learning system that enables foundation model agents to autonomously discover and practice skills in the wild. After a context-aware task proposer generates instructions based on website information, the agent policy attempts those tasks in the real world, with the resulting trajectories evaluated by an autonomous VLM-based success evaluator. The success evaluation serves as the reward signal for the agent to refine its policies through RL. We validate PAE on challenging vision-based web navigation, using both real-world and self-hosted websites from WebVoyager and WebArena. Our results show that PAE significantly improves the zero-shot generalization capability of VLM Internet agents (around 50% relative improvement) to both unseen tasks and websites.
Paperid:395
Authors:Severi Rissanen · RuiKang OuYang · Jiajun He · Wenlin Chen · Markus Heinonen · Arno Solin · Jose Miguel Hernandez-Lobato
Abstract:
Recent research has focused on designing neural samplers that amortize the process of sampling from unnormalized densities. However, despite significant advancements, they still fall short of the state-of-the-art MCMC approach, Parallel Tempering (PT), in terms of the number of target evaluations. On the other hand, we also identify several inefficiencies in PT: (1) information across different temperature levels is only exchanged through swapping samples; (2) the varying mixing rates across temperatures lead to wasted computation, particularly at low temperatures early on; and (3) PT’s reliance on MCMC results in dependent samples, slowing transitions between modes. To address the weaknesses in both approaches, we propose Progressive Tempering Sampler with Diffusion (PTSD), which trains diffusion models sequentially across temperatures. We introduce a novel method to combine high-temperature diffusion models to generate approximate lower-temperature samples, which are minimally refined using MCMC and used to train the next diffusion model. This approach enables efficient reuse of sample information across temperature levels while generating well-mixed, uncorrelated samples. Our method significantly improves target evaluation efficiency, outperforming diffusion-based neural samplers and demonstrating a promising direction to improve over PT.
Paperid:396
Authors:Boyuan Li · Yicheng Luo · Zhen Liu · Junhao Zheng · Jianming Lv · Qianli Ma
Abstract:
Irregular multivariate time series (IMTS) are characterized by irregular time intervals within variables and unaligned observations across variables, posing challenges in learning temporal and variable dependencies. Many existing IMTS models either require padded samples to learn separately from temporal and variable dimensions, or represent original samples via bipartite graphs or sets. However, the former approaches often need to handle extra padding values, affecting efficiency and disrupting original sampling patterns, while the latter have limitations in capturing dependencies among unaligned observations. To represent and learn both dependencies from original observations in a unified form, we propose HyperIMTS, a Hypergraph neural network for Irregular Multivariate Time Series forecasting. Observed values are converted to nodes in the hypergraph, interconnected by temporal and variable hyperedges to enable message passing among all observations. Through irregularity-aware message passing, HyperIMTS captures variable dependencies in a time-adaptive way to achieve accurate forecasting. Experiments demonstrate HyperIMTS's competitive performance among state-of-the-art models in IMTS forecasting with low computational cost.
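The node-hyperedge construction can be illustrated on a toy example (our sketch; names and shapes are illustrative): each observed (time, variable) value becomes a node, with one hyperedge per timestamp and one per variable, so no padding is ever introduced for missing observations.

```python
import numpy as np

# four irregular observations of variables "a" and "b" at timestamps 0, 1, 2
obs = [(0, "a"), (0, "b"), (1, "a"), (2, "b")]
edges = [("t", t) for t in sorted({t for t, _ in obs})] + \
        [("v", v) for v in sorted({v for _, v in obs})]

# node-by-hyperedge incidence matrix: each observation joins exactly
# one temporal hyperedge and one variable hyperedge
H = np.zeros((len(obs), len(edges)))
for i, (t, v) in enumerate(obs):
    H[i, edges.index(("t", t))] = 1
    H[i, edges.index(("v", v))] = 1
```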
Paperid:397
Authors:Hyeonah Kim · Sanghyeok Choi · Jiwoo Son · Jinkyoo Park · Changhyun Kwon
Abstract:
Effective search methods are crucial for improving the performance of deep generative models at test time. In this paper, we introduce a novel test-time search method, Neural Genetic Search (NGS), which incorporates the evolutionary mechanism of genetic algorithms into the generation procedure of deep models. The core idea behind NGS is its crossover, which is defined as parent-conditioned generation using trained generative models. This approach offers a versatile and easy-to-implement search algorithm for deep generative models. We demonstrate the effectiveness and flexibility of NGS through experiments across three distinct domains: routing problems, adversarial prompt generation for language models, and molecular design.
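The NGS loop can be sketched generically as follows, where generate(parents) stands in for parent-conditioned sampling from a trained generative model and generate(None) for unconditional sampling (a hypothetical API, not the paper's exact interface):

```python
import random

def neural_genetic_search(generate, fitness, pop_size=8, generations=5, seed=0):
    # genetic-algorithm-style test-time search: keep the fittest half as
    # parents, then fill the population with parent-conditioned samples
    rng = random.Random(seed)
    population = [generate(None) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = [generate(rng.sample(parents, 2))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)
```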
Paperid:398
Authors:Haocheng Xi · Shuo Yang · Yilong Zhao · Chenfeng Xu · Muyang Li · Xiuyu Li · Yujun Lin · Han Cai · Jintao Zhang · Dacheng Li · Jianfei Chen · Ion Stoica · Kurt Keutzer · Song Han
Abstract:
Diffusion Transformers (DiTs) dominate video generation, but their high computational cost severely limits real-world applicability, usually requiring tens of minutes to generate a few seconds of video even on high-performance GPUs. This inefficiency primarily arises from the quadratic computational complexity of 3D full attention with respect to the context length. In this paper, we propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D full attention to boost inference efficiency. We reveal that the attention heads can be dynamically classified into two groups depending on distinct sparse patterns: (1) Spatial Head, where only spatially-related tokens within each frame dominate the attention output, and (2) Temporal Head, where only temporally-related tokens across different frames dominate. Based on this insight, SVG proposes an online profiling strategy to capture the dynamic sparse patterns and predicts the type of attention head. Combined with a novel hardware-efficient tensor layout transformation and customized kernel implementations, SVG achieves up to 2.28$\times$ and 2.33$\times$ end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality. Our code will be open-sourced upon publication.
Paperid:399
Authors:Haoyun Jiang · Haolin li · jianwei zhang · Fei Huang · Qiang Hu · Minmin Sun · Shuai Xiao · Yong Li · Junyang Lin · Jiangchao Yao
Abstract:
Large language models (LLMs) have demonstrated strong capabilities in handling long-context tasks, but processing such long contexts remains challenging due to the substantial memory requirements and inference latency. In this work, we discover that certain attention heads exhibit sequential consistency in their attention patterns, which can be persistently identified using a coefficient-of-variation-based algorithm. Inspired by this observation, we propose CateKV, a hybrid KV cache method that retains only critical token information for consistent heads, thereby reducing KV cache size and computational overhead, while preserving the majority of KV pairs in adaptive heads to ensure high accuracy. We show the unique characteristics of our algorithm and its extension with existing acceleration methods. Comprehensive evaluations on long-context benchmarks show that, while maintaining accuracy comparable to full attention, CateKV reduces memory usage by up to $2.72\times$ and accelerates decoding by $2.18\times$ in single-sample inputs, and boosts throughput by $3.96\times$ in batch scenarios.
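A coefficient-of-variation test for head consistency might be sketched as follows (our illustration of the general idea; the paper's exact per-head statistic may differ):

```python
import numpy as np

def consistent_heads(stats, threshold=0.5):
    # `stats` holds one summary value per head per decoding step
    # (shape: heads x steps); heads whose pattern barely changes across
    # steps have a low coefficient of variation and count as "consistent"
    cv = stats.std(axis=1) / (stats.mean(axis=1) + 1e-8)
    return cv < threshold
```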
Paperid:400
Authors:Shentong Mo · Sukmin Yun
Abstract:
Generative models have made it possible to synthesize highly realistic images, potentially providing an abundant data source for training machine learning models. Despite the advantages of these synthetic data sources, the indiscriminate use of generated images as real images for training can even cause mode collapse due to modality discrepancies between real and synthetic domains. In this paper, we propose a novel framework for the discriminative use of generated images, coined \textit{GMAIL}, that explicitly treats generated images as a separate modality from real images. Instead of indiscriminately replacing real images with generated ones in the pixel space, our approach bridges the two distinct modalities in the same latent space through a multimodal learning approach. To be specific, we first fine-tune a model exclusively on generated images using a cross-modality alignment loss and then employ this aligned model to further train various vision-language models with generated images. By aligning the two modalities, our approach effectively leverages the benefits of recent advances in generative models, thereby boosting the effectiveness of generated image learning across a range of vision-language tasks. Our framework can be easily incorporated with various vision-language models, and we demonstrate its efficacy through extensive experiments. For example, our framework significantly improves performance on image captioning, zero-shot image retrieval, zero-shot image classification, and long caption retrieval tasks. It also shows positive generated data scaling trends and notable enhancements in the captioning performance of the large multimodal model, LLaVA.
Paperid:401
Authors:Jinghang Yue · Jing Wang · Lu Zhang · Shuo Zhang · Da Li · Zhaoyang Ma · Youfang Lin
Abstract:
Explaining deep models for time-series data is crucial for identifying key patterns in sensitive domains, such as healthcare and finance. However, due to the lack of a unified optimization criterion, existing explanation methods often suffer from redundancy and incompleteness, where irrelevant patterns are included or key patterns are missed in explanations. To address this challenge, we propose the Optimal Information Retention Principle, in which conditional mutual information defines minimizing redundancy and maximizing completeness as optimization objectives. We then derive the corresponding objective function theoretically. As a practical realization, we introduce ORTE, an explanation framework that learns a binary mask to eliminate redundant information while mining temporal patterns of explanations. We decouple the discrete mapping process to ensure the stability of gradient propagation, while employing contrastive learning to achieve precise filtering of explanatory patterns through the mask, thereby realizing a trade-off between low redundancy and high completeness. Extensive quantitative and qualitative experiments on synthetic and real-world datasets demonstrate that the proposed principle significantly improves the accuracy and completeness of explanations compared to baseline methods. The code is available at*.
Paperid:402
Authors:Meng Chen · Hongwei Jia · Zechen Li · Weiming Huang · Wenzhen Jia · Kai Zhao · Hongjun Dai
Abstract:
Urban region embeddings have recently seen significant progress. However, existing methods remain largely restricted to a single city, which limits cross-city knowledge transfer and the reuse of learned embeddings for downstream tasks (e.g., population estimation) in other cities. To overcome this limitation, we propose Consistency Region Embedding (CoRE), a unified framework that integrates region embedding learning with cross-city latent space alignment. CoRE embeds regions from different cities into distinct latent spaces, then aligns their latent space manifolds while ensuring fine-grained compatibility. This alignment enables region representations from different cities to be directly applied within a shared latent space, facilitating reuse of the model in other cities. Extensive experiments demonstrate that CoRE outperforms state-of-the-art baselines, underscoring its effectiveness in enabling knowledge transfer through aligned latent spaces.
Paperid:403
Authors:Fan Xu · Cheng Xin · Xin Ding · Jie Gao · Jiaxin Ding
Abstract:
Graph Neural Networks (GNNs) have shown remarkable performance in various scientific domains, but their lack of interpretability limits their applicability in critical decision-making processes. Recently, intrinsically interpretable GNNs have been studied to provide insights into model predictions by identifying rationale substructures in graphs. However, existing methods face challenges when the underlying rationale subgraphs are complicated and variable. To address this challenge, we propose TOPING, a novel topological framework for interpretable GNNs that leverages persistent homology to identify persistent rationale subgraphs. Our method introduces a rationale filtration learning technique that models the generating procedure of rationale subgraphs, and enforces the persistence of the topological gap between rationale subgraphs and complement random graphs through a novel self-adjusted topological constraint, the topological discrepancy. We show that our topological discrepancy is a lower bound of a Wasserstein distance on graph distributions under the Gromov-Hausdorff metric. We provide theoretical guarantees showing that our loss is uniquely optimized by the ground truth under certain conditions. Through extensive experiments on various synthetic and real datasets, we demonstrate that TOPING effectively addresses key challenges in interpretable GNNs, including handling variform rationale subgraphs, balancing performance with interpretability, and avoiding spurious correlations. Experimental results show that our approach improves the state of the art in both predictive accuracy and interpretation quality.
Paperid:404
Authors:Harikrishna Metta · Venkatesh Babu Radhakrishnan
Abstract:
In machine learning, separating data into classes is a fundamental problem. This work presents a mathematical framework around classes to deepen the understanding of them. Classes are defined as vectors in a vector space, where addition corresponds to the union of classes and scalar multiplication resembles the set complement of classes. The zero vector in this space corresponds to a class called the Zero-Vector Class. This discovery enables numerous applications. One such application, termed 'clear learning' in this work, focuses on learning the true nature (manifold) of the data instead of merely learning a boundary sufficient for classification. Another application, called 'unary class learning', involves learning a single class in isolation rather than learning by comparing two or more classes. Additionally, 'set operations on classes' is another application highlighted in this work. Furthermore, continual learning of classes is facilitated by smaller networks. The Zero-Vector Class enables neural networks to learn only the data manifold; therefore, it can also be used to generate new data. Results for the key applications are shown using the MNIST dataset. To further strengthen the claims, some results are also produced on the CIFAR-10 dataset.
Paperid:405
Authors:Ahmad Beirami · Alekh Agarwal · Jonathan Berant · Alexander D'Amour · Jacob Eisenstein · Chirag Nagpal · Ananda Suresh
Abstract:
A simple and effective method for the inference-time alignment of generative models is the best-of-$n$ policy, where $n$ samples are drawn from a reference policy, ranked based on a reward function, and the highest-ranking one is selected. A commonly used analytical expression in the literature claims that the KL divergence between the best-of-$n$ policy and the reference policy is equal to $\log (n) - (n-1)/n.$ We disprove the validity of this claim and show that it is instead an upper bound on the actual KL divergence. We also explore the tightness of this upper bound in different regimes, propose a new estimator for the KL divergence, and empirically show that it provides a tight approximation. We also show that the win rate of the best-of-$n$ policy against the reference policy is upper bounded by $n/(n+1)$ and derive bounds on the tightness of this characterization. We conclude by analyzing the tradeoffs between the win rate and KL divergence of the best-of-$n$ alignment policy, which demonstrate that very good tradeoffs are achievable with $n < 1000$.
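The claim can be checked numerically. The sketch below (not the paper's estimator) computes the exact best-of-$n$ distribution for a small discrete reference policy, under the simplifying assumption that outcomes have distinct rewards and are sorted by increasing reward, and verifies that the actual KL divergence falls below $\log(n) - (n-1)/n$:

```python
import numpy as np

def best_of_n_exact(p, n):
    """Exact best-of-n distribution over discrete outcomes.

    p: reference probabilities, assumed sorted by increasing reward
    with distinct reward values. The best of n i.i.d. draws is the one
    with the highest reward, so pi_n(y) = F(y)^n - F(y^-)^n, where F is
    the reference CDF in reward order.
    """
    F = np.cumsum(p)
    Fm = np.concatenate(([0.0], F[:-1]))
    return F**n - Fm**n

def kl(q, p):
    """KL divergence sum_y q(y) log(q(y)/p(y))."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

p = np.array([0.5, 0.3, 0.2])        # reference policy, rewards increasing
n = 4
pi_n = best_of_n_exact(p, n)
actual = kl(pi_n, p)                 # true KL of best-of-n vs reference
claimed = np.log(n) - (n - 1) / n    # the commonly cited expression
# 'claimed' upper-bounds 'actual', consistent with the paper's result.
```

For this example the bound is not tight, illustrating why an equality claim fails in general.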
Paperid:406
Authors:Mao-Lin Luo · Zi-Hao Zhou · Tong Wei · Min-Ling Zhang
Abstract:
Continual learning with vision-language models like CLIP offers a pathway toward scalable machine learning systems by leveraging their transferable representations. Existing CLIP-based methods adapt the pre-trained image encoder by adding multiple sets of learnable parameters, with each task using a partial set of parameters. This requires selecting the expected parameters for input images during inference, which is prone to errors that degrade performance. To address this problem, we introduce LADA (Label-specific ADApter). Instead of partitioning parameters across tasks, LADA appends lightweight, label-specific memory units to the frozen CLIP image encoder, enabling discriminative feature generation by aggregating task-agnostic knowledge. To prevent catastrophic forgetting, LADA employs feature distillation for seen classes, preventing their features from being interfered with by new classes. Positioned after the image encoder, LADA prevents gradient flow to the frozen CLIP parameters, ensuring efficient training. Extensive results show that LADA achieves state-of-the-art performance in continual learning settings. The supplementary material includes source code for reproducibility.
Paperid:407
Authors:Jiaqi Zhu · Shaofeng Cai · Shen · Gang Chen · Fang Deng · Beng Chin Ooi
Abstract:
Machine learning has demonstrated transformative potential for database operations, such as query optimization and in-database data analytics. However, dynamic database environments, characterized by frequent updates and evolving data distributions, introduce concept drift, which leads to performance degradation for learned models and limits their practical applicability. Addressing this challenge requires efficient frameworks capable of adapting to shifting concepts while minimizing the overhead of retraining or fine-tuning. In this paper, we propose FLAIR, an online adaptation framework that introduces a new paradigm called \textit{in-context adaptation} for learned database operations. FLAIR leverages an inherent property of data systems, i.e., the immediate availability of execution results for predictions, to enable dynamic context construction. By formalizing adaptation as $f:(\mathbf{x} | \mathcal{C}_t) \to \mathbf{y}$, with $\mathcal{C}_t$ representing a dynamic context memory, FLAIR delivers predictions aligned with the current concept, eliminating the need for runtime parameter optimization. To achieve this, FLAIR integrates two key modules: a Task Featurization Module for encoding task-specific features into standardized representations, and a Dynamic Decision Engine, pre-trained via Bayesian meta-training, that adapts seamlessly using contextual information at runtime. Extensive experiments across key database tasks demonstrate that FLAIR outperforms state-of-the-art baselines, achieving up to $5.2\times$ faster adaptation and reducing error by 22.5\% for cardinality estimation.
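The adaptation paradigm $f:(\mathbf{x} | \mathcal{C}_t) \to \mathbf{y}$ can be illustrated with a deliberately minimal sketch. A database observes the true result of each prediction right after execution, so a bounded memory of recent (features, outcome) pairs serves as the context $\mathcal{C}_t$, and a new input is predicted from its nearest stored neighbors without any parameter updates. The class below is an illustrative stand-in; FLAIR's actual modules (task featurization, a Bayesian meta-trained decision engine) are far richer.

```python
from collections import deque

class ContextMemory:
    """Toy in-context adaptation: predict f(x | C_t) from a bounded
    memory of recent execution results, with no runtime training."""

    def __init__(self, capacity=64, k=3):
        self.mem = deque(maxlen=capacity)  # C_t: evicts stale concepts
        self.k = k

    def observe(self, x, y):
        # Execution result immediately becomes part of the context.
        self.mem.append((x, y))

    def predict(self, x):
        # Average the outcomes of the k nearest stored examples.
        nearest = sorted(self.mem, key=lambda p: abs(p[0] - x))[: self.k]
        return sum(y for _, y in nearest) / len(nearest)

m = ContextMemory(capacity=2, k=1)
for x in [1.0, 2.0, 3.0]:
    m.observe(x, 10 * x)       # old concept: y = 10x
m.observe(2.0, 50 * 2.0)       # concept drift: y = 50x enters C_t
```

Because the memory is bounded, predictions for x = 2.0 now follow the drifted concept rather than the stale one.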
Paperid:408
Authors:Christian Lange · Max Hamilton · Elijah Cole · Alexander Shepard · Samuel Heinrich · Angela Zhu · Subhransu Maji · Grant Horn · Oisin Mac Aodha
Abstract:
Knowing where a particular species can or cannot be found on Earth is crucial for ecological research and conservation efforts. By mapping the spatial ranges of all species, we would obtain deeper insights into how global biodiversity is affected by climate change and habitat loss. However, accurate range estimates are only available for a relatively small proportion of all known species. For the majority of the remaining species, we often only have a small number of records denoting the spatial locations where they have previously been observed. We outline a new approach for few-shot species range estimation to address the challenge of accurately estimating the range of a species from limited data. During inference, our model takes a set of spatial locations as input, along with optional metadata such as text or an image, and outputs a species encoding that can be used to predict the range of a previously unseen species in a feed-forward manner. We validate our method on two challenging benchmarks, where we obtain state-of-the-art range estimation performance, in a fraction of the compute time, compared to recent alternative approaches.
Paperid:409
Authors:Daniel Zilberg · Ron Levie
Abstract:
We propose PieClam (Prior Inclusive Exclusive Cluster Affiliation Model): a graph autoencoder in which nodes are embedded into a code space by an algorithm that maximizes the log-likelihood of the decoded graph. PieClam is a community affiliation model that extends well-known methods like BigClam in two main ways. First, instead of the decoder being defined solely via pairwise interactions between the nodes in the code space, we also incorporate a learned prior on the distribution of nodes in the code space, turning our method into a graph generative model. Second, we generalize the notion of communities by allowing not only sets of nodes with strong connectivity, which we call inclusive communities, but also sets of nodes with strong disconnection, which we call exclusive communities. By introducing a new graph similarity measure, called the log cut distance, we show that PieClam is a universal autoencoder, able to uniformly approximately reconstruct any graph. Our method is shown to obtain competitive performance in graph anomaly detection and link prediction benchmarks.
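The base decoder that PieClam extends is easy to sketch. BigClam models the probability of an edge from nonnegative community affiliations as $P(u \sim v) = 1 - \exp(-F_u \cdot F_v)$; the snippet below shows only this pairwise decoder (PieClam additionally learns a prior over the code space and allows exclusive communities, which are omitted here):

```python
import numpy as np

def bigclam_edge_prob(F):
    """Pairwise edge probabilities of a BigClam-style decoder.

    F: (num_nodes, num_communities) nonnegative affiliation matrix.
    P(u ~ v) = 1 - exp(-F_u . F_v): nodes sharing strong affiliations
    are likely connected.
    """
    S = F @ F.T                  # pairwise affiliation inner products
    P = 1.0 - np.exp(-S)
    np.fill_diagonal(P, 0.0)     # no self-loops
    return P

# Nodes 0 and 1 share a community; node 2 belongs to a different one.
F = np.array([[2.0, 0.0],
              [1.5, 0.0],
              [0.0, 2.0]])
P = bigclam_edge_prob(F)
```

Here P[0, 1] is close to 1 while P[0, 2] is exactly 0, since nodes 0 and 2 share no affiliation mass.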
Paperid:410
Authors:Sarthak Mittal · Eric Elmoznino · Léo Gagnon · Sangnie Bhardwaj · Guillaume Lajoie · Dhanya Sridhar
Abstract:
Large autoregressive models like Transformers can solve tasks through in-context learning (ICL) without learning new weights, suggesting avenues for efficiently solving new tasks. For many tasks, e.g., linear regression, the data factorizes: examples are independent given a task latent that generates the data, e.g., linear coefficients. While an optimal predictor leverages this factorization by inferring task latents, it is unclear if Transformers implicitly do so or instead exploit heuristics and statistical shortcuts through attention layers. In this paper, we systematically investigate the effect of explicitly inferring task latents by minimally modifying the Transformer architecture with a bottleneck to prevent shortcuts and incentivize structured solutions. We compare it against standard Transformers across various ICL tasks and find that contrary to intuition and recent works, there is little discernible difference between the two; biasing towards task-relevant latent variables does not lead to better out-of-distribution performance, in general. Curiously, we find that while the bottleneck effectively learns to extract latent task variables from context, downstream processing struggles to utilize them for robust prediction. Our study highlights the intrinsic limitations of Transformers in achieving structured ICL solutions that generalize, and shows that while inferring the right latents aids interpretability, it is not sufficient to alleviate this problem.
Paperid:411
Authors:Fan Nie · Xiaotian Hou · Shuhang Lin · James Zou · Huaxiu Yao · Linjun Zhang
Abstract:
The propensity of large language models (LLMs) to generate hallucinations and non-factual content undermines their reliability in high-stakes domains, where rigorous control over Type I errors (the conditional probability of incorrectly classifying hallucinations as truthful content) is essential. Despite its importance, formal verification of LLM factuality with such guarantees remains largely unexplored. In this paper, we introduce FactTest, a novel framework that statistically assesses whether an LLM can provide correct answers to given questions with high-probability correctness guarantees. We formulate hallucination detection as a hypothesis testing problem to enforce an upper bound on Type I errors at user-specified significance levels. Notably, we prove that FactTest also ensures strong Type II error control under mild conditions and can be extended to maintain its effectiveness when covariate shifts exist. Our approach is distribution-free and works for any number of human-annotated samples. It is model-agnostic and applies to any black-box or white-box language model. Extensive experiments on question-answering (QA) benchmarks demonstrate that FactTest effectively detects hallucinations and enables LLMs to abstain from answering unknown questions, leading to an accuracy improvement of over 40%.
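Distribution-free Type I error control of this flavor is often built from order statistics of calibration scores. The sketch below is a generic conformal-style construction, not FactTest's actual procedure: given model confidence scores on calibration questions the model answered incorrectly (hallucinations), answering only when a new score exceeds the $\lceil(m+1)(1-\alpha)\rceil$-th smallest calibration score bounds the probability of answering on a hallucination by $\alpha$, assuming exchangeable scores. All names and the score semantics are assumptions for illustration.

```python
import math

def answer_threshold(halluc_scores, alpha=0.05):
    """Distribution-free confidence threshold (conformal-style sketch).

    halluc_scores: confidence scores on calibration hallucinations.
    Returns a threshold t such that answering only when score > t keeps
    the Type I error (answering on a hallucination) at most alpha, for
    exchangeable scores. Works for any number m of labeled samples;
    with too few samples for the requested alpha, always abstain.
    """
    m = len(halluc_scores)
    k = math.ceil((m + 1) * (1 - alpha))
    if k > m:
        return float("inf")  # cannot certify level alpha: abstain
    return sorted(halluc_scores)[k - 1]

scores = [i / 100 for i in range(1, 100)]  # 99 calibration scores
t = answer_threshold(scores, alpha=0.05)
```

Note the graceful degradation: with a single calibration sample and alpha = 0.05, the threshold is infinite, so the model abstains rather than overclaim a guarantee.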
Paperid:412
Authors:Xiuyuan Wang · Chaochao Chen · Weiming Liu · Xinting Liao · Fan Wang · Xiaolin Zheng
Abstract:
With growing privacy concerns and the enforcement of data protection regulations, machine unlearning has emerged as a promising approach for removing the influence of forget data while maintaining model performance on retain data. However, most existing unlearning methods require access to the original training data, which is often impractical due to privacy policies, storage constraints, and other limitations. This gives rise to the challenging task of source-free unlearning, where unlearning must be accomplished without access to the original training data. The few existing source-free unlearning methods rely on knowledge distillation and model retraining, which impose substantial computational costs. In this work, we propose the Data Synthesis-based Discrimination-Aware (DSDA) unlearning framework, which enables efficient source-free unlearning through two stages: (1) Accelerated Energy-Guided Data Synthesis (AEGDS), which employs Langevin dynamics to model the training data distribution while integrating Runge–Kutta methods and momentum to enhance efficiency. (2) Discrimination-Aware Multitask Optimization (DAMO), which refines the feature distribution of retain data and mitigates gradient conflicts among multiple unlearning objectives. Extensive experiments on three benchmark datasets demonstrate that DSDA outperforms existing unlearning methods, validating its effectiveness and efficiency in source-free unlearning.
Paperid:413
Authors:Minjong Yoo · Jinwoo Jang · Sihyung Yoon · Honguk Woo
Abstract:
In embodied AI, a persistent challenge is enabling agents to robustly adapt to novel domains without requiring extensive data collection or retraining. To address this, we present a world model implanting framework (WorMI) that combines the reasoning capabilities of large language models (LLMs) with independently learned, domain-specific world models through test-time composition. By allowing seamless implantation and removal of the world models, the embodied agent's policy achieves and maintains cross-domain adaptability. In the WorMI framework, we employ a prototype-based world model retrieval approach, utilizing efficient trajectory-based abstract representation matching, to incorporate relevant models into test-time composition. We also develop a world-wise compound attention method that not only integrates the knowledge from the retrieved world models but also aligns their intermediate representations with the reasoning model's representation within the agent's policy. This framework design effectively fuses domain-specific knowledge from multiple world models, ensuring robust adaptation to unseen domains. We evaluate our WorMI on the VirtualHome and ALFWorld benchmarks, demonstrating superior zero-shot and few-shot performance compared to several LLM-based approaches across a range of unseen domains. These results highlight the framework’s potential for scalable, real-world deployment in embodied agent scenarios where adaptability and data efficiency are essential.
Paperid:414
Authors:Xiaoyu Yang · Lijian Xu · Hongsheng Li · Shaoting Zhang
Abstract:
This paper proposes a scalable and straightforward pre-training paradigm for efficient visual conceptual representation called occluded image contrastive learning (OCL). Our OCL approach is simple: we randomly mask patches to generate different views within an image and contrast them among a mini-batch of images. The core idea behind OCL consists of two designs. First, masked tokens have the potential to significantly diminish the conceptual redundancy inherent in images, and create distinct views with substantial fine-grained differences at the semantic concept level instead of the instance level. Second, contrastive learning is adept at extracting high-level semantic conceptual features during pre-training, circumventing the high-frequency interference and additional costs associated with image reconstruction. Importantly, OCL learns highly semantic conceptual representations efficiently without relying on hand-crafted data augmentations or additional auxiliary modules. Empirically, OCL demonstrates high scalability with Vision Transformers: a ViT-L/16 can complete pre-training in 133 hours using only 4 A100 GPUs, achieving 85.8\% accuracy in downstream fine-tuning tasks. Code is available at https://anonymous.4open.science/r/OLRS/.
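The view-creation step can be sketched compactly. The snippet below only illustrates occlusion-based view generation for one image (the encoder and the InfoNCE loss over the mini-batch are omitted); the `keep` ratio, shapes, and function name are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_views(patch_tokens, keep=0.25):
    """Create two occluded views of one image by random patch masking.

    patch_tokens: (num_patches, dim) patch embeddings of a single
    image. Each view keeps a different random subset of patches, so
    the two views differ at the semantic-concept level rather than
    being two augmentations of the same instance.
    """
    n = patch_tokens.shape[0]
    k = max(1, int(n * keep))
    idx1 = rng.choice(n, size=k, replace=False)
    idx2 = rng.choice(n, size=k, replace=False)
    return patch_tokens[idx1], patch_tokens[idx2]

tokens = rng.normal(size=(196, 768))  # 14x14 patches, ViT-style dim
v1, v2 = masked_views(tokens)
```

The two views would then be encoded and contrasted against views of the other images in the mini-batch.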
Paperid:415
Authors:Bin Luo · Yuwen Huang · Jonathan Allcock · Xiaojun Lin · Shengyu Zhang · John C. S. Lui
Abstract:
In this work, we design quantum algorithms that are more efficient than classical algorithms to solve time-dependent and finite-horizon Markov Decision Processes (MDPs) in two distinct settings: (1) In the exact dynamics setting, where the agent has full knowledge of the environment's dynamics (i.e., transition probabilities), we prove that our algorithm **QVI-1** achieves a quadratic speedup in the size of the action space $(A)$ compared with the classical value iteration algorithm for computing the optimal policy ($\pi^{\*}$) and the optimal V-value function ($V_{0}^{\*}$). Furthermore, our algorithm **QVI-2** provides an additional speedup in the size of the state space $(S)$ when obtaining near-optimal policies and V-value functions. (2) In the generative model setting, where samples from the environment are accessible in quantum superposition, we prove that our algorithms **QVI-3** and **QVI-4** achieve improvements in sample complexity over the state-of-the-art (SOTA) classical algorithm in terms of $A$, estimation error $(\epsilon)$, and time horizon $(H)$. More importantly, we prove quantum lower bounds to show that **QVI-3** and **QVI-4** are nearly minimax optimal.
Paperid:416
Authors:Rina Bao · Shilong Dong · Zhenfang Chen · Sheng He · Patricia Ellen Grant · Yangming Ou
Abstract:
Medical Visual Question Answering (MVQA) requires AI models to answer questions related to medical images, offering significant potential to assist medical professionals in evaluating and diagnosing diseases, thereby improving early interventions. However, existing MVQA datasets primarily focus on basic questions regarding visual perception and pattern recognition, without addressing the more complex questions that are critical in clinical diagnosis and decision-making. This paper introduces a new dataset designed for professional-level medical reasoning, simulating the decision-making process. We achieve this by collecting MRI and clinical data related to Hypoxic-Ischemic Encephalopathy (HIE, a neonatal brain injury that occurs in 1-5/1000 neonates) over a decade of effort, enriched with expert annotations and insights. Building on this data, we generate clinical question-answer pairs and MRI interpretations to enable comprehensive diagnosis, interpretation, and prediction of neurocognitive outcomes. Our evaluation of current large vision-language models (LVLMs) shows limited performance on this benchmark, highlighting both the challenges of the task and the importance of this benchmark for advancing medical AI. Furthermore, we propose a novel "Clinical Graph of Thoughts" (CGoT) model, which integrates domain-specific medical knowledge and clinical reasoning processes with the interpretive abilities of LVLMs. The model demonstrates promising results, achieving around 15% absolute gain on the most important neurocognitive outcome task, while the dataset still reveals substantial opportunities for further research innovation.
Paperid:417
Authors:Zhenyu Hou · Xin Lv · Rui Lu · Jiajie Zhang · Yujiang Li · Zijun Yao · Juanzi Li · Jie Tang · Yuxiao Dong
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. While reinforcement learning (RL) holds promise for enabling self-exploration, recent attempts yield modest improvements in complex reasoning. In this paper, we present T1, which scales RL by encouraging exploration and enables an understanding of inference scaling. We first initialize the LLM using synthesized chain-of-thought data that integrates trial-and-error and self-verification. To scale RL training, we promote increased sampling diversity through oversampling. We demonstrate that T1 with open LLMs as its base exhibits inference scaling behavior and achieves superior performance on challenging math reasoning benchmarks. More importantly, we present a simple strategy to examine inference scaling, where increased inference budgets directly lead to T1's better performance without any additional verification.
Paperid:418
Authors:Herman Chau · Helen Jenne · Davis Brown · Jesse He · Mark Raugas · Sara Billey · Henry Kvinge
Abstract:
With recent dramatic increases in AI system capabilities, there has been growing interest in utilizing machine learning for reasoning-heavy, quantitative tasks, particularly mathematics. While there are many resources capturing mathematics at the high-school, undergraduate, and graduate level, there are far fewer resources available that align with the level of difficulty and open-endedness encountered by professional mathematicians working on open problems. To address this, we introduce a new collection of datasets, the Algebraic Combinatorics Dataset Repository (ACD Repo), representing either foundational results or open problems in algebraic combinatorics, a subfield of mathematics that studies discrete structures arising from abstract algebra. Further differentiating our dataset collection is the fact that it targets the conjecturing process. Each dataset includes a research-level question and a large collection of examples (up to 10M in some cases) from which conjectures should be generated. We describe all nine datasets, the different ways machine learning models can be applied to them (e.g., training with narrow models followed by mechanistic interpretability or program synthesis with LLMs), and document some examples where AI systems were able to re-discover non-trivial known results.
Paperid:419
Authors:Andrew Wagenmaker · Zhiyuan Zhou · Sergey Levine
Abstract:
Developing autonomous agents that quickly explore an environment and adapt their behavior online is a canonical challenge in robotics and machine learning. While humans are able to achieve such fast online exploration and adaptation, often acquiring new information and skills in only a handful of interactions, existing algorithmic approaches tend to rely on random exploration and slow, gradient-based behavior updates. How can we endow autonomous agents with capabilities on par with humans? Taking inspiration from recent progress on both in-context learning and large-scale behavioral cloning, in this work we propose training agents to internalize what it means to explore and adapt in-context over the space of ''expert'' behaviors. To achieve this, given access to a dataset of expert demonstrations, we train a long-context generative model to predict expert actions conditioned on a context of past observations and a measure of how ''exploratory'' the expert's behaviors are relative to this context. This enables the model to not only mimic the behavior of an expert, but also, by feeding its past history of interactions into its context, to select different expert behaviors than what have been previously selected, thereby allowing for fast online adaptation and targeted, ''expert-like'' exploration. We demonstrate the effectiveness of our method in both simulated locomotion and manipulation settings, as well as on real-world robotic manipulation tasks, illustrating its ability to learn adaptive, exploratory behavior.
Paperid:420
Authors:Xinzhi Zhang · Hohei Chan · Deheng Ye · Yi Cai · Mengchen Zhao
Abstract:
The ability of agents to collaborate with previously unknown teammates on the fly, known as ad hoc teamwork (AHT), is crucial in many real-world applications. Existing approaches to AHT require online interactions with the environment and some carefully designed teammates. However, these prerequisites can be infeasible in practice. In this work, we extend the AHT problem to the offline setting, where the policy of the ego agent is directly learned from a multi-agent interaction dataset. We propose a hierarchical sequence modeling framework called TAGET that addresses critical challenges in the offline setting, including limited data, partial observability and online adaptation. The core idea of TAGET is to dynamically predict teammate-aware rewards-to-go and sub-goals, so that the ego agent can adapt to the changes of teammates' behaviors in real time. Extensive experimental results show that TAGET significantly outperforms existing solutions to AHT in the offline setting.
Paperid:421
Authors:Guoqing Zhang · Shichao Kan · Fanghui Zhang · Wanru Xu · Yue Zhang · Yigang Cen
Abstract:
Scene Graph Generation (SGG) is a fundamental task in visual understanding, aimed at providing more precise local detail comprehension for downstream applications. Existing SGG methods often overlook the diversity of predicate representations and the consistency among similar predicates when dealing with long-tail distributions. As a result, the model's decision layer fails to effectively capture details from the tail end, leading to biased predictions. To address this, we propose a Noise-Guided Predicate Representation Extraction and Diffusion-Enhanced Discretization (NoDIS) method. On the one hand, expanding the predicate representation space enhances the model's ability to learn both common and rare predicates, thus reducing prediction bias caused by data scarcity. We propose a conditional diffusion model that reconstructs features and increases the diversity of representations for same-category predicates. On the other hand, independent predicate representations in the decision phase increase the learning complexity of the decision layer, making accurate predictions more challenging. To address this issue, we introduce a discretization mapper that learns consistent representations among similar predicates, reducing the learning difficulty and decision ambiguity in the decision layer. To validate the effectiveness of our method, we integrate NoDIS with various SGG baseline models and conduct experiments on multiple datasets. The results consistently demonstrate superior performance.
Paperid:422
Authors:Kajetan Schweighofer · Adrián Arnaiz-Rodríguez · Sepp Hochreiter · Nuria Oliver
Abstract:
Ensembles of Deep Neural Networks, Deep Ensembles, are widely used as a simple way to boost predictive performance. However, their impact on algorithmic fairness is not well understood yet. Algorithmic fairness examines how a model's performance varies across socially relevant groups defined by protected attributes such as age, gender, or race. In this work, we explore the interplay between the performance gains from Deep Ensembles and fairness. Our analysis reveals that they unevenly favor different groups, a phenomenon we term the disparate benefits effect. We empirically investigate this effect using popular facial analysis and medical imaging datasets with protected group attributes and find that it affects multiple established group fairness metrics, including statistical parity and equal opportunity. Furthermore, we identify that the per-group differences in predictive diversity of ensemble members can explain this effect. Finally, we demonstrate that the classical Hardt post-processing method is particularly effective at mitigating the disparate benefits effect of Deep Ensembles by leveraging their better-calibrated predictive distributions.
Paperid:423
Authors:Chunming He · Rihan Zhang · Fengyang Xiao · Chengyu Fang · Longxiang Tang · Yulun Zhang · Linghe Kong · Deng-Ping Fan · Kai Li · Sina Farsiu
Abstract:
Concealed object segmentation (COS) is a challenging problem that focuses on identifying objects that are visually blended into their background. Existing methods often employ reversible strategies to concentrate on uncertain regions but focus only on the mask level, overlooking the value of the RGB domain. To address this, we propose a Reversible Unfolding Network (RUN) in this paper. RUN formulates the COS task as a foreground-background separation process and incorporates an extra residual sparsity constraint to minimize segmentation uncertainties. The optimization solution of the proposed model is unfolded into a multistage network, allowing the original fixed parameters to become learnable. Each stage of RUN consists of two reversible modules: the Segmentation-Oriented Foreground Separation (SOFS) module and the Reconstruction-Oriented Background Extraction (ROBE) module. SOFS applies the reversible strategy at the mask level and introduces Reversible State Space to capture non-local information. ROBE extends this to the RGB domain, employing a reconstruction network to address conflicting foreground and background regions identified as distortion-prone areas, which arise from their separate estimation by independent modules. As the stages progress, RUN gradually facilitates reversible modeling of foreground and background in both the mask and RGB domains, reducing false-positive and false-negative regions. Extensive experiments demonstrate the superior performance of RUN and underscore the promise of unfolding-based frameworks for COS and other high-level vision tasks. The source code will be made publicly available.
Paperid:424
Authors:Richeng Jin · Huaiyu Dai
Abstract:
The prevalent distributed machine learning paradigm faces two critical challenges: communication efficiency and data privacy. SIGNSGD provides a simple-to-implement approach with improved communication efficiency by requiring workers to share only the signs of the gradients. However, it fails to converge in the presence of data heterogeneity, and a simple fix is to add Gaussian noise before taking the signs, which leads to the Noisy SIGNSGD algorithm that enjoys competitive performance while significantly reducing the communication overhead. Existing results suggest that Noisy SIGNSGD with additive Gaussian noise has the same privacy guarantee as classic DP-SGD due to the post-processing property of differential privacy, and logistic noise may be a good alternative to Gaussian noise when combined with the sign-based compressor. Nonetheless, discarding the magnitudes in Noisy SIGNSGD leads to information loss, which may intuitively amplify privacy. In this paper, we make this intuition rigorous and quantify the privacy amplification of the sign-based compressor. Particularly, we analytically show that Gaussian noise leads to a smaller estimation error than logistic noise when combined with the sign-based compressor and may be more suitable for distributed learning with heterogeneous data. Then, we further establish the convergence of Noisy SIGNSGD. Finally, extensive experiments are conducted to validate the theoretical results.
Paperid:425
Authors:Huigen Ye · Hua Xu · An Yan · Yaoyang Cheng
Abstract:
Large Neighborhood Search (LNS) is a widely used method for solving largescale Mixed Integer Linear Programming (MILP) problems. The effectiveness of LNS crucially depends on the choice of the search neighborhood. However, existing strategies either rely on expert knowledge or computationally expensive Machine Learning (ML) approaches, both of which struggle to scale effectively for large problems. To address this, we propose LLM-LNS, a novel Large Language Model (LLM)-driven LNS framework for large-scale MILP problems. Our approach introduces a dual-layer self-evolutionary LLM agent to automate neighborhood selection, discovering effective strategies with scant small-scale training data that generalize well to large-scale MILPs. The inner layer evolves heuristic strategies to ensure convergence, while the outer layer evolves evolutionary prompt strategies to maintain diversity. Experimental results demonstrate that the proposed dual-layer agent outperforms state-of-the-art agents such as FunSearch and EOH. Furthermore, the full LLM-LNS framework surpasses manually designed LNS algorithms like ACP, ML-based LNS methods like CL-LNS, and large-scale solvers such as Gurobi and SCIP. It also achieves superior performance compared to advanced ML-based MILP optimization frameworks like GNN&GBDT and Light-MILPopt, further validating the effectiveness of our approach.
Paperid:426
Authors:Jinpeng Chen · Runmin Cong · Yuzhi Zhao · Hongzheng Yang · Guangneng Hu · Horace Ip · Sam Kwong
Abstract:
Multimodal Continual Instruction Tuning (MCIT) aims to enable Multimodal Large Language Models (MLLMs) to incrementally learn new tasks without catastrophic forgetting, thus adapting to evolving requirements. In this paper, we explore the forgetting caused by such incremental training, categorizing it into superficial forgetting and essential forgetting. Superficial forgetting refers to cases where the model’s knowledge may not be genuinely lost, but its responses to previous tasks deviate from expected formats due to the influence of subsequent tasks’ answer styles, making the results unusable. On the other hand, essential forgetting refers to situations where the model provides correctly formatted but factually inaccurate answers, indicating a true loss of knowledge. Assessing essential forgetting necessitates addressing superficial forgetting first, as severe superficial forgetting can conceal the model’s knowledge state. Hence, we first introduce the Answer Style Diversification (ASD) paradigm, which defines a standardized process for data style transformations across different tasks, unifying their training sets into similarly diversified styles to prevent superficial forgetting caused by style shifts. Building on this, we propose RegLoRA to mitigate essential forgetting. RegLoRA stabilizes key parameters where prior knowledge is primarily stored by applying regularization to LoRA’s weight update matrices, enabling the model to retain existing competencies while remaining adaptable to new tasks. Experimental results demonstrate that our overall method, SEFE, achieves state-of-the-art performance.
Paperid:427
Authors:Xiaomin Li · Mingye Gao · Zhiwei Zhang · Jingxuan Fan · Weiyu Li
Abstract:
Reinforcement Learning from Human Feedback (RLHF) is widely used to align models with human preferences, particularly to enhance the safety of responses generated by LLMs. This method traditionally relies on choosing preferred responses from response pairs. However, due to variations in human opinions and the difficulty of making an overall comparison of two responses, there is a growing shift towards a fine-grained annotation approach, assessing responses based on multiple specific metrics or rules. Selecting and applying these rules efficiently while accommodating the diversity of preference data remains a significant challenge. In this paper, we introduce a dynamic approach that adaptively selects the most critical rules for each pair of responses. We develop a mathematical framework that leverages the maximum discrepancy between each pair of responses and theoretically show that this strategy optimizes the mutual information between the rule-based labeling and the hidden ground-truth preferences. We then train an 8B reward model using the adaptively labeled preference dataset and evaluate its performance on RewardBench. As of January 25, 2025, our model achieved the highest safety performance on the leaderboard, outperforming various larger models.
Paperid:428
Authors:Abudukelimu Wuerkaixi · Qizhou Wang · Sen Cui · Wutong Xu · Bo Han · Gang Niu · Masashi Sugiyama · Changshui Zhang
Abstract:
With the growing deployment of large language models (LLMs) across diverse domains, concerns regarding their safety have grown substantially. LLM unlearning has emerged as a pivotal approach to removing harmful or unlawful contents while maintaining utility. Despite increasing interest, the challenges of continual unlearning, which is common in real-world scenarios, remain underexplored. Successive unlearning tasks often lead to intensified utility degradation. To effectively unlearn targeted knowledge while preserving LLM utility, it is essential to minimize changes in model parameters by selectively updating those linked to the target knowledge, thereby ensuring other knowledge remains unaffected. Building on the task vector framework, we propose a new method named ALKN (Adaptive Localization of Knowledge Negation), which uses dynamic masking to sparsify training gradients and adaptively adjusts unlearning intensity based on inter-task relationships. Comprehensive experiments across three well-established LLM unlearning datasets demonstrate that our approach consistently outperforms baseline methods in both unlearning effectiveness and utility retention under continual unlearning settings.
Paperid:429
Authors:Ari Karchmer · Eran Malach
Abstract:
We study the relationship between gradient-based optimization of parametric models (e.g., neural networks) and optimization of linear combinations of random features. Our main result shows that if a parametric model can be learned using mini-batch stochastic gradient descent (bSGD) without making assumptions about the data distribution, then with high probability, the target function can also be approximated using a polynomial-sized combination of random features. The size of this combination depends on the number of gradient steps and numerical precision used in the bSGD process. This finding reveals fundamental limitations of distribution-free learning in neural networks trained by gradient descent, highlighting why making assumptions about data distributions is often crucial in practice. Along the way, we also introduce a new theoretical framework called average probabilistic dimension complexity (adc), which extends the probabilistic dimension complexity developed by Kamath et al. (2020). We prove that adc has a polynomial relationship with statistical query dimension, and use this relationship to demonstrate an infinite separation between adc and standard dimension complexity.
Paperid:430
Authors:Hoigi Seo · Wongi Jeong · Jae-sun Seo · Se Young Chun
Abstract:
Large-scale text encoders in text-to-image (T2I) diffusion models have demonstrated exceptional performance in generating high-quality images from textual prompts. Unlike denoising modules that rely on multiple iterative steps, text encoders require only a single forward pass to produce text embeddings. However, despite their minimal contribution to total inference time and floating-point operations (FLOPs), text encoders demand significantly higher memory usage, up to eight times more than denoising modules. To address this inefficiency, we propose Skip and Re-use layers (Skrr), a simple yet effective pruning strategy specifically designed for text encoders in T2I diffusion models. Skrr exploits the inherent redundancy in transformer blocks by selectively skipping or reusing certain layers in a manner tailored for T2I tasks, thereby reducing memory consumption without compromising performance. Extensive experiments demonstrate that Skrr maintains image quality comparable to the original model even under high sparsity levels, outperforming existing blockwise pruning methods. Furthermore, Skrr achieves state-of-the-art memory efficiency while preserving performance across multiple evaluation metrics, including the FID, CLIP, DreamSim, and GenEval scores.
Paperid:431
Authors:Hongyang Lei · Xiaolong Cheng · Dan Wang · Qi Qin · Huazhen Huang · Yetao Wu · Qingqing Gu · Luo Ji
Abstract:
Current multimodal alignment strategies primarily use single or unified modality encoders while optimizing the alignment in the original token space. Such a framework is easy to implement and readily incorporates pretrained knowledge, but might result in information bias. To deal with such issues, the joint encoding predictive architecture (JEPA) learns the alignment loss in the latent space, with a predictor that converts the input encoding to the output latent space. However, the application of JEPA in multimodal scenarios has so far been limited. In this paper, we introduce M3-Jepa, a scalable multimodal alignment framework, with the predictor implemented by a multi-directional mixture of experts (MoE). We demonstrate with information-theoretic derivations that the framework can maximize the mutual information by alternating the optimization between different uni-directional tasks. Through thoroughly designed experiments, we show that M3-Jepa can obtain state-of-the-art performance on different modalities and tasks, generalize to unseen datasets and domains, and is computationally efficient in training and inference. Our study indicates that M3-Jepa might provide a new paradigm for self-supervised learning and open-world modeling.
Paperid:432
Authors:Jinlong Pang · Na Di · Zhaowei Zhu · Jiaheng Wei · Hao Cheng · Chen Qian · Yang Liu
Abstract:
Recent studies show that in supervised fine-tuning (SFT) of large language models (LLMs), data quality matters more than quantity. While most data cleaning methods concentrate on filtering entire samples, the quality of individual tokens within a sample can vary significantly. After pre-training, even in high-quality samples, patterns or phrases that are not task-related can be redundant or uninformative. Continuing to fine-tune on these patterns may offer limited benefit and even degrade downstream task performance. In this paper, we investigate token quality from a noisy-label perspective and propose a generic token cleaning pipeline for SFT tasks. Our method filters out uninformative tokens while preserving those carrying key task-specific information. Specifically, we first evaluate token quality by examining the influence of model updates on each token, then apply a threshold-based separation. The token influence can be measured in a single pass with a fixed reference model or iteratively with self-evolving reference models. The benefits and limitations of both methods are analyzed theoretically by error upper bounds. Extensive experiments show that our framework consistently improves performance across multiple downstream tasks.
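The threshold-based separation described above can be sketched in a few lines. The per-token losses and the influence proxy below are hypothetical stand-ins, not the paper's exact scoring:

```python
import numpy as np

def clean_tokens(ref_losses, base_losses, keep_ratio=0.7):
    """Threshold-based token separation (a sketch of the idea, not the paper's
    exact procedure): tokens whose loss drops most under a reference model are
    treated as informative; the rest are masked out of the SFT loss."""
    influence = base_losses - ref_losses          # proxy for model-update influence
    cutoff = np.quantile(influence, 1 - keep_ratio)
    return influence >= cutoff                    # True = keep token in the loss

# Hypothetical per-token losses for one six-token sample.
base = np.array([2.1, 0.3, 1.8, 0.4, 2.5, 0.2])
ref  = np.array([0.9, 0.3, 0.6, 0.4, 1.0, 0.2])
mask = clean_tokens(ref, base, keep_ratio=0.5)
print(mask)  # [ True False  True False  True False]
```

Only the three tokens whose losses move under the reference model survive; the SFT loss would then be averaged over the masked positions only.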
Paperid:433
Authors:Shuoyuan Wang · Sharon Li · Hongxin Wei
Abstract:
Confidence calibration is critical for the safe deployment of machine learning models in the real world. However, this issue has not been fully addressed in vision-language models like CLIP, particularly after fine-tuning. In this work, we demonstrate that existing prompt tuning methods usually lead to a trade-off of calibration between base and new classes: the cross-entropy loss used in standard fine-tuning (e.g., CoOp) causes overconfidence in new classes by increasing textual label divergence, whereas regularization-based tuning (e.g., KgCoOp) maintains the confidence level but results in underconfidence in base classes due to the improved accuracy. Inspired by the observations, we introduce Dynamic Outlier Regularization (DOR) to ensure the confidence calibration on both base and new classes after fine-tuning. In particular, we propose to minimize the feature deviation of novel textual labels (instead of base classes) sampled from a large vocabulary. In effect, DOR prevents the increase in textual divergence for new labels while easing restrictions on base classes. Extensive experiments demonstrate that DOR can enhance the calibration performance of current fine-tuning methods on base and new classes.
Paperid:434
Authors:Shion Takeno · Yoshito Okura · Yu Inatsu · Tatsuya Aoyama · Tomonari Tanaka · Satoshi Akahane · Hiroyuki Hanada · Noriaki Hashimoto · Taro Murayama · Hanju Lee · Shinya Kojima · Ichiro Takeuchi
Abstract:
Gaussian process regression (GPR) or kernel ridge regression is a widely used and powerful tool for nonlinear prediction. Therefore, active learning (AL) for GPR, which actively collects data labels to achieve an accurate prediction with fewer data labels, is an important problem. However, existing AL methods do not theoretically guarantee prediction accuracy for the target distribution. Furthermore, as discussed in the distributionally robust learning literature, specifying the target distribution is often difficult. Thus, this paper proposes two AL methods that effectively reduce the worst-case expected error for GPR, which is the worst-case expectation over target distribution candidates. We show an upper bound on the worst-case expected squared error, which suggests that the error can be made arbitrarily small with a finite number of data labels under mild conditions. Finally, we demonstrate the effectiveness of the proposed methods through synthetic and real-world datasets.
Paperid:435
Authors:Pranjal Aggarwal · Bryan Parno · Sean Welleck
Abstract:
Automated code generation with large language models has gained significant traction, but there remains no guarantee of the correctness of generated code. We aim to use formal verification to provide mathematical guarantees that the generated code is correct. However, generating formally verified code with LLMs is hindered by the scarcity of training data and the complexity of formal proofs. To tackle this challenge, we introduce AlphaVerus, a self-improving framework that bootstraps formally verified code generation by iteratively translating programs from a higher-resource language and leveraging feedback from a verifier. AlphaVerus operates in three phases: exploration of candidate translations, Treefinement -- a novel tree search algorithm for program refinement using verifier feedback, and filtering misaligned specifications and programs to prevent reward hacking. Through this iterative process, AlphaVerus enables the LLaMA-3.1-70B model to generate verified code without human intervention or model fine-tuning. AlphaVerus shows an ability to generate formally verified solutions for HumanEval and MBPP, laying the groundwork for truly trustworthy code-generation agents.
Paperid:436
Authors:KaShun SHUM · Yuzhen Huang · Hongjian Zou · dingqi · YiXuan Liao · Xiaoxin Chen · Qian Liu · Junxian He
Abstract:
Language model pretraining involves training on extensive corpora, where data quality plays a pivotal role. In this work, we aim to directly estimate the contribution of data during pretraining and select pretraining data in an efficient manner. Specifically, we draw inspiration from recent findings showing that compression efficiency (i.e., the normalized loss) of diverse models on certain text correlates strongly with their downstream performance, when the text domain aligns with the downstream benchmark (Huang et al., 2024). Building on this observation, we hypothesize that data on which model losses are predictive of downstream abilities also contribute effectively to learning. To leverage this insight, we introduce data selection based on data's Predictive strength (PreSelect), a lightweight and efficient data selection method that requires training and deploying only a fastText-based scorer. Through comprehensive experiments with 1B and 3B parameter models, we demonstrate that PreSelect trained on 30B tokens surpasses the performance of a vanilla baseline trained on 300B tokens, achieving a 10x reduction in compute requirements. Furthermore, PreSelect significantly outperforms other competitive data selection baselines, such as DCLM and FineWeb-Edu. To facilitate further research, we will open-source our trained data selection scorer along with the curated datasets for the community.
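Operationally, scorer-based selection of this kind reduces to ranking documents by a cheap learned score and keeping a top fraction. A sketch of that selection step; the scorer here is a toy callable, standing in for the paper's trained fastText classifier:

```python
def preselect(docs, score, keep_frac=0.1):
    """Top-fraction data selection by a cheap per-document score (a sketch of
    the selection step only; the interesting part of the paper is how the
    scorer is trained)."""
    ranked = sorted(docs, key=score, reverse=True)
    return ranked[: max(1, int(len(docs) * keep_frac))]

# Toy corpus with document length as a stand-in quality score.
docs = ["aaaa", "ab", "abcdef", "abcd", "a", "abcde", "xyz", "qrstu", "mn", "p"]
kept = preselect(docs, score=len, keep_frac=0.2)
print(kept)  # ['abcdef', 'abcde']
```

At pretraining scale the same two lines run over billions of documents, which is why the scorer must be as light as fastText.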
Paperid:437
Authors:Yaoqin He · Junchen Fu · Kaiwen Zheng · Songpei Xu · Fuhai Chen · Jie Li · Joemon Jose · Xuri Ge
Abstract:
In this paper, we present a novel approach, termed Double-Filter, to ``slim down'' the fine-tuning process of vision-language pre-trained (VLP) models via filtering redundancies in feature inputs and architectural components. We enhance the fine-tuning process using two approaches. First, we develop a new patch selection method incorporating image patch filtering through background and foreground separation, followed by a refined patch selection process. Second, we design a genetic algorithm to eliminate redundant fine-grained architecture layers, improving the efficiency and effectiveness of the model. The former makes patch selection semantics more comprehensive, improving inference efficiency while ensuring semantic representation. The latter's fine-grained layer filter removes architectural redundancy to the extent possible and mitigates the impact on performance. Experimental results demonstrate that the proposed Double-Filter achieves superior efficiency of model fine-tuning and maintains competitive performance compared with advanced efficient fine-tuning methods on two downstream tasks, VQA and NLVR. In addition, it has been proven effective with the METER and ViLT VLP models.
Paperid:438
Authors:Xilin Wei · Xiaoran Liu · Yuhang Zang · Xiaoyi Dong · Pan Zhang · Yuhang Cao · Jian Tong · Haodong Duan · Qipeng Guo · Jiaqi Wang · Xipeng Qiu · Dahua Lin
Abstract:
While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors into V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce VideoRoPE, with a 3D structure designed to preserve spatio-temporal relationships. VideoRoPE features low-frequency temporal allocation to mitigate periodic oscillations, a diagonal layout to maintain spatial symmetry, and adjustable temporal spacing to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. Our code and model weights will be publicly released.
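The low-frequency temporal allocation above can be sketched as a split of the standard RoPE frequency ladder. The split ratio and function interface below are illustrative, not the paper's exact recipe:

```python
import numpy as np

def videorope_freqs(dim=64, t_frac=0.25, base=10000.0):
    """Sketch of the frequency-allocation idea: standard RoPE channels rotate
    at rate base**(-2i/dim), ordered fast to slow. Assigning the slowest
    (lowest-frequency) channels to the temporal axis damps the periodic
    oscillations that distractors exploit; the rest serve the spatial axes."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # ordered high -> low frequency
    n_t = int(len(freqs) * t_frac)
    t_freqs = freqs[-n_t:]    # slowest-rotating channels -> time
    xy_freqs = freqs[:-n_t]   # remaining channels -> height/width
    return t_freqs, xy_freqs

t_freqs, xy_freqs = videorope_freqs()
print(len(t_freqs), len(xy_freqs))  # 8 24
```

Every temporal frequency ends up no larger than every spatial one, so temporal rotation is monotone-slow over long clips.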
Paperid:439
Authors:Nikolaus Howe · Ian McKenzie · Oskar Hollinsworth · Michał Zając · Tom Tseng · Aaron Tucker · Pierre-Luc Bacon · Adam Gleave
Abstract:
Language models exhibit scaling laws, whereby increasing model and dataset size predictably decreases negative log likelihood, unlocking a dazzling array of capabilities. At the same time, even the most capable systems are currently vulnerable to adversarial inputs such as jailbreaks and prompt injections, despite concerted efforts to make them robust. As compute becomes more accessible to both attackers and defenders, which side will benefit more from scale? We attempt to answer this question with a detailed study of robustness on language models spanning three orders of magnitude in parameter count. From the defender's perspective, we find that in the absence of other interventions, increasing model size alone does not consistently improve robustness. In adversarial training, we find that larger models are more sample-efficient and less compute-efficient than smaller models, and often better generalize their defense to new threat models. From the attacker’s perspective, we find that increasing attack compute smoothly and reliably increases attack success rate against both fine-tuned and adversarially trained models. Finally, we show that across model sizes studied, doubling compute on adversarial training only forces an attacker to less than double attack compute to maintain the same attack success rate. However, adversarial training becomes more and more effective on larger models, suggesting that defenders could eventually have the advantage with increasing model size. These results underscore the value of adopting a scaling lens when discussing robustness of frontier models.
Paperid:440
Authors:Mostafa Elhoushi · Jeff Johnson
Abstract:
We present any4, a learned 4-bit weight quantization solution for large language models (LLMs) providing arbitrary numeric representations without requiring pre-processing of weights or activations. any4 yields higher accuracy compared to other related 4-bit numeric representation types: int4, fp4 and nf4, as evaluated on a range of model sizes, generations and families (Llama 2, Llama 3, Mistral and Mixtral). While any4 does not require preprocessing of weights or activations, it is also competitive with orthogonal techniques that require such preprocessing (e.g., AWQ and GPTQ). We also experiment with any3 and any2 and show competitiveness at lower bits. Additionally, we show that we can calibrate using a single curated diverse sample rather than hundreds of samples from a dataset as done in most quantization approaches. We also open source tinygemm, a latency-optimized GPU matrix multiplication library for LLMs, that implements any4 using a GPU-efficient lookup table strategy along with other common quantization methods.
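The lookup-table idea behind "arbitrary numeric representations" can be sketched in numpy. The table values below are illustrative, not a trained any4 codebook:

```python
import numpy as np

def dequantize_any4(codes, lut):
    """Sketch of lookup-table dequantization: each stored weight is a 4-bit
    index into a per-row table of 16 learned, arbitrarily placed values --
    unlike int4/fp4/nf4, whose representable values form fixed grids."""
    return np.take_along_axis(lut, codes, axis=1)

# One output channel with an illustrative non-uniform 16-entry table.
lut = np.sort(np.random.default_rng(0).normal(size=(1, 16)), axis=1)
codes = np.array([[0, 15, 7, 3]])   # 4-bit indices, i.e. the stored weights
weights = dequantize_any4(codes, lut)
print(weights.shape)  # (1, 4)
```

Because the 16 values are learned rather than fixed, the representable set can adapt to each row's weight distribution; tinygemm implements this same gather as a GPU-efficient lookup.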
Paperid:441
Authors:Hongyuan Su · Yu Zheng · Yuan Yuan · Yuming Lin · Depeng Jin · Yong Li
Abstract:
Training reinforcement learning (RL) agents requires extensive trials and errors, which becomes prohibitively time-consuming in systems with costly reward evaluations. To address this challenge, we propose adaptive reward modeling (AdaReMo), which accelerates RL training by decomposing the complicated reward function into multiple localized fast reward models approximating direct reward evaluation with neural networks. These models dynamically adapt to the agent’s evolving policy by fitting the currently explored subspace with the latest trajectories, ensuring accurate reward estimation throughout the entire training process while significantly reducing computational overhead. We empirically show that AdaReMo not only achieves over 1,000 times speedup but also improves the performance by 14.6% over state-of-the-art approaches across three expensive-to-evaluate systems---molecular generation, epidemic control, and spatial planning. Code and data for the project are provided at https://anonymous.4open.science/r/RLS-CFCE
Paperid:442
Authors:Run He · Di Fang · Yicheng Xu · Yawen Cui · Ming Li · Cen Chen · Ziqian Zeng · HUIPING ZHUANG
Abstract:
Exemplar-Free Class-Incremental Learning (EFCIL) aims to sequentially learn from distinct categories without retaining exemplars but easily suffers from catastrophic forgetting of learned knowledge. While existing EFCIL methods leverage knowledge distillation to alleviate forgetting, they still face two critical challenges: semantic shift and decision bias. Specifically, the embeddings of old tasks shift in the embedding space after learning new tasks, and the classifier becomes biased towards new tasks due to training solely with new data, thereby hindering the balance between old and new knowledge. To address these issues, we propose the Dual-Projection Shift Estimation and Classifier Reconstruction (DPCR) approach for EFCIL. DPCR effectively estimates semantic shift through a dual projection, which combines a learnable transformation with a row-space projection to capture both task-wise and category-wise shifts. Furthermore, to mitigate decision bias, DPCR employs ridge regression to reformulate classifier training as a reconstruction process. This reconstruction exploits previous information encoded in the covariance and prototype of each class after calibration with the estimated shift, thereby reducing decision bias. Extensive experiments demonstrate that, across various datasets, DPCR effectively balances old and new tasks, outperforming state-of-the-art EFCIL methods.
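Reconstructing a classifier by ridge regression from accumulated statistics, rather than by gradient training on new data, can be sketched as follows. The statistics and regularizer here are illustrative; DPCR additionally calibrates them with its estimated semantic shift:

```python
import numpy as np

def reconstruct_classifier(gram, cross, lam=1.0):
    """Closed-form ridge-regression classifier from accumulated statistics
    (a sketch of the reconstruction idea, not DPCR's exact procedure):
    gram = sum_x x x^T and cross = sum_x x y^T over all seen features with
    one-hot labels. Neither requires storing exemplars, which is what makes
    this usable in EFCIL."""
    d = gram.shape[0]
    return np.linalg.solve(gram + lam * np.eye(d), cross)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))            # toy features
y = (X[:, 0] > 0).astype(int)            # toy two-class labels
Y = np.eye(2)[y]                         # one-hot targets
W = reconstruct_classifier(X.T @ X, X.T @ Y)
acc = ((X @ W).argmax(axis=1) == y).mean()
print(f"accuracy of reconstructed classifier: {acc:.2f}")
```

Since `gram` and `cross` are simple sums, they can be updated task by task and the classifier re-solved at any point, which is how reconstruction sidesteps the new-task bias of incremental gradient training.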
Paperid:443
Authors:Dujian Ding · Ankur Mallick · Shaokun Zhang · Chi Wang · Daniel Madrigal · Mirian Hipolito Garcia · Menglin Xia · Laks V.S Lakshmanan · Qingyun Wu · Victor Ruehle
Abstract:
Large language models (LLMs) are powerful tools but are often expensive to deploy at scale. LLM query routing mitigates this by dynamically assigning queries to models of varying cost and quality to obtain a desired tradeoff. Prior query routing approaches generate only one response from the selected model, and a single response from a small (inexpensive) model is often not good enough to beat a response from a large (expensive) model; as a result, they end up overusing the large model and missing out on potential cost savings. However, it is well known that for small models, generating multiple responses and selecting the best can enhance quality while remaining cheaper than a single large-model response. We leverage this idea to propose BEST-Route, a novel routing framework that chooses a model and the number of responses to sample from it based on query difficulty and the quality thresholds. Experiments on real-world datasets demonstrate that our method reduces costs by up to 60\% with less than 1\% performance drop.
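The routing-plus-best-of-n idea can be sketched in a few lines. The threshold, sample count, and model/scorer interfaces below are illustrative placeholders, not BEST-Route's actual API (which learns these choices from data):

```python
import random

def route(difficulty, small_model, large_model, score,
          difficulty_threshold=0.7, n_samples=4):
    """Sketch of the routing idea: easy queries get best-of-n sampling from the
    cheap model; hard queries go straight to the expensive model."""
    if difficulty >= difficulty_threshold:
        return large_model()
    candidates = [small_model() for _ in range(n_samples)]
    return max(candidates, key=score)   # best-of-n selection

random.seed(0)
small = lambda: f"small-answer-{random.randint(0, 9)}"
large = lambda: "large-answer"
score = lambda s: int(s[-1])            # toy quality score over small answers

print(route(0.3, small, large, score))  # best of 4 cheap samples
print(route(0.9, small, large, score))  # routed to the large model
```

The cost saving comes from the fact that n cheap samples plus a reranker is often still cheaper than one large-model call.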
Paperid:444
Authors:Grant Forbes · Leonardo Villalobos-Arias · Jianxun Wang · Arnav Jhala · David Roberts
Abstract:
Recent RL research has utilized reward shaping--particularly complex shaping rewards such as intrinsic motivation (IM)--to encourage agent exploration in sparse-reward environments. While often effective, ``reward hacking'' can lead to the shaping reward being optimized at the expense of the extrinsic reward, resulting in a suboptimal policy. Potential-Based Reward Shaping (PBRS) techniques such as Generalized Reward Matching (GRM) and Policy-Invariant Explicit Shaping (PIES) have mitigated this. These methods allow for implementing IM without altering optimal policies. In this work we show that they are effectively unsuitable for complex, exploration-heavy environments with long-duration episodes. To remedy this, we introduce Action-Dependent Optimality Preserving Shaping (ADOPS), a method of converting intrinsic rewards to an optimality-preserving form that allows agents to utilize IM more effectively in the extremely sparse environment of Montezuma's Revenge. We also prove ADOPS accommodates reward shaping functions that cannot be written in a potential-based form: while PBRS-based methods require the cumulative discounted intrinsic return to be independent of actions, ADOPS allows intrinsic cumulative returns to depend on agents' actions while still preserving the optimal policy set. We show how action-dependence enables ADOPS to preserve optimality while learning in complex, sparse-reward environments where other methods struggle.
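For context, classical PBRS (Ng et al.) adds F(s, s') = gamma*Phi(s') - Phi(s) to the extrinsic reward; because F telescopes over any trajectory, the optimal policy is unchanged -- and this action-independence is exactly the restriction ADOPS relaxes. A minimal numeric check of the telescoping property (potentials and gamma are illustrative):

```python
def shaped_reward(r_ext, phi_s, phi_s_next, gamma=0.99, done=False):
    """Potential-based shaping: F = gamma*Phi(s') - Phi(s). The potential of a
    terminal state is conventionally taken to be 0."""
    phi_next = 0.0 if done else phi_s_next
    return r_ext + gamma * phi_next - phi_s

# Over an episode the shaping terms telescope to gamma^T*Phi(s_T) - Phi(s_0),
# so no policy can gain shaping return by detouring -- hence policy invariance.
potentials = [1.0, 2.0, 0.5, 0.0]
gamma = 0.5
total_shaping = sum(
    (gamma ** t) * (gamma * potentials[t + 1] - potentials[t])
    for t in range(len(potentials) - 1)
)
print(round(total_shaping, 6))  # -1.0 == gamma^3 * Phi(s_3) - Phi(s_0)
```

ADOPS's contribution is preserving this invariance guarantee even when the shaping term depends on the action taken, which F above cannot express.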
Paperid:445
Authors:Qihong Song · XitingLiu · Hongyuan Zhu · Joey Tianyi Zhou · Xi Peng · Peng Hu
Abstract:
Recently, deep unsupervised hashing has gained considerable attention in image retrieval due to its advantages in cost-free data labeling, computational efficiency, and storage savings. Although existing methods achieve promising performance by leveraging inherent visual structures within data, they primarily focus on learning discriminative features from unlabeled images through the limited internal knowledge, resulting in an intrinsic upper bound on performance. To break through this intrinsic limitation, we propose a novel method called Deep Unsupervised Hashing with External Guidance (DUH-EG), which incorporates external textual knowledge as semantic guidance to enhance discrete representation learning. Specifically, our DUH-EG: i) first selects representative semantic nouns from an external textual database by minimizing their redundancy, then matches images with them to extract more discriminative external features; and ii) presents a novel bidirectional contrastive learning mechanism to maximize agreement between hash codes in internal and external spaces, thereby capturing discrimination from both external and intrinsic structures in Hamming space. Extensive experiments on four benchmark datasets demonstrate that our DUH-EG remarkably outperforms existing state-of-the-art hashing methods.
Paperid:446
Authors:Lei Yan · Xin Zhang · Qing Mai
Abstract:
Scientific and engineering applications are often heterogeneous, making it beneficial to account for latent clusters or subpopulations when learning low-dimensional subspaces in supervised learning, and vice versa. In this paper, we combine the concept of subspace clustering with model-based sufficient dimension reduction and thus generalize the sufficient dimension reduction framework from the homogeneous regression setting to heterogeneous data applications. In particular, we propose the mixture of principal fitted components (mixPFC) model, a novel framework that simultaneously achieves clustering, subspace estimation, and variable selection, providing a unified solution for high-dimensional heterogeneous data analysis. We develop a group Lasso penalized expectation-maximization (EM) algorithm and obtain its non-asymptotic convergence rate. Through extensive simulation studies, mixPFC demonstrates superior performance compared to existing methods across various settings. Applications to real-world datasets further highlight its effectiveness and practical advantages.
Paperid:447
Authors:Yuzhong Hong · Hanshan Zhang · Junwei Bao · Hongfei Jiang · yang song
Abstract:
Since the debut of DPO, it has been shown that aligning a target LLM with human preferences via the KL-constrained RLHF loss is mathematically equivalent to a special kind of reward modeling task. Concretely, the task requires: 1) using the target LLM to parameterize the reward model, and 2) tuning the reward model so that it has a 1:1 linear relationship with the true reward. However, we identify a significant issue: the DPO loss might have multiple minimizers, of which only one satisfies the required linearity condition. The problem arises from a well-known issue of the underlying Bradley-Terry preference model: it does not always have a unique maximum likelihood estimator (MLE). Consequently, the minimizer of the RLHF loss might be unattainable because it is merely one among many minimizers of the DPO loss. As a better alternative, we propose an energy-based preference model (EBM) that always has a unique MLE, inherently satisfying the linearity requirement. To showcase the practical utility of replacing BTM with our EBM in the context of offline alignment, we adapt a simple yet scalable objective function from the recent literature on fitting EBMs and name it Energy Preference Alignment (EPA). Empirically, we demonstrate that EPA consistently delivers better performance on open benchmarks compared to DPO, thereby validating the theoretical superiority of our EBM.
Paperid:448
Authors:Suning Huang · Zheyu Zhang · Tianhai Liang · Yihan Xu · Zhehao Kou · Chenhao Lu · Guowei Xu · Zhengrong Xue · Huazhe Xu
Abstract:
Visual deep reinforcement learning (RL) enables robots to acquire skills from visual input for unstructured tasks. However, current algorithms suffer from low sample efficiency, limiting their practical applicability. In this work, we present MENTOR, a method that improves both the architecture and optimization of RL agents. Specifically, MENTOR replaces the standard multilayer perceptron (MLP) with a mixture-of-experts (MoE) backbone and introduces a task-oriented perturbation mechanism. MENTOR outperforms state-of-the-art methods across three simulation benchmarks and achieves an average success rate of 83% on three challenging real-world robotic manipulation tasks, significantly surpassing the 32% success rate of the strongest existing model-free visual RL algorithm. These results underscore the importance of sample efficiency in advancing visual RL for real-world robotics. Experimental videos are available at https://mentor-vrl.github.io/.
Paperid:449
Authors:Louis Airale · Antonio Longa · Mattia Rigon · Andrea Passerini · Roberto Passerone
Abstract:
Graph transformers extend global self-attention to graph-structured data, achieving notable success in graph learning. Recently, random walk structural encoding (RWSE) has been found to further enhance their predictive power by encoding both structural and positional information into the edge representation. However, RWSE cannot always distinguish between edges that belong to different local graph patterns, which reduces its ability to capture the full structural complexity of graphs. This work introduces Simple Path Structural Encoding (SPSE), a novel method that utilizes simple path counts for edge encoding. We show theoretically and experimentally that SPSE overcomes the limitations of RWSE, providing a richer representation of graph structures, particularly in capturing local cyclic patterns. To make SPSE computationally tractable, we propose an efficient approximate algorithm for simple path counting. SPSE demonstrates significant performance improvements over RWSE on various benchmarks, including molecular and long-range graph datasets, achieving statistically significant gains in discriminative tasks. These results establish SPSE as a powerful edge encoding alternative for enhancing the expressivity of graph transformers.
Paperid:450
Authors:Yinfeng Chen · Jin Liu · Rui Qiu
Abstract:
The normal vectors obtained from the support vector machine (SVM) method offer the potential to achieve sufficient dimension reduction in both classification and regression scenarios. Motivated by this, we introduce a unified framework for nonlinear sufficient dimension reduction based on classification ensembles. Kernel principal SVM, a nonlinear sufficient dimension reduction method using the reproducing kernel Hilbert space and SVM, can almost be regarded as a special case of this framework. In particular, the neural network function class is considered as a replacement for the reproducing kernel Hilbert space to implement more flexible deep nonlinear sufficient dimension reduction. Its advantages are particularly prominent in cases of severe data contamination, large samples, and complex inputs. Theoretically, we demonstrate that the optimal solution of the population objective function is unbiased in the sense of the central $\sigma$-field. A non-asymptotic upper bound for the estimation error is also provided. Through simulations and real data analysis, our proposed deep nonlinear sufficient dimension reduction method exhibits considerable competitiveness.
Paperid:451
Authors:Jonathan Geuter · Gregor Kornhardt · Ingimar Tomasson · Vaios Laschos
Abstract:
Optimal Transport (OT) problems are a cornerstone of many applications, but solving them is computationally expensive. To address this, we propose UNOT (Universal Neural Optimal Transport), a novel framework capable of accurately predicting (entropic) OT distances and plans between discrete measures of variable resolution for a given cost function. UNOT builds on Fourier Neural Operators (FNOs), a universal class of neural networks that map between function spaces and are discretization-invariant, which enables our network to process measures of varying sizes. The network is trained adversarially using a second, generating network and a self-supervised bootstrapping loss. We theoretically justify the use of FNOs, prove that our generator is universal, and that minimizing the bootstrapping loss provably minimizes the ground truth loss. Through extensive experiments, we show that our network not only accurately predicts optimal transport distances and plans across a wide range of datasets, but also captures the geometry of the Wasserstein space correctly. Furthermore, we show that our network can be used as a state-of-the-art initialization for the Sinkhorn algorithm, significantly outperforming existing approaches.
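For context, the Sinkhorn algorithm that UNOT initializes alternates marginal rescalings of the Gibbs kernel derived from the cost matrix. A minimal dense sketch of the standard entropic-OT iterations (not the paper's network; the plain-list representation and function name are illustrative):

```python
import math

def sinkhorn(cost, a, b, eps=0.1, iters=200):
    """Entropic OT via Sinkhorn iterations: alternately rescale the
    Gibbs kernel K = exp(-C / eps) so that the transport plan's row
    sums match marginal a and its column sums match marginal b."""
    n, m = len(a), len(b)
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Entropic transport plan: P[i][j] = u[i] * K[i][j] * v[j]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

A learned initialization replaces the all-ones starting scalings `u`, `v` with a prediction close to the fixed point, cutting the number of iterations needed.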
Paperid:452
Authors:Emanuele Palumbo · Moritz Vandenhirtz · Alain Ryser · Imant Daunhawer · Julia Vogt
Abstract:
The hierarchical structure inherent in many real-world datasets makes the modeling of such hierarchies a crucial objective in both unsupervised and supervised machine learning. While recent advancements have introduced deep architectures specifically designed for hierarchical clustering, we adopt a critical perspective on this line of research. Our findings reveal that these methods face significant limitations in scalability and performance when applied to realistic datasets. Given these findings, we present an alternative approach and introduce a lightweight method that builds on pre-trained non-hierarchical clustering models. Remarkably, our approach outperforms specialized deep models for hierarchical clustering, and it is broadly applicable to any pre-trained clustering model that outputs logits, without requiring any fine-tuning. To highlight the generality of our approach, we extend its application to a supervised setting, demonstrating its ability to recover meaningful hierarchies from a pre-trained ImageNet classifier. Our results establish a practical and effective alternative to existing deep hierarchical clustering methods, with significant advantages in efficiency, scalability and performance.
Paperid:453
Authors:Eric Elmoznino · Tom Marty · Tejas Kasetty · Léo Gagnon · Sarthak Mittal · Mahan Fathi · Dhanya Sridhar · Guillaume Lajoie
Abstract:
A central goal of machine learning is generalization. While the No Free Lunch Theorem states that we cannot obtain theoretical guarantees for generalization without further assumptions, in practice we observe that simple models which explain the training data generalize best—a principle called Occam's razor. Despite the need for simple models, most current approaches in machine learning only minimize the training error, and at best indirectly promote simplicity through regularization or architecture design. Here, we draw a connection between Occam's razor and in-context learning—an emergent ability of certain sequence models like Transformers to learn at inference time from past observations in a sequence. In particular, we show that the next-token prediction loss used to train in-context learners is directly equivalent to a data compression technique called prequential coding, and that minimizing this loss amounts to jointly minimizing both the training error and the complexity of the model that was implicitly learned from context. Our theory and the empirical experiments we use to support it not only provide a normative account of in-context learning, but also elucidate the shortcomings of current in-context learning methods, suggesting ways in which they can be improved.
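The prequential-coding idea can be illustrated with a simple frequency model: each symbol is coded under the model fit to the prefix seen so far, and the total code length is exactly a sum of next-token log losses. A toy sketch with Laplace smoothing standing in for the learner (an assumed, illustrative setup, not the paper's):

```python
import math

def prequential_code_length(sequence, alphabet_size):
    """Prequential (predict-then-update) code length in nats: each
    symbol is coded under the Laplace-smoothed frequency model fit to
    the prefix seen so far, then the model is updated with that symbol.
    This mirrors summing the next-token prediction loss over a sequence."""
    counts = {}
    total = 0.0
    for t, symbol in enumerate(sequence):
        p = (counts.get(symbol, 0) + 1) / (t + alphabet_size)  # Laplace estimate
        total += -math.log(p)
        counts[symbol] = counts.get(symbol, 0) + 1
    return total
```

A learner that fits the data with a simpler implicit model assigns higher probability to later symbols, yielding a shorter prequential code; that is the compression view of Occam's razor.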
Paperid:454
Authors:Aaron Wenteler · Martina Occhetta · Nikhil Branson · Victor Curean · Magdalena Huebner · William Dee · William Connell · Siu Chung · Alex Hawkins-Hooker · Yasha Ektefaie · César Miguel Valdez Córdova · Amaya Gallagher-Syed
Abstract:
In silico modeling of transcriptional responses to perturbations is crucial for advancing our understanding of cellular processes and disease mechanisms. We present PertEval-scFM, a standardized framework designed to evaluate models for perturbation effect prediction. We apply PertEval-scFM to benchmark zero-shot single-cell foundation model (scFM) embeddings against simpler baseline models to assess whether these contextualized representations enhance perturbation effect prediction. Our results show that scFM embeddings do not provide consistent improvements over baseline models, especially under distribution shift. Additionally, all models struggle with predicting strong or atypical perturbation effects. Overall, this study provides a systematic evaluation of zero-shot scFM embeddings for perturbation effect prediction, highlighting the challenges of this task and revealing the limitations of current-generation scFMs. Our findings underscore the need for specialized models and high-quality datasets that capture a broader range of cellular states. Source code and documentation can be found at: https://anonymous.4open.science/r/PertEval-C674/README.md.
Paperid:455
Authors:Suyuan Liu · Hao Yu · Hao Tan · KE LIANG · Siwei Wang · Shengju Yu · En Zhu · Xinwang Liu
Abstract:
Multi-view clustering (MVC) leverages complementary information from diverse data sources to enhance clustering performance. However, its practical deployment in distributed and privacy-sensitive scenarios remains challenging. Federated multi-view clustering (FMVC) has emerged as a potential solution, but existing approaches suffer from substantial limitations, including excessive communication overhead, insufficient privacy protection, and inadequate handling of missing views. To address these issues, we propose Efficient Federated Incomplete Multi-View Clustering (EFIMVC), a novel framework that introduces a localized optimization strategy to significantly reduce communication costs while ensuring theoretical convergence. EFIMVC employs both view-specific and shared anchor graphs as communication variables, thereby enhancing privacy by avoiding the transmission of sensitive embeddings. To further improve robustness, we design a dual-anchor alignment mechanism that mitigates anchor misalignment during graph fusion. Moreover, EFIMVC seamlessly extends to scenarios with missing views, making it a practical and scalable solution for real-world applications. Extensive experiments on benchmark datasets demonstrate the superiority of EFIMVC in clustering accuracy, communication efficiency, and privacy preservation.
Paperid:456
Authors:Tianqi Du · Haotian Huang · Yifei Wang · Yisen Wang
Abstract:
Large language models (LLMs) have exhibited impressive performance and surprising emergent properties. However, their effectiveness is still limited by the fixed context window of the transformer architecture, posing challenges for length generalization. In this work, we propose a new perspective on length generalization by focusing on the output distribution, introducing a shift from the conventional emphasis on input features such as positional encodings or data structures. Specifically, through case studies on synthetic tasks, we highlight the critical role of \textbf{output alignment}: the consistency of output distributions across sequences of varying lengths. Extending this insight to natural language tasks, we propose a metric called Long-Short Misalignment to quantify output alignment, uncovering a strong correlation between this metric and length generalization performance. Building on these findings, we develop a regularization loss that enhances output alignment during training. Extensive experiments validate the effectiveness of our approach, offering a fresh perspective for understanding and improving length generalization in LLMs.
Paperid:457
Authors:Guangting Zheng · Yehao Li · Yingwei Pan · Jiajun Deng · Ting Yao · Yanyong Zhang · Tao Mei
Abstract:
Autoregressive models have emerged as a powerful generative paradigm for visual generation. The current de facto standard of next token prediction commonly operates over a single-scale sequence of dense image tokens, and is incapable of utilizing global context, especially for early token prediction. In this paper, we introduce a new autoregressive design to model a hierarchy from a few low-resolution image tokens to the typical dense image tokens, and delve into a thorough hierarchical dependency across multi-scale image tokens. Technically, we present a Hierarchical Masked Autoregressive model (Hi-MAR) that pivots on low-resolution image tokens to trigger hierarchical autoregressive modeling in a multi-phase manner. Hi-MAR learns to predict a few image tokens in low resolution, functioning as intermediary pivots to reflect global structure, in the first phase. Such pivots act as additional guidance to strengthen the next autoregressive modeling phase by shaping global structural awareness of typical dense image tokens. A new Diffusion Transformer head is further devised to amplify the global context among all tokens for mask token prediction. Extensive evaluations on both class-conditional and text-to-image generation tasks demonstrate that Hi-MAR outperforms typical AR baselines, while requiring fewer computational costs.
Paperid:458
Authors:Tianze Yang · Yucheng Shi · Mengnan Du · Xuansheng Wu · Qiaoyu Tan · Jin Sun · Ninghao Liu
Abstract:
Vector-Quantized Generative Models (VQGMs) have emerged as powerful tools for image generation. However, the key component of VQGMs---the codebook of discrete tokens---is still not well understood, e.g., which tokens are critical to generate an image of a certain concept? This paper introduces Concept-Oriented Token Explanation (CORTEX), a novel approach for interpreting VQGMs by identifying concept-specific token combinations. Our framework employs two methods: (1) a sample-level explanation method that analyzes token importance scores in individual images, and (2) a codebook-level explanation method that explores the entire codebook to find globally relevant tokens. Experimental results demonstrate CORTEX's efficacy in providing clear explanations of token usage in the generative process, outperforming baselines across multiple pretrained VQGMs. Besides enhancing VQGM transparency, CORTEX is useful in applications such as targeted image editing and shortcut feature detection.
Paperid:459
Authors:Markelle Kelly · Alex Boyd · Samuel Showalter · Padhraic Smyth · Mark Steyvers
Abstract:
Applications of machine learning often involve making predictions based on both model outputs and the opinions of human experts. In this context, we investigate the prediction of human expert consensus for $K$-ary classification problems. Given a budget for querying experts, the objective is to achieve the maximum accuracy (in terms of predicting expert consensus) while leveraging the class probability estimates of pre-trained classifiers. We develop a general Bayesian framework for this problem, modeling expert correlation via a joint latent representation, enabling simulation-based inference about the utility of additional expert queries, as well as inference of posterior distributions over unobserved expert labels. We apply our approach to two real-world medical classification problems, as well as to CIFAR-10H and ImageNet-16H, demonstrating substantial reductions in the cost of querying human experts while maintaining high prediction accuracy.
Paperid:460
Authors:Rui Zhang · Yun Shen · Hongwei Li · Wenbo Jiang · Hanxiao Chen · Yuan Zhang · Guowen Xu · Yang Zhang
Abstract:
Recent research highlights concerns about the trustworthiness of third-party Pre-Trained Language Models (PTLMs) due to potential backdoor attacks. These backdoored PTLMs, however, are effective only for specific pre-defined downstream tasks. In reality, these PTLMs can be adapted to many other unrelated downstream tasks. Such adaptation may lead to unforeseen consequences in downstream model outputs, consequently raising user suspicion and compromising attack stealthiness. We refer to this phenomenon as backdoor complications. In this paper, we undertake the first comprehensive quantification of backdoor complications. Through extensive experiments using 4 prominent PTLMs and 16 text classification benchmark datasets, we demonstrate the widespread presence of backdoor complications in downstream models fine-tuned from backdoored PTLMs. The output distribution of triggered samples significantly deviates from that of clean samples. Consequently, we propose a backdoor complication reduction method leveraging multi-task learning to mitigate complications without prior knowledge of downstream tasks. The experimental results demonstrate that our proposed method can effectively reduce complications while maintaining the efficacy and consistency of backdoor attacks.
Paperid:461
Authors:Jingbang Chen · Xinyuan Cao · Alicia Stepin · Li Chen
Abstract:
We study learning-augmented binary search trees (BSTs) via Treaps with carefully designed priorities. The result is a simple search tree in which the depth of each item $x$ is determined by its predicted weight $w_x$. Specifically, each item $x$ is assigned a composite priority of $-\lfloor\log\log(1/w_x)\rfloor + U(0, 1)$, where $U(0, 1)$ is a uniform random variable. By choosing $w_x$ as the relative frequency of $x$, the resulting search trees achieve static optimality. This approach generalizes the recent learning-augmented BSTs [Lin-Luo-Woodruff ICML'22], which only work for Zipfian distributions, by extending them to arbitrary input distributions. Furthermore, we demonstrate that our method can be generalized to a B-Tree data structure using the B-Treap approach [Golovin ICALP'09]. Our search trees are also capable of leveraging localities in the access sequence through online self-reorganization, thereby achieving the working-set property. Additionally, they are robust to prediction errors and support dynamic operations, such as insertions, deletions, and prediction updates. We complement our analysis with an empirical study, demonstrating that our method outperforms prior work and classic data structures.
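The composite priority above can be sketched directly; base-2 logarithms and the max-priority-at-root Treap convention are assumed here, since the abstract fixes neither:

```python
import math
import random

def composite_priority(w, rng):
    """Composite priority -floor(log log(1/w)) + U(0, 1) for an item
    with predicted relative frequency w in (0, 1). Base-2 logarithms
    are assumed; more frequent items receive a larger integer base
    term, so under the max-priority-at-root convention they end up
    nearer the root, i.e., at smaller depth."""
    base = -math.floor(math.log2(math.log2(1.0 / w)))
    return base + rng.random()  # uniform tie-breaker within a frequency class
```

For example, an item with frequency 1/4 gets base term -1 while one with frequency 2^-16 gets -4, so the frequent item always receives the larger priority regardless of the uniform noise.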
Paperid:462
Authors:Xubin Wang · Jianfei Wu · Yuan Yichen · Mingzhe Li · Deyu Cai · Weijia Jia
Abstract:
Diversity in demonstration selection is crucial for enhancing model generalization, as it enables a broader coverage of structures and concepts. However, constructing an appropriate set of demonstrations has remained a focal point of research. This paper presents Relevance-Diversity Enhanced Selection (RDES), an innovative approach that leverages reinforcement learning to optimize the selection of diverse reference demonstrations for text classification tasks using Large Language Models (LLMs), especially in few-shot prompting scenarios. RDES employs a Q-learning framework to dynamically identify demonstrations that maximize both diversity and relevance to the classification objective by calculating a diversity score based on label distribution among selected demonstrations. This method ensures a balanced representation of reference data, leading to improved classification accuracy. Through extensive experiments on four benchmark datasets and involving 12 closed-source and open-source LLMs, we demonstrate that RDES significantly enhances classification accuracy compared to ten established baselines. Furthermore, we investigate incorporating Chain-of-Thought (CoT) reasoning, which further enhances the model's predictive performance. The results underscore the potential of reinforcement learning to facilitate adaptive demonstration selection and deepen the understanding of classification challenges.
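One plausible instantiation of a label-distribution diversity score is the entropy of the labels among the selected demonstrations; the paper's exact score may differ, so treat this as an illustrative assumption:

```python
import math
from collections import Counter

def diversity_score(labels):
    """Shannon entropy of the label distribution among the selected
    demonstrations: higher when labels are spread evenly across
    classes, zero when all selected demonstrations share one label."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())
```

A reinforcement-learning selector can fold such a score into its reward so that picking yet another demonstration of an already-represented class is penalized.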
Paperid:463
Authors:David Boetius · Stefan Leue · Tobias Sutter
Abstract:
Probabilistic verification of neural networks is concerned with formally analysing the output distribution of a neural network under a probability distribution of the inputs. Examples of probabilistic verification include verifying the demographic parity fairness notion or quantifying the safety of a neural network. We present a new algorithm for the probabilistic verification of neural networks based on an algorithm for computing and iteratively refining lower and upper bounds on probabilities over the outputs of a neural network. By applying state-of-the-art bound propagation and branch and bound techniques from non-probabilistic neural network verification, our algorithm significantly outpaces existing probabilistic verification algorithms, reducing solving times for various benchmarks from the literature from tens of minutes to tens of seconds. Furthermore, our algorithm compares favourably even to dedicated algorithms for restricted subsets of probabilistic verification. We complement our empirical evaluation with a theoretical analysis, proving that our algorithm is sound and, under mildly restrictive conditions, also complete when using a suitable set of heuristics.
Paperid:464
Authors:Steinar Laenen · Peter Macgregor · He Sun
Abstract:
In the kernel density estimation (KDE) problem, we are given a set $X$ of data points in $\mathbb{R}^d$, a kernel function $k: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$, and a query point $\mathbf{q} \in \mathbb{R}^d$, and the objective is to quickly output an estimate of $\sum_{\mathbf{x} \in X} k(\mathbf{q}, \mathbf{x})$. In this paper, we consider KDE in the dynamic setting, and introduce a data structure that efficiently maintains the _estimates_ for a set of query points as data points are added to $X$ over time. Based on this, we design a dynamic data structure that maintains a sparse approximation of the fully connected similarity graph on $X$, and develop a fast dynamic spectral clustering algorithm. We further evaluate the effectiveness of our algorithms on both synthetic and real-world datasets.
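The quantity being maintained is the plain KDE sum; a naive static baseline for a Gaussian kernel looks as follows (the paper's contribution is maintaining such estimates dynamically, far faster than recomputing this sum):

```python
import math

def kde(query, points, bandwidth=1.0):
    """Naive O(|X| d) evaluation of the KDE sum with a Gaussian kernel
    k(q, x) = exp(-||q - x||^2 / (2 h^2)), where h is the bandwidth.
    A dynamic data structure would keep this value up to date as
    points are inserted, instead of rescanning all of X per query."""
    total = 0.0
    for x in points:
        dist2 = sum((qi - xi) ** 2 for qi, xi in zip(query, x))
        total += math.exp(-dist2 / (2.0 * bandwidth ** 2))
    return total
```

Each insertion into $X$ changes every query's estimate by one kernel evaluation, which is what makes incremental maintenance natural.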
Paperid:465
Authors:Swetha Ganesh · Washim Mondal · Vaneet Aggarwal
Abstract:
This work examines average-reward reinforcement learning with general policy parametrization. Existing state-of-the-art (SOTA) guarantees for this problem are either suboptimal or hindered by several challenges, including poor scalability with respect to the size of the state-action space, high iteration complexity, and dependence on knowledge of mixing times and hitting times. To address these limitations, we propose a Multi-level Monte Carlo-based Natural Actor-Critic (MLMC-NAC) algorithm. Our work is the first to achieve a global convergence rate of $\widetilde{\mathcal{O}}(1/\sqrt{T})$ for average-reward Markov Decision Processes (MDPs) (where $T$ is the horizon length), without requiring knowledge of mixing and hitting times. Moreover, the convergence rate does not scale with the size of the state space, making the algorithm applicable even to infinite state spaces.
Paperid:466
Authors:Boyuan Wu · wang · Xianwei Lin · Jiachun Xu · Jikai Yu · Zhou Shicheng · Hongda Chen · Lianxin Hu
Abstract:
Whole Slide Image (WSI) analysis is framed as a Multiple Instance Learning (MIL) problem, but existing methods struggle with non-stackable data due to inconsistent instance lengths, which degrades performance and efficiency. We propose a Distributed Parallel Gradient Stacking (DPGS) framework with Deep Model-Gradient Compression (DMGC) to address this. DPGS enables lossless MIL data stacking for the first time, while DMGC accelerates distributed training via joint gradient-model compression. Experiments on Camelyon16 and TCGA-Lung datasets demonstrate up to 31× faster training, up to a 99.2% reduction in model communication size at convergence, and up to a 9.3% improvement in accuracy compared to the baseline. To our knowledge, this is the first work to solve non-stackable data in MIL while improving both speed and accuracy.
Paperid:467
Authors:Ankit Ghosh · Gargee Kashyap · Sarthak Mittal · Nupur Jain · Raghavan B Sunoj · Abir De
Abstract:
Yield of chemical reactions generally depends on the activation barrier, i.e., the energy difference between the reactant and the transition state. Computing the transition state from the reactant and product graphs requires prior knowledge of the correct node alignment (i.e., atom mapping), which is not available in yield prediction datasets. In this work, we propose YieldNet, a neural yield prediction model, which tackles these challenges. Here, we first approximate the atom mapping between the reactants and products using a differentiable node alignment network. We then use this approximate atom mapping to obtain a noisy realization of the condensed graph of reaction (CGR), which is a supergraph encompassing both the reactants and products. This CGR serves as a surrogate for the transition state graph structure. The CGR embeddings of different steps in a multi-step reaction are then passed into a transformer-guided reaction path encoder. Our experiments show that YieldNet can predict the yield more accurately than the baselines. Furthermore, the model is trained only under the distant supervision of yield values, without requiring fine-grained supervision of atom mapping.
Paperid:468
Authors:Filip Szatkowski · Yaoyue Zheng · Fei Yang · Tomasz Trzcinski · Bartłomiej Twardowski · Joost van de Weijer
Abstract:
Continual learning is crucial for applying machine learning in challenging, dynamic, and often resource-constrained environments. However, catastrophic forgetting — overwriting previously learned knowledge when new information is acquired — remains a major challenge. In this work, we examine the intermediate representations in neural network layers during continual learning and find that such representations are less prone to forgetting, highlighting their potential to accelerate computation. Motivated by these findings, we propose to use auxiliary classifiers (ACs) to enhance performance and demonstrate that integrating ACs into various continual learning methods consistently improves accuracy across diverse evaluation settings, yielding an average 10% relative gain. We also leverage the ACs to reduce the average inference cost by 10-60% without compromising accuracy, enabling the model to return predictions before computing all the layers. Our approach provides a scalable and efficient solution for continual learning.
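The early-exit idea behind the inference savings can be sketched as follows, with each auxiliary classifier modeled as a lazy callable returning a prediction and a confidence (an illustrative mechanism, not the paper's implementation):

```python
def early_exit(aux_heads, threshold=0.9):
    """Early-exit inference with auxiliary classifiers: aux_heads is a
    list of zero-argument callables, one per layer, each returning a
    (prediction, confidence) pair when invoked. We stop at the first
    head whose confidence clears the threshold, so deeper layers are
    never evaluated; otherwise we fall back to the final head's answer."""
    pred, conf = None, 0.0
    for depth, head in enumerate(aux_heads):
        pred, conf = head()
        if conf >= threshold:
            return pred, depth
    return pred, len(aux_heads) - 1
```

Because the heads are invoked lazily, an exit at depth $k$ skips the compute of all layers beyond $k$, which is where the 10-60% average cost reduction would come from.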
Paperid:469
Authors:Jerry Huang · Peng Lu · QIUHAO Zeng
Abstract:
Recent advances in natural language processing (NLP) have opened up greater opportunities to enable fine-tuned large language models (LLMs) to behave as more powerful interactive agents through improved instruction-following ability. However, how this impacts confidence calibration for reliable model output has not been fully explored. In this work, we examine various open-sourced LLMs, identifying significant calibration degradation after instruction tuning in each. Seeking a practical solution, we look towards label smoothing, which has been shown to be an effective method to regularize overconfident predictions but has yet to be widely adopted in the supervised fine-tuning (SFT) of LLMs. We first provide insight as to why label smoothing is sufficient to maintain calibration throughout the SFT process. However, settings remain where the effectiveness of smoothing is severely diminished, in particular the case of large-vocabulary LLMs (LV-LLMs). We posit that the cause stems from the model's capacity to become over-confident, which has a direct relationship with the hidden size and vocabulary size, and we justify this theoretically and experimentally. Finally, we address an outstanding issue regarding the memory footprint of the cross-entropy loss computation in the label-smoothed setting, designing a customized kernel that dramatically reduces memory consumption without sacrificing speed or performance in comparison to existing solutions for non-smoothed losses.
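For reference, label-smoothed cross-entropy moves a fraction eps of the target mass off the gold token and spreads it uniformly over the rest of the vocabulary, which caps how confident the model can usefully become. A minimal per-position sketch (standard label smoothing, not the paper's customized kernel):

```python
import math

def smoothed_cross_entropy(logits, target, eps=0.1):
    """Cross-entropy against a label-smoothed target distribution:
    probability mass 1 - eps on the gold token and eps / (V - 1)
    spread over the remaining V - 1 vocabulary entries."""
    V = len(logits)
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))  # stable log-sum-exp
    loss = 0.0
    for i, logit in enumerate(logits):
        q = (1.0 - eps) if i == target else eps / (V - 1)
        loss -= q * (logit - log_z)  # -sum_i q_i * log p_i
    return loss
```

With eps = 0 this reduces to ordinary negative log-likelihood; with eps > 0 a very confident gold logit is penalized, since the smoothed target keeps mass on non-gold tokens.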
Paperid:470
Authors:Francesco Emanuele Stradi · Anna Lunghi · Matteo Castiglioni · Alberto Marchesi · Nicola Gatti
Abstract:
We study online learning in constrained Markov decision processes (CMDPs) in which rewards and constraints may be either stochastic or adversarial. In such settings, Stradi et al. (2024) proposed the first best-of-both-worlds algorithm able to seamlessly handle stochastic and adversarial constraints, achieving optimal regret and constraint violation bounds in both cases. This algorithm suffers from two major drawbacks. First, it only works under full feedback, which severely limits its applicability in practice. Moreover, it relies on optimizing over the space of occupancy measures, which requires solving convex optimization problems, a highly inefficient task. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with bandit feedback. Specifically, when the constraints are stochastic, the algorithm achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and constraint violation, while, when they are adversarial, it attains $\widetilde{\mathcal{O}}(\sqrt{T})$ constraint violation and a tight fraction of the optimal reward. Moreover, our algorithm is based on a policy optimization approach, which is much more efficient than occupancy-measure-based methods.
Paperid:471
Authors:Zhiqiang Wang · Jianghao Wen · Jianqing Liang
Abstract:
Dynamic graph representation learning using Spiking Neural Networks (SNNs) exploits the temporal spiking behavior of neurons, offering advantages in capturing the temporal evolution and sparsity of dynamic graphs. However, existing SNN-based methods often fail to effectively capture the impact of latency in information propagation on node representations. To address this, we propose Delay-DSGN, a dynamic spiking graph neural network incorporating a learnable delay mechanism. By leveraging synaptic plasticity, the model dynamically adjusts connection weights and propagation speeds, enhancing temporal correlations and enabling historical data to influence future representations. Specifically, we introduce a Gaussian delay kernel into the neighborhood aggregation process at each time step, adaptively delaying historical information to future time steps and mitigating information forgetting. Experiments on three large-scale dynamic graph datasets demonstrate that Delay-DSGN outperforms eight state-of-the-art methods, achieving the best results in node classification tasks. We also theoretically derive the constraint conditions between the Gaussian kernel's standard deviation and size, ensuring stable training and preventing gradient explosion and vanishing issues.
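One simple way to realize a Gaussian delay kernel over past time steps is to weight historical messages by a normalized Gaussian centered at a learnable delay; a sketch under that assumption (illustrative, not the paper's exact formulation):

```python
import math

def gaussian_delay_weights(delay, sigma, size):
    """Normalized Gaussian kernel over the past `size` time steps,
    centered `delay` steps in the past. Aggregating neighbor messages
    with these weights spreads historical information over future
    steps instead of delivering it instantaneously; delay and sigma
    would be learnable in the actual model."""
    w = [math.exp(-((t - delay) ** 2) / (2.0 * sigma ** 2)) for t in range(size)]
    s = sum(w)
    return [x / s for x in w]
```

The abstract's constraint between the kernel's standard deviation and size would correspond to keeping these weights well-conditioned so gradients through them neither explode nor vanish.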
Paperid:472
Authors:Robert Busa-Fekete · Travis Dick · Claudio Gentile · Haim Kaplan · Tomer Koren · Uri Stemmer
Abstract:
We investigate Learning from Label Proportions (LLP), a partial information setting where examples in a training set are grouped into bags, and only aggregate label values in each bag are available. Despite the partial observability, the goal is still to achieve small regret at the level of individual examples. We give results on the sample complexity of LLP under square loss, showing that our sample complexity is essentially optimal. From an algorithmic viewpoint, we rely on carefully designed variants of Empirical Risk Minimization, and Stochastic Gradient Descent algorithms, combined with ad hoc variance reduction techniques. On one hand, our theoretical results improve in important ways on the existing literature on LLP, specifically in the way the sample complexity depends on the bag size. On the other hand, we validate our algorithmic solutions on several datasets, demonstrating improved empirical performance (better accuracy with fewer samples) against recent baselines.
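In the LLP setting under square loss, a natural bag-level objective compares the model's mean prediction over a bag with the observed label proportion; a minimal illustration of that setup (not necessarily the exact estimator analyzed in the paper):

```python
def bag_square_loss(predictions, bag_proportion):
    """Bag-level square loss for Learning from Label Proportions: only
    the aggregate label of each bag is observed, so the model's mean
    prediction over the bag's examples is compared to that observed
    label proportion rather than to per-example labels."""
    avg = sum(predictions) / len(predictions)
    return (avg - bag_proportion) ** 2
```

The difficulty the paper addresses is that minimizing such bag-level losses must still yield small regret at the level of individual examples, with sample complexity that degrades gracefully in the bag size.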
Paperid:473
Authors:Yusuke Yamasaki · Kenta Niwa · Takumi Fukami · Takayuki Miura · Daiki Chijiwa
Abstract:
We propose Plausible Token Amplification (PTA) to improve the accuracy of Differentially Private In-Context Learning (DP-ICL) using DP synthetic demonstrations. While Tang et al. empirically improved the accuracy of DP-ICL by limiting vocabulary space during DP synthetic demonstration generation, its theoretical basis remains unexplored. By interpreting ICL as implicit Bayesian inference on a concept underlying demonstrations, we not only provide theoretical evidence supporting Tang et al.'s empirical method but also introduce PTA, a refined method for modifying the next-token probability distribution. Through the modification, PTA highlights tokens that distinctly represent the ground-truth concept underlying the original demonstrations. As a result, generated DP synthetic demonstrations guide the Large Language Model to successfully infer the ground-truth concept, which improves the accuracy of DP-ICL. Experimental evaluations on both synthetic and real-world text-classification datasets validated the effectiveness of PTA.
Paperid:474
Authors:Chong Zhang · Peng Zhang · Mengke Li
Abstract:
Scattering characteristics of synthetic aperture radar (SAR) targets are typically related to the observed azimuth and depression angles. However, in practice, it is difficult to obtain adequate training samples at all observation angles, which probably leads to poor robustness of deep networks. In this paper, we first propose a Gamma-Distribution Principal Component Analysis ($\Gamma$PCA) model that fully accounts for the statistical characteristics of SAR data. The $\Gamma$PCA derives consistency convolution kernels to effectively capture the angle-invariant features of the same target at various attitude angles, thus alleviating deep models' sensitivity to angle changes in the SAR target recognition task. We validate the $\Gamma$PCA model on two different backbones, ResNet and ViT, and conduct multiple robustness experiments on the MSTAR benchmark dataset. The experimental results demonstrate that $\Gamma$PCA effectively enables the model to withstand the substantial distributional discrepancy caused by angle changes. Additionally, the $\Gamma$PCA convolution kernel is designed to require no parameter updates, thereby bringing no computational burden to a network.
Paperid:475
Authors:Ziyang Wu · Jingyuan Zhang · Druv Pai · XuDong Wang · Chandan Singh · Jianwei Yang · Jianfeng Gao · Yi Ma
Abstract:
DINO and DINOv2 are two model families being widely used to learn representations from unlabeled imagery data at large scales. Their learned representations often enable state-of-the-art performance for downstream tasks, such as image classification and segmentation. However, they employ many empirically motivated design choices and their training pipelines are highly complex and unstable --- many hyperparameters need to be carefully tuned to ensure that the representations do not collapse --- which poses considerable difficulty to improving them or adapting them to new domains. In this work, we posit that we can remove most of these empirically motivated idiosyncrasies in the pre-training pipelines, and only need to add an explicit coding rate term in the loss function to avoid collapse of the representations. As a result, we obtain highly simplified variants of DINO and DINOv2, which we call SimDINO and SimDINOv2, respectively. Remarkably, these simplified models are more robust to different design choices, such as network architecture and hyperparameters, and they learn even higher-quality representations, measured by performance on downstream tasks, offering a Pareto improvement over the corresponding DINO and DINOv2 models. This work highlights the potential of using simplifying design principles to improve the empirical practice of deep learning.
Paperid:476
Authors:Chenyou Fan · Chenjia Bai · Zhao Shan · Haoran He · Yang Zhang · Zhen Wang
Abstract:
Diffusion models have demonstrated their capabilities in modeling the trajectories of multiple tasks. However, existing multi-task planners or policies typically rely on task-specific demonstrations via multi-task imitation, or require task-specific reward labels to facilitate policy optimization via Reinforcement Learning (RL). They are costly due to the substantial human efforts required to collect expert data or design reward functions. To address these challenges, we aim to develop a versatile diffusion planner capable of leveraging large-scale inferior data that contains task-agnostic sub-optimal trajectories, with the ability to rapidly adapt to specific tasks. In this paper, we propose SODP, a two-stage framework that leverages Sub-Optimal data to learn a Diffusion Planner, which is generalizable to various downstream tasks. Specifically, in the pre-training stage, we train a foundation diffusion planner that extracts general planning capabilities by modeling the versatile distribution of multi-task trajectories, which can be sub-optimal but have wide data coverage. Then, for downstream tasks, we adopt RL-based fine-tuning with task-specific rewards to quickly refine the diffusion planner, which aims to generate action sequences with higher task-specific returns. Experimental results from multi-task domains including Meta-World and Adroit demonstrate that SODP outperforms state-of-the-art methods with only a small amount of data for reward-guided fine-tuning.
Paperid:477
Authors:Yuxuan Zhou · Xien Liu · Chenwei Yan · Chen Ning · Xiao Zhang · Boxun Li · Xiangling Fu · Shijin Wang · Guoping Hu · Yu Wang · Ji Wu
Abstract:
Large language models (LLMs) have demonstrated remarkable performance on various medical benchmarks, but their capabilities across different cognitive levels remain underexplored. Inspired by Bloom's Taxonomy, we propose a multi-cognitive-level evaluation framework for assessing LLMs in the medical domain in this study. The framework integrates existing medical datasets and introduces tasks targeting three cognitive levels: preliminary knowledge grasp, comprehensive knowledge application, and scenario-based problem solving. Using this framework, we systematically evaluate state-of-the-art general and medical LLMs from five prominent families: Llama, Qwen, Gemma, Phi, and GPT. Our findings reveal a significant performance decline as cognitive complexity increases across evaluated models, with model size playing a more critical role in performance at higher cognitive levels. Our study highlights the need to enhance LLMs' medical capabilities at higher cognitive levels and provides insights for developing LLMs suited to real-world medical applications.
Paperid:478
Authors:Mingjia Li · Hong Qian · Tian-Zuo Wang · Shujun Li · Min Zhang · Aimin Zhou
Abstract:
Causal discovery aims to identify causal relationships from observational data. It is crucial in various scientific fields as it helps to understand underlying mechanisms. Recently, optimization-based causal discovery methods have attracted extensive attention in the literature due to their efficiency in handling high-dimensional problems. However, we observe that optimization-based methods often perform well on certain problems but struggle with others. This paper identifies a specific characteristic of causal structural equations that determines the difficulty of identification in causal discovery and, in turn, the performance of optimization-based methods. We conduct an in-depth study of the additive noise model (ANM) and propose to further divide identifiable problems into strongly and weakly identifiable types based on the difficulty of identification. We also provide a sufficient condition to distinguish between these two categories. Inspired by these findings, this paper further proposes GENE, a generic method to address both strongly and weakly identifiable problems in a unified way under the ANM assumption. GENE adopts an order-based search framework that incorporates conditional independence tests into the order fitness evaluation, ensuring effectiveness on weakly identifiable problems. Besides, the proposed order-based optimization method restricts the dimensionality of effect variables to ensure scale invariance, a property crucial for practical applications. Experiments on both synthetic and real-world datasets demonstrate that GENE is uniquely effective in addressing weakly identifiable problems. For strongly identifiable problems, GENE remains competitive with state-of-the-art causal discovery algorithms.
Paperid:479
Authors:Zhenglin Wan · Xingrui Yu · David Bossens · Yueming LYU · Qing Guo · Flint Xiaofeng Fan · Yew Soon ONG · Ivor Tsang
Abstract:
Imitation learning (IL) has shown promise in robot locomotion but is often limited to learning a single expert policy, constraining behavior diversity and robustness in unpredictable real-world scenarios. To address this, we introduce Quality Diversity Inverse Reinforcement Learning (QD-IRL), a novel framework that integrates quality-diversity optimization with IRL methods, enabling agents to learn diverse behaviors from limited demonstrations. This work introduces Extrinsic Behavioral Curiosity (EBC), which allows agents to receive additional curiosity rewards from an external critic based on how novel the behaviors are with respect to a large behavioral archive. To validate the effectiveness of EBC in exploring diverse locomotion behaviors, we evaluate our method on multiple robot locomotion tasks. EBC improves the performance of QD-IRL instances with GAIL, VAIL, and DiffAIL by up to 185\%, 42\%, and 150\% in Humanoid environments, even surpassing expert performance by 20\%. Furthermore, we demonstrate that EBC is applicable to Quality Diversity Reinforcement Learning (QD-RL), where it substantially improves performance and provides a generic technique for diverse robot locomotion.
Paperid:480
Authors:Iulia Duta · Pietro Lió
Abstract:
The importance of higher-order relations is widely recognized in numerous real-world systems. However, annotating them is a tedious and sometimes even impossible task. Consequently, current approaches for data modelling either ignore the higher-order interactions altogether or simplify them into pairwise connections. To facilitate higher-order processing, even when a hypergraph structure is not available, we introduce SPHINX, a model that learns to infer a latent hypergraph structure in an unsupervised way, solely from the final task-dependent signal. To ensure broad applicability, we design the model to be end-to-end differentiable, capable of generating a discrete hypergraph structure compatible with any modern hypergraph networks, and easily optimizable without requiring additional regularization losses. Through extensive ablation studies and experiments conducted on four challenging datasets, we demonstrate that our model is capable of inferring suitable latent hypergraphs in both transductive and inductive tasks. Moreover, the inferred latent hypergraphs are interpretable and contribute to enhancing the final performance, outperforming existing methods for hypergraph prediction.
Paperid:481
Authors:Shi Yin · Xinyang Pan · Fengyan Wang · Lixin He
Abstract:
We propose a framework to combine strong non-linear expressiveness with strict SO(3)-equivariance in predicting the electronic-structure Hamiltonian, by exploring the mathematical relationships between SO(3)-invariant and SO(3)-equivariant quantities and their representations. The proposed framework, called TraceGrad, first constructs theoretical SO(3)-invariant trace quantities derived from the Hamiltonian targets, and uses these invariant quantities as supervisory labels to guide the learning of high-quality SO(3)-invariant features. Given that SO(3)-invariance is preserved under non-linear operations, the learning of invariant features can extensively utilize non-linear mappings, thereby fully capturing the non-linear patterns inherent in physical systems. Building on this, we propose a gradient-based mechanism to induce SO(3)-equivariant encodings of various degrees from the learned SO(3)-invariant features. This mechanism can incorporate powerful non-linear expressive capabilities into SO(3)-equivariant features whose physical dimensions are consistent with the regression targets, while theoretically preserving equivariant properties, establishing a strong foundation for predicting the Hamiltonian. Our method achieves state-of-the-art prediction accuracy across eight challenging benchmark databases for Hamiltonian prediction. Experimental results demonstrate that this approach not only improves the accuracy of Hamiltonian prediction but also significantly enhances the prediction of downstream physical quantities, and markedly accelerates traditional Density Functional Theory algorithms.
Paperid:482
Authors:Zhining Liu · Ze Yang · Xiao Lin · Ruizhong Qiu · Tianxin Wei · Yada Zhu · Hendrik Hamann · Jingrui He · Hanghang Tong
Abstract:
Time-series forecasting plays a critical role in many real-world applications. Although increasingly powerful models have been developed and achieved superior results on benchmark datasets, through a fine-grained sample-level inspection, we find that (i) no single model consistently outperforms others across different test samples, but instead (ii) each model excels in specific cases. These findings prompt us to explore how to adaptively leverage the distinct strengths of various forecasting models for different samples. We introduce TimeFuse, a framework for collective time-series forecasting with sample-level adaptive fusion of heterogeneous models. TimeFuse utilizes meta-features to characterize input time series and trains a learnable fusor to predict optimal model fusion weights for any given input. The fusor can leverage samples from diverse datasets for joint training, allowing it to adapt to a wide variety of temporal patterns and thus generalize to new inputs, even from unseen datasets. Extensive experiments demonstrate the effectiveness of TimeFuse in various long-/short-term forecasting tasks, achieving near-universal improvement over the state-of-the-art individual models.
Paperid:483
Authors:Junyan Li · Yang Zhang · Muhammad Yusuf Hassan · Talha Chafekar · Tianle Cai · Zhile Ren · Pengsheng Guo · Foroozan Karimzadeh · Colorado Reed · Chong Wang · Chuang Gan
Abstract:
Large Language Models (LLMs) are increasingly used in applications requiring long context lengths, but the key-value (KV) cache often becomes a memory bottleneck on GPUs as context lengths grow. To address this, we propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. First, we leverage additive quantization by introducing a lightweight encoder and codebook to compress the KV cache, which can then be decoded with a simple matrix multiplication. Second, to tackle the high computational costs during decoding, we design the codebook to be commutative with Rotary Position Embedding (RoPE), and utilize an Expectation-Maximization (EM) algorithm to learn the codebook. This enables efficient integration of decoding into the self-attention mechanism, significantly reducing computational overhead. Our approach achieves superior accuracy through additive quantization while lowering computational costs with our RoPE-commutative codebook. Experiments on long-context benchmarks and GSM8K demonstrate that our method reduces FP16 KV cache size by 87.5% while maintaining higher accuracy than state-of-the-art KV cache quantization methods. Remarkably, it enables 1-bit quantization of the KV cache with minimal accuracy degradation, making it possible to run the LLaMA-3.1 8B model with a 500K context length on a single 4090 GPU.
Paperid:484
Authors:Qian Peng · Yajie Bao · Haojie Ren · Zhaojun Wang · Changliang Zou
Abstract:
Conformal prediction is a powerful tool for constructing prediction intervals for black-box models, providing a finite sample coverage guarantee for exchangeable data. However, this exchangeability is compromised when some entries of the test feature are contaminated, such as in the case of cellwise outliers. To address this issue, this paper introduces a novel framework called *detect-then-impute conformal prediction*. This framework first employs an outlier detection procedure on the test feature and then utilizes an imputation method to fill in those cells identified as outliers. To quantify the uncertainty in the processed test feature, we adaptively apply the detection and imputation procedures to the calibration set, thereby constructing exchangeable features for the conformal prediction interval of the test label. We develop two practical algorithms, $\texttt{PDI-CP}$ and $\texttt{JDI-CP}$, and provide a distribution-free coverage analysis under some commonly used detection and imputation procedures. Notably, $\texttt{JDI-CP}$ achieves a finite sample $1-2\alpha$ coverage guarantee. Numerical experiments on both synthetic and real datasets demonstrate that our proposed algorithms exhibit robust coverage properties and comparable efficiency to the oracle baseline.
Paperid:485
Authors:Firas Laakom · Haobo Chen · Yuheng Bu · Jürgen Schmidhuber
Abstract:
Despite substantial progress in promoting fairness in high-stakes applications using machine learning models, existing methods often modify the training process—such as through regularizers or other interventions—but lack formal guarantees that fairness achieved during training will generalize to unseen data. Although overfitting with respect to prediction performance has been extensively studied, overfitting in terms of fairness loss has received far less attention. In this paper, we propose a theoretical framework for analyzing fairness generalization error through an information-theoretic lens. Our novel bounding technique is based on the Efron–Stein inequality, which allows us to derive tight information-theoretic fairness generalization bounds with both Mutual Information (MI) and Conditional Mutual Information (CMI). Our empirical results validate the tightness and practical relevance of these bounds across diverse fairness-aware learning algorithms. Our framework offers valuable insights to guide the design of algorithms improving fairness generalization.
Paperid:486
Authors:Hongcheng Guo · Yue Wang · Shaosheng Cao · Fei zhao · Boyang Wang · Lei Li · Liang Chen · Xinze Lyu · Zhe Xu · Yao Hu · Zhoujun Li
Abstract:
With the rapid advancement of Social Networking Services (SNS), the need for intelligent and efficient interaction within diverse platforms has become more crucial. Large Language Models (LLMs) play an important role in SNS as they possess the potential to revolutionize user experience, content generation, and communication dynamics. However, recent studies focus on isolated SNS tasks rather than a comprehensive evaluation. In this paper, we introduce SNS-Bench, specially constructed for assessing the abilities of large language models on different \textbf{S}ocial \textbf{N}etworking \textbf{S}ervices, with a wide range of SNS-related information. SNS-Bench encompasses 8 different tasks, such as note classification, query-content relevance, and highlight word generation in comments. In total, 6,658 questions on social media text, including subjective, single-choice, and multiple-choice questions, are included in SNS-Bench. We then evaluate the performance of more than 25 diverse current LLMs on SNS-Bench. Models of different sizes exhibit performance variations, yet adhere to the scaling law. Moreover, we hope to provide insights that help revolutionize the techniques of social networking services with LLMs.
Paperid:487
Authors:Dooho Lee · Myeong Kong · Sagad Hamid · Cheonwoo Lee · Jaemin Yoo
Abstract:
We revisit DropEdge, a data augmentation technique for GNNs which randomly removes edges to expose diverse graph structures during training. While being a promising approach to effectively reduce overfitting on specific connections in the graph, we observe that its potential performance gain in supervised learning tasks is significantly limited. To understand why, we provide a theoretical analysis showing that the limited performance of DropEdge comes from the fundamental limitation that exists in many GNN architectures. Based on this analysis, we propose Aggregation Buffer, a parameter block specifically designed to improve the robustness of GNNs by addressing the limitation of DropEdge. Our method is compatible with any GNN model, and shows consistent performance improvements on multiple datasets. Moreover, our method effectively addresses well-known problems such as degree bias or structural disparity as a unifying solution. Code and datasets are available at https://anonymous.4open.science/r/AggregationBuffer-46B4.
Paperid:488
Authors:Guy Hacohen · Tinne Tuytelaars
Abstract:
Catastrophic forgetting - the tendency of neural networks to forget previously learned data when learning new information - remains a major challenge in continual learning. In this work, we take a behavioral perspective, observing a connection between learning speed and forgetting: examples learned more quickly tend to be more resistant to forgetting. We then study replay-based continual learning and show that the choice of which examples populate the replay buffer - specifically, whether they were learned quickly or slowly - significantly impacts forgetting. Building on these insights, we introduce Speed-Based Sampling (SBS), a simple yet general strategy that selects replay examples based on their learning speed. SBS integrates seamlessly into existing buffer-based methods and consistently improves performance across a wide range of competitive continual learning benchmarks, advancing state-of-the-art results. Our findings highlight the importance of considering example susceptibility to forgetting when designing continual learning algorithms.
Paperid:489
Authors:Yi Xie · Jaedong Hwang · Carlos Brody · David Tank · Ila R. Fiete
Abstract:
Brains excel at robust decision-making and data-efficient learning. Understanding the architectures and dynamics underlying these capabilities can inform inductive biases for deep learning. We present a multi-region brain model that explores the normative role of structured memory circuits in a spatially embedded binary decision-making task. We counterfactually compare the learning performance and neural representations of reinforcement learning (RL) agents with brain models of different interaction architectures between grid and place cells in the entorhinal cortex and hippocampus, coupled with an action-selection cortical recurrent neural network. We demonstrate that a specific architecture--where grid cells receive and jointly encode abstract velocity signals for self-movement and decision evidence increments--optimizes learning efficiency while best reproducing experimental observations relative to alternative architectures. Our findings therefore suggest brain-inspired structured architectures for efficient RL. Importantly, the models make novel, testable predictions about the organization and information flow within the entorhinal-hippocampal-neocortical circuitry: we predict that grid cells must conjunctively encode position and evidence for effective spatial decision-making, directly motivating new neurophysiological experiments.
Paperid:490
Authors:Yuchen Lin · Ronan Le Bras · Kyle Richardson · Ashish Sabharwal · Radha Poovendran · Peter Clark · Yejin Choi
Abstract:
We investigate the logical reasoning capabilities of Large Language Models (LLMs) and their scalability across complex deductive tasks. Using ZebraLogic, a newly developed benchmark dataset of logic grid puzzles derived from constraint satisfaction problems (CSPs), we systematically evaluate LLM performance. ZebraLogic spans a broad range of search space complexities and incorporates diverse logical constraints, providing a controlled environment to assess reasoning abilities. Our results reveal a significant decline in accuracy as problem complexity increases—a phenomenon we term the “curse of complexity.” Notably, this limitation persists even with scaling model size and inference-time computation, suggesting fundamental constraints in current LLM reasoning capabilities. Additionally, we explore strategies such as Best-of-N sampling, backtracking mechanisms, and self-verification prompts to enhance logical reasoning performance. Our findings provide critical insights into the scaling behavior of LLMs, highlight their limitations, and outline potential directions for advancing their reasoning capabilities.
Paperid:491
Authors:Aaron Mueller · Atticus Geiger · Sarah Wiegreffe · Dana Arad · Iván Arcuschin · Adam Belfki · Yik Siu Chan · Jaden Fiotto-Kaufman · Tal Haklay · Michael Hanna · Jing Huang · Rohan Gupta · Yaniv Nikankin · Hadas Orgad · Nikhil Prakash · Anja Reusch · Aruna Sankaranarayanan · Shun Shao · Alessandro Stolfo · Martin Tutek · Amir Zur · David Bau · Yonatan Belinkov
Abstract:
Progress in mechanistic interpretability has been rapid, but there exist few standards for comparing methods. How can we know whether new methods achieve real improvements? In pursuit of meaningful and lasting evaluation standards, we propose MIB, a Mechanistic Interpretability Benchmark with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or specific causal variables in neural language models. The circuit localization track compares methods that locate the model components—and connections between them—most important for performing a task (e.g., attribution patching or information flow routes). The causal variable localization track compares methods that featurize a hidden vector (e.g., sparse autoencoders or distributed alignment search) and locate mediators of a causal variable relevant to the task. Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. We also find that supervised methods for causal variable localization outperform unsupervised methods, and both outperform the simplest baseline. These findings illustrate that MIB enables meaningful comparisons of MI methods, and increases our confidence that there has been real progress in the field.
Paperid:492
Authors:Alireza Naderi · Thiziri Nait Saada · Jared Tanner
Abstract:
Attention layers are the core component of transformers, the current state-of-the-art neural network architecture. Alternatives to softmax-based attention are being explored due to its tendency to hinder effective information flow. Even at initialisation, it remains poorly understood why the propagation of signals and gradients through these random networks can be pathological, resulting in issues known as (i) vanishing/exploding gradients and (ii) rank collapse in depth, i.e. when all tokens converge to a single representation along layers. While rank collapse in depth naturally arises from repeated matrix multiplications---a common pattern across various architectures---we identify an additional and previously unknown challenge unique to softmax attention layers: (iii) rank collapse in width, which occurs as the context length increases. Using Random Matrix Theory, we conduct a rigorous analysis that uncovers a spectral gap between the two largest singular values of the attention matrix as the cause of (iii), which in turn exacerbates (i) and (ii). Building on this insight, we propose a novel yet simple practical solution to mitigate rank collapse in width by removing the outlier eigenvalue(s). Our theoretical framework offers a fresh perspective on recent practical studies, such as (Ye et al., 2024; Ali et al., 2023), whose ad hoc solutions can now be interpreted as implicit efforts to address the spectral gap issue. This work provides valuable theoretical support for ongoing large-scale empirical research, bringing theory and practice one step closer in the understanding of transformers.
Paperid:493
Authors:Gaurav Ghosal · Pratyush Maini · Aditi Raghunathan
Abstract:
Large language models are susceptible to memorizing repeated sequences, posing serious privacy concerns. Existing attempts to isolate memorized information to specific neurons have seen limited success. Through experiments in a controlled setting, we demonstrate that standard training simply does not lead to "memorization" neurons that can be easily isolated, unless the sequence is extremely atypical. Instead, we argue for a new training paradigm designed to simultaneously achieve two seemingly conflicting goals: isolation—confine memorization to specific neurons, and generalization—preserve learning across the dataset. Generalization requires a subset of shared neurons to be active across all sequences. Inspired by the dynamics of memorization, we argue that isolation can be encouraged even in the presence of shared neurons by simply activating a fixed prespecified set of memorization neurons for each sequence. Put together, we propose a scalable and practical method called sequence-tied dropout that enables perfect post-hoc isolation while preserving general model capabilities on the TinyStories dataset. To the best of our knowledge, this is the first proof-of-concept on small-scale real data showing that simultaneous generalization and isolation is in fact feasible.
Paperid:494
Authors:Tim Vieira · Benjamin LeBrun · Mario Giulianelli · Juan Luis Gastaldi · Brian DuSell · John Terilla · Timothy O'Donnell · Ryan Cotterell
Abstract:
Modern language models are internally—and mathematically—distributions over token strings rather than character strings, posing numerous challenges for programmers building user applications on top of them. For example, if a prompt is specified as a character string, it must be tokenized before passing it to the token-level language model. Thus, the tokenizer and consequent analyses are very sensitive to the specification of the prompt (e.g., if the prompt ends with a space or not). This paper presents algorithms for converting token-level language models to character-level ones. We present both exact and approximate algorithms. In the empirical portion of the paper, we benchmark the practical runtime and approximation quality. We find that—even with a small computation budget—our method is able to accurately approximate the character-level distribution (less than 0.00021 excess bits / character) at reasonably fast speeds (46.3 characters / second) on the Llama 3.1 8B language model.
Paperid:495
Authors:Emiliano Penaloza · Tianyue Zhang · Laurent Charlin · Mateo Espinosa Zarlenga
Abstract:
Concept Bottleneck Models (CBMs) propose to enhance the trustworthiness of AI systems by constraining their decisions on a set of human-understandable concepts. However, CBMs typically rely on datasets with assumedly accurate concept labels—an assumption often violated in practice, which we show can significantly degrade performance. To address this, we introduce the Concept Preference Optimization (CPO) objective, a new loss function based on Direct Preference Optimization, which effectively mitigates the negative impact of concept mislabeling on CBM performance. We provide an analysis of some key properties of the CPO objective, showing that it directly optimizes for the concept's posterior distribution, and contrast it against Binary Cross-Entropy (BCE), where we show CPO is inherently less sensitive to concept noise. We empirically confirm our analysis, finding that CPO consistently outperforms BCE on three real-world datasets with and without added label noise.
Paperid:496
Authors:Nikolaos Chaidos · Angeliki Dimitriou · Maria Lymperaiou · Giorgos Stamou
Abstract:
Despite the dominance of convolutional and transformer-based architectures in image-to-image retrieval, these models are prone to biases arising from low-level visual features, such as color. Recognizing the lack of semantic understanding as a key limitation, we propose a novel scene graph-based retrieval framework that emphasizes semantic content over superficial image characteristics. Prior approaches to scene graph retrieval predominantly rely on supervised Graph Neural Networks (GNNs), which require ground truth graph pairs derived from image captions. However, the inconsistency of caption-based supervision stemming from variable text encodings undermines retrieval reliability. To address this, we present SCENIR, a Graph Autoencoder-based unsupervised retrieval framework, which eliminates the dependence on labeled training data. Our model demonstrates superior performance across metrics and runtime efficiency, outperforming existing vision-based, multimodal, and supervised GNN approaches. We further advocate for Graph Edit Distance (GED) as a deterministic and robust ground truth measure for scene graph similarity, replacing the inconsistent caption-based alternatives for the first time in image-to-image retrieval evaluation. Finally, we validate the generalizability of our method by applying it to unannotated datasets via automated scene graph generation, while substantially contributing to advancing the state of the art in counterfactual image retrieval.
Paperid:497
Authors:Yao Xiao · Binbin Yang · Weiyan Chen · Jiahao Chen · Zijie Cao · ZiYi Dong · Xiangyang Ji · Liang Lin · Wei Ke · Pengxu Wei
Abstract:
The remarkable evolution of generative models enables the generation of high-quality and visually attractive images. Notably, these images can be perceptually indistinguishable from real photographs for human eyes, therefore attracting increasing attention to developing models for AI-generated image (AIGI) detection. Intuitively, as the visual quality of generated images improves, the difficulty of forgery detection increases accordingly. However, with a systematic study on four cutting-edge text-to-image generators, we intriguingly find that generated images with higher quality scores, as assessed by human preference models, tend to be more easily detected by existing AIGI detectors. To investigate this counterintuitive phenomenon, we study how the text condition provided to the generator and the characteristics of the generated images affect the quality scores and the accuracy of detectors. We observe that images generated from short prompts tend to achieve higher preference scores while being easier to detect. Moreover, we cluster the images based on their embeddings from preference models, and conduct regression analyses to verify that image characteristics, such as saturation, contrast, and texture richness, collectively influence both quality scores and AIGI detector accuracy. Finally, we showcase that the performance of the off-the-shelf detectors can be enhanced across diverse generators and datasets by selecting input patches based on the predicted scores of our regression models, thereby substantiating the broader applicability of our findings.
Paperid:498
Authors:Filippo Rinaldi · Giacomo Capitani · Lorenzo Bonicelli · Angelo Porrello · Donato Crisostomi · Federico Bolelli · Emanuele Rodola · ELISA FICARRA · Simone Calderara
Abstract:
Foundation models serve as the backbone for numerous specialized models developed through fine-tuning. However, when the underlying pretrained model is updated or retrained (e.g., on larger and more curated datasets), the fine-tuned model becomes obsolete, losing its utility and requiring retraining. This raises the question: is it possible to transfer fine-tuning to a new release of the model? In this work, we investigate how to transfer fine-tuning to a new checkpoint without having to re-train, in a data-free manner. To do so, we draw principles from model re-basin and provide a recipe based on weight permutations to re-base the modifications made to the original base model, often called the task vector. In particular, our approach tailors model re-basin for Transformer models, taking into account the challenges of residual connections and multi-head attention layers. Specifically, we propose a two-level method rooted in spectral theory, initially permuting the attention heads and subsequently adjusting parameters within select pairs of heads. Through extensive experiments on visual and textual tasks, we achieve the seamless transfer of fine-tuned knowledge to new pre-trained backbones without relying on a single training step or datapoint.
Paperid:499
Authors:Beepul Bharti · Mary Clemens-Sewall · Paul H. Yi · Jeremias Sulam
Abstract:
As the use of predictive machine learning algorithms increases in high-stakes decision-making, it is imperative that these algorithms are fair across sensitive groups. However, measuring and enforcing fairness in real-world applications can be challenging due to missing or incomplete sensitive group data. Proxy-sensitive attributes have been proposed as a practical and effective solution in these settings, but only for parity-based fairness notions. How to evaluate and control for fairness with missing sensitive group data under newer and more flexible frameworks, such as multiaccuracy and multicalibration, remains unexplored. In this work, we address this gap by demonstrating that in the absence of sensitive group data, proxy-sensitive attributes can provably be used to derive actionable upper bounds on the true multiaccuracy and multicalibration, providing insights into a model’s potential worst-case fairness violations. Additionally, we show that adjusting models to satisfy multiaccuracy and multicalibration across proxy-sensitive attributes can significantly mitigate these violations for the true, but unknown, demographic groups. Through several experiments on real-world datasets, we illustrate that approximate multiaccuracy and multicalibration can be achieved even when sensitive group data is incomplete or unavailable.
Paperid:500
Authors:Hong-You Chen · Zhengfeng Lai · Haotian Zhang · Xinze Wang · Marcin Eichner · Keen You · Meng Cao · Bowen Zhang · Yinfei Yang · Zhe Gan
Abstract:
CLIP has been a celebrated method for training vision encoders to generate image/text representations facilitating various applications. Recently, it has been widely adopted as the vision backbone of multimodal large language models (MLLMs). The success of CLIP relies on aligning web-crawled noisy text annotations at the image level. However, such criteria may be insufficient for downstream tasks in need of fine-grained vision representations, especially when region-level understanding is demanded of MLLMs. We improve the localization capability of CLIP with several advances. Our proposed pre-training method, Contrastive Localized Language-Image Pre-training (CLOC), complements CLIP with a region-text contrastive loss and modules. We formulate a new concept, promptable embeddings, for which the encoder produces image embeddings that are easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually-enriched and spatially-localized captioning framework to effectively generate region-text labels. By scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for recognition and retrieval tasks, and can be a drop-in replacement of CLIP to enhance MLLMs, especially on referring and grounding tasks.
Paperid:501
Authors:Amirhossein Roknilamouki · Arnob Ghosh · Ming Shi · Fatemeh Nourzad · Eylem Ekici · Ness Shroff
Abstract:
In Reinforcement Learning (RL), tasks with instantaneous hard constraints present significant challenges, particularly when the decision space is non-convex or non-star-convex. This issue is especially relevant in domains like autonomous vehicles and robotics, where constraints such as collision avoidance often take a non-convex form. In this paper, we establish a regret bound of $\tilde{\mathcal{O}}\bigl(\bigl(1 + \tfrac{1}{\tau}\bigr) \sqrt{\log\bigl(\tfrac{1}{\tau}\bigr) d^3 H^4 K} \bigr)$, applicable to both star-convex and non-star-convex cases, where $d$ is the feature dimension, $H$ the episode length, $K$ the number of episodes, and $\tau$ the safety threshold. Moreover, the violation of safety constraints is zero with high probability throughout the learning process. A key technical challenge in these settings is bounding the covering number of the value-function class, which is essential for achieving value-aware uniform concentration in model-free function approximation. For the star-convex setting, we develop a novel technique called $\textit{Objective–Constraint Decomposition}$ (OCD) to properly bound the covering number. This result also resolves an error in a previous work on constrained RL. In non-star-convex scenarios, where the covering number can become infinitely large, we propose a two-phase algorithm, Non-Convex Safe Least Squares Value Iteration (NCS-LSVI), which first reduces uncertainty about the safe set by playing a known safe policy. After that, it carefully balances exploration and exploitation to achieve the regret bound. Finally, numerical simulations on an autonomous driving scenario demonstrate the effectiveness of NCS-LSVI.
Paperid:502
Authors:Avery Ma · Yangchen Pan · Amir-massoud Farahmand
Abstract:
Many-shot jailbreaking circumvents the safety alignment of large language models by exploiting their ability to process long input sequences. To achieve this, the malicious target prompt is prefixed with hundreds of fabricated conversational turns between the user and the model. These fabricated exchanges are randomly sampled from a pool of malicious questions and responses, making it appear as though the model has already complied with harmful instructions. In this paper, we present PANDAS: a hybrid technique that improves many-shot jailbreaking by modifying these fabricated dialogues with Positive Affirmations, Negative Demonstrations, and an optimized Adaptive Sampling method tailored to the target prompt's topic. Extensive experiments on AdvBench and HarmBench, using state-of-the-art LLMs, demonstrate that PANDAS significantly outperforms baseline methods in long-context scenarios. Through an attention analysis, we provide insights on how long-context vulnerabilities are exploited and show how PANDAS further improves upon many-shot jailbreaking.
Paperid:503
Authors:Gwanhyeong Koo · Sunjae Yoon · Younghwan Lee · Ji Woo Hong · Chang Yoo
Abstract:
Drag-based editing allows precise object manipulation through point-based control, offering user convenience. However, current methods often suffer from a geometric inconsistency problem by focusing exclusively on matching user-defined points, neglecting the broader geometry and leading to artifacts or unstable edits. We propose FlowDrag, which leverages geometric information for more accurate and coherent transformations. Our approach constructs a 3D mesh from the image, using an energy function to guide mesh deformation based on user-defined drag points. The resulting mesh displacements are projected into 2D and incorporated into a UNet denoising process, enabling precise handle-to-target point alignment while preserving structural integrity. Additionally, existing drag-editing benchmarks provide no ground truth, making it difficult to assess how accurately the edits match the intended transformations. To address this, we present the VFD (VidFrameDrag) benchmark dataset, which provides ground-truth frames using consecutive shots in a video dataset. FlowDrag outperforms existing drag-based editing methods on both VFD Bench and DragBench.
Paperid:504
Authors:Shriya Bhatija · Paul-David Zuercher · Jakob Thumm · Thomas Bohné
Abstract:
In decision-making problems, the outcome of an intervention often depends on the causal relationships between system components and is highly costly to evaluate. In such settings, causal Bayesian optimization (CBO) can exploit the causal relationships between the system variables and sequentially perform interventions to approach the optimum with minimal data. Extending CBO to handle multiple objectives, we propose multi-objective causal Bayesian optimization (MO-CBO), a paradigm for identifying Pareto-optimal interventions within a known causal graph. We first derive a graphical characterization for potentially optimal sets of variables to intervene upon. Showing that any MO-CBO problem can be decomposed into several traditional multi-objective optimization tasks, we then introduce an algorithm that sequentially balances exploration across these tasks. The proposed method is validated on both synthetic and real-world causal graphs, demonstrating its superiority over traditional (non-causal) multi-objective Bayesian optimization in settings where causal information is available.
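For intuition on the Pareto-optimality criterion underlying such methods, here is a minimal filter over candidate intervention outcomes, where lower cost is better in every objective (the helper `pareto_mask` is an illustrative sketch, not the paper's algorithm):

```python
import numpy as np

def pareto_mask(costs):
    # costs: (n, k) array of n candidates over k objectives; lower is
    # better. Candidate i is Pareto-optimal iff no other candidate is
    # <= in all objectives and strictly < in at least one.
    n = costs.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        dominators = (np.all(costs <= costs[i], axis=1)
                      & np.any(costs < costs[i], axis=1))
        if dominators.any():
            mask[i] = False
    return mask

# Four candidate interventions evaluated on two objectives.
costs = np.array([[1.0, 2.0], [2.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
front = pareto_mask(costs)  # only the first two are non-dominated
```

The set of interventions surviving this filter is exactly the Pareto front that MO-CBO-style methods aim to identify with as few costly evaluations as possible.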
Paperid:505
Authors:Kai Lu · Shixiong Xu · Jinqiu Li · Kun Ding · Gaofeng Meng
Abstract:
Feedback from peer review is essential to improve the quality of scientific articles. However, at present, many manuscripts do not receive sufficient external feedback for refinement before or during submission. Therefore, a system capable of providing detailed and professional feedback is crucial for enhancing research efficiency. In this paper, we have compiled the largest dataset of paper reviews to date by collecting historical open-access papers and their corresponding review comments and standardizing them using an LLM. We then developed a multi-agent system that mimics real human review processes, based on LLMs. This system, named Agent Reviewers, includes the innovative introduction of multimodal reviewers to provide feedback on the visual elements of papers. Additionally, a shared memory pool that stores historical papers' metadata is preserved, which supplies reviewer agents with background knowledge from different fields. Our system is evaluated using ICLR 2024 papers and achieves superior performance compared to existing AI-based review systems. Comprehensive ablation studies further demonstrate the effectiveness of each module and agent in this system.
Paperid:506
Authors:Xiaowen Ma · Zhen-Liang Ni · Shuai Xiao · Xinghao Chen
Abstract:
In long-term time series forecasting, different variables often influence the target variable over distinct time intervals, a challenge known as the multi-delay issue. Traditional models typically process all variables or time points uniformly, which limits their ability to capture complex variable relationships and obtain non-trivial time representations. To address this issue, we propose TimePro, an innovative Mamba-based model that constructs variate- and time-aware hyper-states. Unlike conventional approaches that merely transfer plain states across variables or the time dimension, TimePro preserves the fine-grained temporal features of each variate token and adaptively selects the focused time points to tune the plain state. The reconstructed hyper-state is able to perceive both variable relationships and salient temporal information, which helps the model make accurate forecasts. In experiments, TimePro achieves competitive performance on eight real-world long-term forecasting benchmarks with satisfactory linear complexity.
Paperid:507
Authors:Meshi Bashari · Matteo Sesia · Yaniv Romano
Abstract:
Conformal prediction is a flexible framework for calibrating machine learning predictions, providing distribution-free statistical guarantees. In outlier detection, this calibration relies on a reference set of labeled inlier data to control the type-I error rate. However, obtaining a perfectly labeled inlier reference set is often unrealistic, and a more practical scenario involves access to a contaminated reference set containing a small fraction of outliers. This paper analyzes the impact of such contamination on the validity of conformal methods. We prove that under realistic, non-adversarial settings, calibration on contaminated data yields conservative type-I error control, shedding light on the inherent robustness of conformal methods. This conservativeness, however, typically results in a loss of power. To alleviate this limitation, we propose a novel, active data-cleaning framework that leverages a limited labeling budget and an outlier detection model to selectively annotate data points in the contaminated reference set that are suspected as outliers. By removing only the annotated outliers in this "suspicious" subset, we can effectively enhance power while mitigating the risk of inflating the type-I error rate, as supported by our theoretical analysis. Experiments on real datasets validate the conservative behavior of conformal methods under contamination and show that the proposed data-cleaning strategy improves power without sacrificing validity.
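The calibration step at the heart of this setting can be sketched with a conformal p-value for a single test point (a toy with synthetic Gaussian scores; contaminating `cal_scores` with outliers would inflate the p-values, i.e., make the test conservative, in line with the paper's analysis):

```python
import numpy as np

rng = np.random.default_rng(0)

# Detector scores: higher = more outlier-like. The calibration
# (reference) set is assumed to contain inliers.
cal_scores = rng.normal(0.0, 1.0, size=1000)
test_score = 3.5

# Conformal p-value: smoothed rank of the test score among the
# calibration scores. With clean calibration data, rejecting when
# p_value <= alpha controls the type-I error at level alpha.
p_value = (1 + np.sum(cal_scores >= test_score)) / (1 + len(cal_scores))

alpha = 0.05
is_outlier = p_value <= alpha
```

If some calibration points are themselves outliers, `cal_scores` shifts upward, p-values grow, and the test rejects less often, which is conservative for type-I error but costs power.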
Paperid:508
Authors:Evan Sidrow · Alexandre Bouchard-Côté · Lloyd Elliott
Abstract:
Bayesian phylogenetics requires accurate and efficient approximation of posterior distributions over trees. In this work, we develop a variational Bayesian approach for ultrametric phylogenetic trees. We present a novel variational family based on coalescent times of a single-linkage clustering and derive a closed-form density of the resulting distribution over trees. Unlike existing methods for ultrametric trees, our method performs inference over all of tree space, requires no Markov chain Monte Carlo subroutines, and uses a differentiable variational family. Through experiments on benchmark genomic datasets and an application to SARS-CoV-2, we demonstrate that our method achieves competitive accuracy while requiring significantly fewer gradient evaluations than existing state-of-the-art techniques.
Paperid:509
Authors:Xiong Wang · Yangze Li · Chaoyou Fu · Yike Zhang · Yunhang Shen · Lei Xie · Ke Li · Xing Sun · Long Ma
Abstract:
GPT-4o's excellent duplex speech interaction ability has given users an impressive experience. Researchers have recently proposed several multi-modal LLMs to achieve user-agent speech-to-speech conversations. In this paper, we propose a novel speech-text multimodal LLM architecture called Freeze-Omni, and our main contribution is that the speech input and output modalities can be easily connected to a textual LLM while keeping the LLM's parameters frozen throughout the training process. We effectively ensure that the intelligence of Freeze-Omni in the speech modality is on par with that in the text modality of its backbone LLM, while achieving low-latency end-to-end spoken responses. In addition, we also designed a method to achieve duplex dialogue ability through multi-task training, giving Freeze-Omni a more natural style of dialogue between users and agents. In summary, Freeze-Omni holds great potential to conduct speech-to-speech dialogue based on a multimodal LLM under the condition of a frozen LLM, avoiding the catastrophic forgetting problem caused by limited data and training resources.
Paperid:510
Authors:Francesco Emanuele Stradi · Matteo Castiglioni · Alberto Marchesi · Nicola Gatti
Abstract:
We study online learning in constrained Markov decision processes (CMDPs) with adversarial losses and stochastic hard constraints, under bandit feedback. We consider three scenarios. In the first one, we address general CMDPs, where we design an algorithm attaining sublinear regret and cumulative positive constraints violation. In the second scenario, under the mild assumption that a policy strictly satisfying the constraints exists and is known to the learner, we design an algorithm that achieves sublinear regret while ensuring that constraints are satisfied at every episode with high probability. In the last scenario, we only assume the existence of a strictly feasible policy, which is not known to the learner, and we design an algorithm attaining sublinear regret and constant cumulative positive constraints violation. Finally, we show that in the last two scenarios, a dependence on Slater's parameter is unavoidable. To the best of our knowledge, our work is the first to study CMDPs involving both adversarial losses and hard constraints. Thus, our algorithms can deal with general non-stationary environments subject to requirements much stricter than those manageable with existing ones, enabling their adoption in a much wider range of applications.
Paperid:511
Authors:Oren Mangoubi · Neil He · Nisheeth K. Vishnoi
Abstract:
We introduce a framework for designing efficient diffusion models for $d$-dimensional symmetric-space Riemannian manifolds, including the torus, sphere, special orthogonal group and unitary group. Existing manifold diffusion models often depend on heat kernels, which lack closed-form expressions and require either $d$ gradient evaluations or exponential-in-$d$ arithmetic operations per training step. We introduce a new diffusion model for symmetric manifolds with a spatially-varying covariance, allowing us to leverage a projection of Euclidean Brownian motion to bypass heat kernel computations. Our training algorithm minimizes a novel efficient objective derived via Itô's Lemma, allowing each step to run in $O(1)$ gradient evaluations and nearly-linear-in-$d$ ($O(d^{1.19})$) *arithmetic* operations, reducing the gap between diffusions on symmetric manifolds and Euclidean space. Manifold symmetries ensure the diffusion satisfies an "average-case" Lipschitz condition, enabling accurate and efficient sample generation. Empirically, our model outperforms prior methods in training speed and improves sample quality on synthetic datasets on the torus, special orthogonal group, and unitary group.
Paperid:512
Authors:Xiulong Yuan · Hongtao Xu · Wenting Shen · Ang Wang · Xiafei Qiu · Jie Zhang · Yuqiong Liu · Bowen Yu · Junyang Lin · Mingzhen Li · Weile Jia · Yong Li · Wei Lin
Abstract:
Long context fine-tuning of large language models (LLMs) involves training on datasets that are predominantly composed of short sequences and a small proportion of longer sequences. However, existing approaches overlook this long-tail distribution and employ training strategies designed specifically for long sequences. Moreover, these approaches also fail to address the challenges posed by variable sequence lengths during distributed training, such as load imbalance in data parallelism and severe pipeline bubbles in pipeline parallelism. These issues lead to suboptimal training performance and poor GPU resource utilization. To tackle these problems, we propose a chunk-centric training method named ChunkFlow. ChunkFlow reorganizes input sequences into uniformly sized chunks by consolidating short sequences and splitting longer ones. This approach achieves optimal computational efficiency and balance among training inputs. Additionally, ChunkFlow incorporates a state-aware chunk scheduling mechanism to ensure that the peak memory usage during training is primarily determined by the chunk size rather than the maximum sequence length in the dataset. Integrating this scheduling mechanism with existing pipeline scheduling algorithms further enhances the performance of distributed training. Experimental results demonstrate that, compared with Megatron-LM, ChunkFlow can be up to 4.53x faster in the long context fine-tuning of LLMs. Furthermore, we believe that ChunkFlow serves as an effective solution for a broader range of scenarios, such as long context continual pre-training, where datasets contain variable-length sequences.
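The chunk reorganization idea can be sketched as split-then-pack (first-fit-decreasing is an illustrative packing choice here; the actual ChunkFlow scheduler is more elaborate):

```python
def pack_into_chunks(seq_lengths, chunk_size):
    """Split sequences longer than chunk_size, then greedily pack the
    resulting (sequence_id, length) pieces into chunks holding at most
    chunk_size tokens each."""
    pieces = []
    for sid, length in enumerate(seq_lengths):
        while length > chunk_size:          # split long sequences
            pieces.append((sid, chunk_size))
            length -= chunk_size
        if length:
            pieces.append((sid, length))
    chunks = []
    # First-fit-decreasing bin packing of the pieces.
    for piece in sorted(pieces, key=lambda p: -p[1]):
        for chunk in chunks:
            if sum(l for _, l in chunk) + piece[1] <= chunk_size:
                chunk.append(piece)
                break
        else:
            chunks.append([piece])
    return chunks

# Long-tail example: one long sequence, several short ones.
chunks = pack_into_chunks([10, 3, 3, 2], chunk_size=8)
```

Because every chunk is bounded by `chunk_size`, peak activation memory is governed by the chunk size rather than the longest sequence in the dataset, which is the property the scheduling mechanism relies on.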
Paperid:513
Authors:Chenglong Wang · Yang Gan · Yifu Huo · Yongyu Mu · Qiaozhi He · MuRun Yang · Bei Li · Tong Xiao · Chunliang Zhang · Tongran Liu · Jingbo Zhu
Abstract:
In aligning large language models (LLMs), reward models have played an important role, but are standardly trained as discriminative models and rely only on labeled human preference data. In this paper, we explore methods that train reward models using both unlabeled and labeled data. Building on the generative models in LLMs, we develop a generative reward model that is first trained via large-scale unsupervised learning and then fine-tuned via supervised learning. We also show that by using label smoothing, we are in fact optimizing a regularized pairwise ranking loss. This result, in turn, provides a new view of training reward models, which links generative models and discriminative models under the same class of training objectives. The outcome of these techniques is a foundation reward model, which can be applied to a wide range of tasks with little or no further fine-tuning effort. Extensive experiments show that this model generalizes well across several tasks, including response ranking, reinforcement learning from human feedback, and task adaptation with fine-tuning, achieving significant performance improvements over several strong baseline models.
Paperid:514
Authors:Dimitri von Rütte · Janis Fluri · Yuhui Ding · Antonio Orvieto · Bernhard Schölkopf · Thomas Hofmann
Abstract:
While state-of-the-art language models achieve impressive results through next-token prediction, they have inherent limitations such as the inability to revise already generated tokens. This has prompted exploration of alternative approaches such as discrete diffusion. However, masked diffusion, which has emerged as a popular choice due to its simplicity and effectiveness, reintroduces this inability to revise words. To overcome this, we generalize masked diffusion and derive the theoretical backbone of a family of general interpolating discrete diffusion (GIDD) processes offering greater flexibility in the design of the noising processes. Leveraging a novel diffusion ELBO, we achieve compute-matched state-of-the-art performance in diffusion language modeling. Exploiting GIDD's flexibility, we explore a hybrid approach combining masking and uniform noise, leading to improved sample quality and unlocking the ability for the model to correct its own mistakes, an area where autoregressive models notoriously have struggled. Our code and models will be released upon acceptance.
Paperid:515
Authors:Shenao Zhang · Zhihan Liu · Boyi Liu · Yufeng Zhang · Yingxiang Yang · Yongfei Liu · Liyu Chen · Tao Sun · Zhaoran Wang
Abstract:
Preference alignment in Large Language Models (LLMs) has significantly improved their ability to adhere to human instructions and intentions. However, existing direct alignment algorithms primarily focus on relative preferences and often overlook the qualitative aspects of responses, despite having access to preference data that includes reward scores from judge models during AI feedback. Striving to maximize the implicit reward gap between the chosen and the slightly inferior rejected responses can cause overfitting and unnecessary unlearning of the high-quality rejected responses. The unawareness of the reward scores also drives the LLM to indiscriminately favor the low-quality chosen responses and fail to generalize to optimal responses that are sparse in data. To overcome these shortcomings, our study introduces reward-conditioned LLM policies that discern and learn from the entire spectrum of response quality within the dataset, helping extrapolate to more optimal regions. We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset. The experimental results across various instruction-following and academic benchmarks demonstrate that our approach consistently boosts the performance of DPO by a considerable margin across diverse models. When applying our method to on-policy data, the resulting DPO model outperforms various baselines and achieves state-of-the-art results. Through comprehensive ablation studies, we demonstrate that our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere data expansion.
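The relabeling step can be sketched as conditioning each response on its judge score (the `<score=...>` tag format and the `reward_augment` helper are hypothetical illustrations, not the paper's exact conditioning scheme):

```python
def reward_augment(preference_pairs):
    # Turn each (prompt, chosen, rejected, r_chosen, r_rejected) tuple
    # into two score-conditioned examples, so the policy learns from the
    # full spectrum of response quality rather than only the binary
    # chosen/rejected signal.
    out = []
    for prompt, chosen, rejected, r_c, r_r in preference_pairs:
        out.append((f"<score={r_c}> {prompt}", chosen))
        out.append((f"<score={r_r}> {prompt}", rejected))
    return out

augmented = reward_augment(
    [("Explain DPO.", "clear answer", "vague answer", 9, 4)]
)
```

At inference time, conditioning on a high target score then steers generation toward the high-quality end of the learned spectrum.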
Paperid:516
Authors:Yuanze Wang · Yichao Yan · Jin · Shiming Song · Yilan Huang · Xingdong Sheng · Dianxi Shi
Abstract:
Visual localization aims to predict the absolute camera pose for a single query image. However, predominant methods focus on single-camera images and scenes with limited appearance variations, limiting their applicability to cross-domain scenes commonly encountered in real-world applications. Furthermore, the long-tail distribution of cross-domain datasets poses additional challenges for visual localization. In this work, we propose a novel cross-domain data generation framework to enhance visual localization methods. To achieve this, we first construct a cross-domain 3DGS to accurately model photometric variations and mitigate the interference of dynamic objects in large-scale scenes. We introduce a text-guided image editing model to enhance data diversity for addressing the long-tail distribution problem and design an effective fine-tuning strategy for it. Then, we develop an anchor-based method to generate high-quality datasets for visual localization. Finally, we introduce positional attention to address data ambiguities in cross-camera images. Extensive experiments show that our method achieves state-of-the-art accuracy, outperforming existing cross-domain visual localization methods by an average of 59% across all domains.
Paperid:517
Authors:Yuhang Cai · Kangjie Zhou · Jingfeng Wu · Song Mei · Michael Lindsey · Peter Bartlett
Abstract:
We establish the asymptotic implicit bias of gradient descent (GD) for generic non-homogeneous deep networks under exponential loss. Specifically, we characterize three key properties of GD iterates starting from a sufficiently small empirical risk, where the threshold is determined by a measure of the network's non-homogeneity. First, we show that a normalized margin induced by the GD iterates increases nearly monotonically. Second, we prove that while the norm of the GD iterates diverges to infinity, the iterates themselves converge in direction. Finally, we establish that this directional limit satisfies the Karush–Kuhn–Tucker (KKT) conditions of a margin maximization problem. Prior works on implicit bias have focused exclusively on homogeneous networks; in contrast, our results apply to a broad class of non-homogeneous networks satisfying a mild near-homogeneity condition. In particular, our results apply to networks with residual connections and non-homogeneous activation functions, thereby resolving an open problem posed by \citet{ji2020directional}.
Paperid:518
Authors:Zixiang Ai · Zichen Liu · Jiahuan Zhou
Abstract:
Vision GNN (ViG) demonstrates superior performance by representing images as graph structures, providing a more natural way to capture irregular semantic patterns beyond traditional grid- or sequence-based representations. To efficiently adapt ViG to downstream tasks, parameter-efficient fine-tuning techniques like visual prompting become increasingly essential. However, existing prompting methods are primarily designed for Transformer-based models, neglecting the rich topological relationships among nodes and edges in graph-based representations, limiting their capacity to model complex semantics. In this paper, we propose Vision Graph Prompting (VGP), a novel framework tailored for vision graph structures. Our core insight reveals that semantically connected components in the graph exhibit low-rank properties. Building on this observation, we introduce a semantic low-rank prompting method that decomposes low-rank semantic features and integrates them with prompts on vision graph topologies, capturing both global structural patterns and fine-grained semantic dependencies. Extensive experiments demonstrate our method significantly improves ViG’s transfer performance on diverse downstream tasks, achieving results comparable to full fine-tuning while maintaining parameter efficiency.
Paperid:519
Authors:Michael Crawshaw · Blake Woodworth · Mingrui Liu
Abstract:
Existing analysis of Local (Stochastic) Gradient Descent for heterogeneous objectives requires stepsizes $\eta \leq 1/K$ where $K$ is the communication interval, which ensures monotonic decrease of the objective. In contrast, we analyze Local Gradient Descent for logistic regression with separable, heterogeneous data using any stepsize $\eta > 0$. With $R$ communication rounds and $M$ clients, we show convergence at a rate $\mathcal{O}(1/\eta K R)$ after an initial unstable phase lasting for $\widetilde{\mathcal{O}}(\eta K M)$ rounds. This improves upon the existing $\mathcal{O}(1/R)$ rate for general smooth, convex objectives. Our analysis parallels the single machine analysis of Wu et al. (2024) in which instability is caused by extremely large stepsizes, but in our setting another source of instability is large local updates with heterogeneous objectives.
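The setting can be simulated directly: each client runs K local full-batch gradient steps on its own logistic loss over separable data, and the models are averaged each communication round (a toy sketch with an aggressive stepsize eta = 1 > 1/K; the data and hyperparameters are illustrative, not from the paper):

```python
import numpy as np

def local_gd_logistic(client_data, eta, K, R):
    # Local GD: K local GD steps on each client's logistic loss,
    # followed by parameter averaging, for R communication rounds.
    d = client_data[0][0].shape[1]
    w = np.zeros(d)
    for _ in range(R):
        local_models = []
        for X, y in client_data:
            w_m = w.copy()
            for _ in range(K):
                # Gradient of mean_i log(1 + exp(-y_i <x_i, w>)).
                grad = -(X * (y / (1.0 + np.exp(y * (X @ w_m))))[:, None]).mean(0)
                w_m -= eta * grad
            local_models.append(w_m)
        w = np.mean(local_models, axis=0)
    return w

# Two heterogeneous clients whose pooled 1-D data is linearly separable.
clients = [(np.array([[1.0], [2.0]]), np.array([1.0, 1.0])),
           (np.array([[-1.0], [-2.0]]), np.array([-1.0, -1.0]))]
w = local_gd_logistic(clients, eta=1.0, K=4, R=50)
```

On separable data the iterates eventually classify every point correctly even though eta exceeds the classical 1/K threshold, mirroring the stable phase the analysis describes after the initial transient.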
Paperid:520
Authors:Haohan Zou · Jie Feng · Hao Zhao · Yuanyuan Shi
Abstract:
Despite advances in learning-based methods, finding valid Lyapunov functions for nonlinear dynamical systems remains challenging. Current neural network approaches face two main issues: challenges in scalable verification and limited interpretability. To address these, we propose an end-to-end framework using transformers to construct analytical Lyapunov functions (local), which simplifies formal verification, enhances interpretability, and provides valuable insights for control engineers. Our framework consists of a transformer-based trainer that generates candidate Lyapunov functions and a falsifier that verifies candidate expressions and refines the model via risk-seeking policy gradient. Unlike Alfarano et al. (2024), which utilizes pre-training and seeks global Lyapunov functions for low-dimensional systems, our model is trained from scratch via reinforcement learning and succeeds in finding local Lyapunov functions for high-dimensional and non-polynomial systems. Given the analytical nature of the candidates, we employ efficient optimization methods for falsification during training and formal verification tools for the final verification. We demonstrate the efficiency of our approach on a range of nonlinear dynamical systems with up to ten dimensions and show that it can discover Lyapunov functions not previously identified in the control literature.
Paperid:521
Authors:Julia Gusak · Xunyi Zhao · Théotime Le Hellard · Zhe LI · Lionel Eyraud-Dubois · Olivier Beaumont
Abstract:
Training modern neural networks on memory-limited GPUs is challenging, as storing intermediate activations often exceeds available memory. Re-materialization, a technique that preserves exact computations, addresses this by selectively recomputing activations instead of storing them. However, existing methods either fail to scale, lack generality, or introduce excessive execution overhead. We introduce HiRemate, a hierarchical re-materialization framework that recursively partitions large computation graphs, applies optimized solvers at multiple levels, and merges solutions into a global efficient training schedule. This enables scalability to significantly larger graphs than prior ILP-based methods while keeping runtime overhead low. Designed for single-GPU models and activation re-materialization, HiRemate extends the feasibility of training networks with thousands of graph nodes, surpassing prior methods in both efficiency and scalability. Experiments on various types of networks yield up to 50-70% memory reduction with only 10-15% overhead, closely matching optimal solutions while significantly reducing solver time. Seamlessly integrating with PyTorch Autograd, HiRemate requires almost no code change to use, and will be open-sourced, enabling broad adoption in memory-constrained deep learning.
Paperid:522
Authors:Ningyuan Huang · Miguel Sarabia · Abhinav Moudgil · Pau Rodriguez · Luca Zappella · Federico Danieli
Abstract:
State-Space Models (SSMs), and particularly Mamba, have recently emerged as a promising alternative to Transformers. Mamba introduces input selectivity to its SSM layer (S6) and incorporates convolution and gating into its Mamba block. While these modifications do improve Mamba's performance over its SSM predecessors, it remains largely unclear how Mamba leverages the additional functionalities provided by input selectivity, and how these interact with the other operations in the Mamba architecture. In this work, we demystify the role of input selectivity in Mamba, investigating its impact on function approximation power, long-term memory, and associative recall capabilities. In particular: (i) We prove that the S6 layer of Mamba can represent projections onto Haar wavelets, providing an edge over the S4D layer in approximating discontinuous functions commonly arising in practice. (ii) We show how the S6 layer suffers from memory decay, and describe how it can dynamically counteract it. (iii) We provide analytical solutions to the MQAR associative recall task using the Mamba architecture with different mixers --- Mamba, Mamba-2, and S4D. We demonstrate the tightness of our theoretical constructions with empirical results on concrete tasks. Our findings offer a mechanistic understanding of Mamba and reveal opportunities for improvement.
Paperid:523
Authors:Peng Wang · Yong Li · Xiu-Shen Wei · Lin Zhao
Abstract:
Fine-grained hashing has become a powerful solution for rapid and efficient image retrieval, particularly in scenarios requiring high discrimination between visually similar categories. To enable each hash bit to correspond to specific visual attributes, we propose a novel method that harnesses learnable queries for attribute-aware hash code learning. This method deploys a tailored set of queries to capture and represent nuanced attribute-level information within the hashing process, thereby enhancing both the interpretability and relevance of each hash bit. Building on this query-based optimization framework, we incorporate an auxiliary branch to help alleviate the challenges of complex landscape optimization often encountered with low-bit hash codes. This auxiliary branch models high-order attribute interactions, reinforcing the robustness and specificity of the generated hash codes. Experimental results on benchmark datasets demonstrate that our method generates attribute-aware hash codes and consistently outperforms state-of-the-art techniques in retrieval accuracy and robustness, especially for low-bit hash codes, underscoring its potential in fine-grained image hashing tasks.
Paperid:524
Authors:Weihua Du · Yiming Yang · Sean Welleck
Abstract:
Multi-sample aggregation strategies, such as majority voting and best-of-N sampling, are widely used in contemporary large language models (LLMs) to enhance predictive accuracy across various tasks. A key challenge in this process is temperature selection, which significantly impacts model performance. Existing approaches either rely on a fixed default temperature or require labeled validation data for tuning, which are often scarce and difficult to obtain. This paper addresses the challenge of automatically identifying the (near)-optimal temperature for different LLMs using multi-sample aggregation strategies, without relying on task-specific validation data. We provide a comprehensive analysis of temperature's role in performance optimization, considering variations in model architectures, datasets, task types, model sizes, and predictive accuracy. Furthermore, we propose a novel entropy-based metric for automated temperature optimization, which consistently outperforms fixed-temperature baselines. Additionally, we incorporate a stochastic process model to enhance interpretability, offering deeper insights into the relationship between temperature and model performance.
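The ingredients named in this abstract can be sketched concretely. The minimal illustration below (helper names are ours) shows temperature-scaled sampling probabilities, majority voting over sampled answers, and the entropy of the empirical answer distribution; the latter is only a stand-in for the kind of label-free signal an entropy-based criterion can use, not the paper's actual metric:

```python
import math
from collections import Counter

def softmax_T(logits, T):
    """Temperature-scaled softmax: higher T flattens the distribution."""
    m = max(l / T for l in logits)                 # stabilize the exponentials
    exps = [math.exp(l / T - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def majority_vote(answers):
    """Aggregate multiple sampled answers by majority voting."""
    return Counter(answers).most_common(1)[0][0]

def answer_entropy(answers):
    """Shannon entropy of the empirical answer distribution.
    (Illustrative label-free signal; not the paper's exact metric.)"""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

Raising `T` spreads probability mass across answers, which in turn raises the entropy of the sampled answer set; a temperature criterion can exploit this relationship without labeled validation data.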
Paperid:525
Authors:Shyam Nuggehalli · Jifan Zhang · Lalit Jain · Robert Nowak
Abstract:
Class imbalance severely impacts machine learning performance on minority classes in real-world applications. While various solutions exist, active learning offers a fundamental fix by strategically collecting balanced, informative labeled examples from abundant unlabeled data. We introduce DIRECT, an algorithm that identifies class separation boundaries and selects the most uncertain nearby examples for annotation. By reducing the problem to one-dimensional active learning, DIRECT leverages established theory to handle batch labeling and label noise -- another common challenge in data annotation that particularly affects active learning methods. Our work presents the first comprehensive study of active learning under both class imbalance and label noise. Extensive experiments on imbalanced datasets show DIRECT reduces annotation costs by over 60% compared to state-of-the-art active learning methods and over 80% versus random sampling, while maintaining robustness to label noise.
Paperid:526
Authors:Peijie Dong · Zhenheng Tang · Xiang Liu · Lujun Li · Xiaowen Chu · Bo Li
Abstract:
Post-training compression reduces the computational and memory costs of large language models (LLMs), enabling resource-efficient deployment. However, existing compression benchmarks focus narrowly on language modeling (e.g., perplexity) and natural language understanding tasks (e.g., GLUE accuracy), ignoring the agentic capabilities—workflow, tool use/function call, long-context understanding, and real-world application. We introduce the Agent Compression Benchmark (ACBench), the first comprehensive benchmark for evaluating how compression impacts LLMs' agentic abilities. ACBench spans (1) 12 tasks across 4 capabilities (e.g., WorfBench for workflow generation, Needle-in-Haystack for long-context retrieval), (2) 4-bit quantization (GPTQ, AWQ) and 50% pruning (Wanda, SparseGPT), and (3) 15 models, including small (Gemma-2B), standard (Qwen2.5-7B), and distilled reasoning LLMs (DeepSeek-R1-Distill). Our experiments reveal compression tradeoffs: 4-bit quantization preserves workflow generation and tool use (1%--3% drop) but degrades real-world application accuracy by 10%--15%. We introduce ERank, Top-k Ranking Correlation, and Energy to systematize analysis. ACBench provides actionable insights for optimizing LLM compression in agentic scenarios, bridging the gap between algorithmic efficiency and real-world applicability.
Paperid:527
Authors:Daniel Csillag · Claudio Struchiner · Guilherme Tegoni Goedert
Abstract:
Quality statistical inference requires a sufficient amount of data, which can be missing or hard to obtain. To this end, prediction-powered inference has risen as a promising methodology, but existing approaches are largely limited to Z-estimation problems such as inference of means and quantiles. In this paper, we apply ideas of prediction-powered inference to e-values. By doing so, we inherit all the usual benefits of e-values -- such as anytime-validity, post-hoc validity and versatile sequential inference -- as well as greatly expand the set of inferences achievable in a prediction-powered manner. In particular, we show that every inference procedure that can be framed in terms of e-values has a prediction-powered counterpart, given by our method. We showcase the effectiveness of our framework across a wide range of inference tasks, from simple hypothesis testing and confidence intervals to more involved procedures for change-point detection and causal discovery, which were out of reach of previous techniques. Our approach is modular and easily integrable into existing algorithms, making it a compelling choice for practical applications.
Paperid:528
Authors:Qiyao Liang · Daoyuan Qian · Liu Ziyin · Ila R. Fiete
Abstract:
Composition—the ability to generate myriad variations from finite means—is believed to underlie powerful generalization. However, compositional generalization remains a key challenge for deep learning. A widely held assumption is that learning disentangled (factorized) representations naturally supports this kind of extrapolation. Yet, empirical results are mixed, with many generative models failing to recognize and compose factors to generate out-of-distribution (OOD) samples. In this work, we investigate a controlled 2D Gaussian "bump" generation task, demonstrating that standard generative architectures fail in OOD regions when training with partial data, even when supplied with fully disentangled (x,y) coordinates, re-entangling them through subsequent layers. By examining the model's learned kernels and manifold geometry, we show that this failure reflects a "memorization" strategy for generation through the superposition of training data rather than by combining the true factorized features. We show that models forced—through architectural modifications with regularization or curated training data—to create disentangled representations in the full-dimensional representational (pixel) space can be highly data-efficient and effective at learning to compose in OOD regions. These findings underscore that bottlenecks with factorized/disentangled representations in an abstract representation space are insufficient: the model must actively maintain or induce factorization directly in the representational space in order to achieve robust compositional generalization.
Paperid:529
Authors:Scott Emmons · Caspar Oesterheld · Vincent Conitzer · Stuart Russell
Abstract:
We study partially observable assistance games (POAGs), a model of the human-AI value alignment problem which allows the human and the AI assistant to have partial observations. Motivated by concerns of AI deception, we study a qualitatively new phenomenon made possible by partial observability: would an AI assistant ever have an incentive to interfere with the human's observations? First, we prove that sometimes an optimal assistant must take observation-interfering actions, even when the human is playing optimally, and even when there are otherwise-equivalent actions available that do not interfere with observations. Though this result seems to contradict the classic theorem from single-agent decision making that the value of perfect information is nonnegative, we resolve this seeming contradiction by developing a notion of interference defined on entire policies. This can be viewed as an extension of the classic result that the value of perfect information is nonnegative into the cooperative multiagent setting. Second, we prove that if the human is simply making decisions based on their immediate outcomes, the assistant might need to interfere with observations as a way to query the human's preferences. We show that this incentive for interference goes away if the human is playing optimally, or if we introduce a communication channel for the human to communicate their preferences to the assistant. Third, we show that if the human acts according to the Boltzmann model of irrationality, this can create an incentive for the assistant to interfere with observations. Finally, we use an experimental model to analyze tradeoffs faced by the AI assistant in practice when considering whether or not to take observation-interfering actions.
Paperid:530
Authors:Jianze Li · Jiezhang Cao · Yong Guo · Wenbo Li · Yulun Zhang
Abstract:
Diffusion models (DMs) have significantly advanced the development of real-world image super-resolution (Real-ISR), but the computational cost of multi-step diffusion models limits their application. One-step diffusion models generate high-quality images in a single sampling step, greatly reducing computational overhead and inference latency. However, most existing one-step diffusion methods are constrained by the performance of the teacher model, where poor teacher performance results in image artifacts. To address this limitation, we propose FluxSR, a novel one-step diffusion Real-ISR technique based on flow matching models. We use the state-of-the-art diffusion model FLUX.1-dev as both the teacher model and the base model. First, we introduce Flow Trajectory Distillation (FTD) to distill a multi-step flow matching model into a one-step Real-ISR model. Second, to improve image realism and address high-frequency artifact issues in generated images, we propose TV-LPIPS as a perceptual loss and introduce Attention Diversification Loss (ADL) as a regularization term to reduce token similarity in the transformer, thereby eliminating high-frequency artifacts. Comprehensive experiments demonstrate that our method outperforms existing one-step diffusion-based Real-ISR methods. The code and models will be available.
Paperid:531
Authors:Jan Pauls · Max Zimmer · Berkant Turan · Sassan Saatchi · Philippe CIAIS · Sebastian Pokutta · Fabian Gieseke
Abstract:
With the rise in global greenhouse gas emissions, accurate large-scale tree canopy height maps are essential for understanding forest structure, estimating above-ground biomass, and monitoring ecological disruptions. To this end, we present a novel approach to generate large-scale, high-resolution canopy height maps over time. Our model accurately predicts canopy height over multiple years given Sentinel 2 time series satellite data. Using GEDI LiDAR data as the ground truth for training the model, we present the first 10 m resolution temporal canopy height map of the European continent for the period 2019–2022. As part of this product, we also offer a detailed canopy height map for 2020, providing more precise estimates than previous studies. Our pipeline and the resulting temporal height map are publicly available, enabling comprehensive large-scale monitoring of forests and, hence, facilitating future research and ecological analyses. For an interactive viewer, see https://europetreemap.projects.earthengine.app/view/europeheight.
Paperid:532
Authors:Shuting He · Guangquan Jie · Changshuo Wang · Yun Zhou · Shuming Hu · Guanbin Li · Henghui Ding
Abstract:
We introduce Referring 3D Gaussian Splatting Segmentation (R3DGS), a new task that focuses on segmenting target objects in a 3D Gaussian scene based on natural language descriptions. This task requires the model to identify newly described objects that may be occluded or not directly visible in a novel view, posing a significant challenge for 3D multi-modal understanding. Developing this capability is crucial for advancing embodied AI. To support research in this area, we construct the first R3DGS dataset, Ref-LERF. Our analysis reveals that 3D multi-modal understanding and spatial relationship modeling are key challenges for R3DGS. To address these challenges, we propose ReferSplat, a framework that explicitly models 3D Gaussian points with natural language expressions in a spatially aware paradigm. ReferSplat achieves state-of-the-art performance on both the newly proposed R3DGS task and 3D open-vocabulary segmentation benchmarks. Code, trained models, and the dataset will be publicly released.
Paperid:533
Authors:Simon Park · Abhishek Panigrahi · Yun Cheng · Dingli Yu · Anirudh Goyal · Sanjeev Arora
Abstract:
Vision Language Models (VLMs) are impressive at visual question answering and image captioning. But they underperform on multi-step visual reasoning---even compared to LLMs on the same tasks presented in text form---giving rise to perceptions of "modality imbalance" or "brittleness". Towards a systematic study of such issues, we introduce a synthetic framework for assessing the ability of VLMs to perform algorithmic visual reasoning, comprising three tasks: Table Readout, Grid Navigation, and Visual Analogy. Each has two levels of difficulty, SIMPLE and HARD, and even the SIMPLE versions are difficult for frontier VLMs. We propose strategies for training on the SIMPLE version of tasks that improve performance on the corresponding HARD task, i.e., simple-to-hard (S2H) generalization. This controlled setup, where each task also has an equivalent text-only version, allows a quantification of the modality imbalance and how it is impacted by training strategy. We show that 1) explicit image-to-text conversion is important in promoting S2H generalization on images, by transferring reasoning from text; 2) conversion can be internalized at test time. We also report results of a mechanistic study of this phenomenon. We identify measures of gradient alignment that can identify training strategies that promote better S2H generalization. Ablations highlight the importance of chain-of-thought.
Paperid:534
Authors:Ying Nie · Binwei Yan · Tianyu Guo · Zheyuan Bai · Mengyu Zheng · Hanting Chen · Jing Han
Abstract:
Despite recent advancements of fine-tuning open-source large language models (LLMs) to facilitate agent tasks, parameter-efficient fine-tuning (PEFT) methodologies for agents remain largely unexplored. In this paper, we introduce three key strategies for PEFT in agent tasks: 1) Inspired by the increasingly dominant Reason+Action paradigm, we first decompose the capabilities necessary for the agent tasks into three distinct roles: reasoner, executor, and summarizer. The reasoner is responsible for comprehending the user's query and determining the next role based on the execution trajectory. The executor is tasked with identifying the appropriate functions and parameters to invoke. The summarizer conveys the distilled information from conversations back to the user. 2) We then propose the Mixture-of-Roles (MoR) framework, which comprises three specialized Low-Rank Adaptation (LoRA) groups, each designated to fulfill a distinct role. By focusing on their respective specialized capabilities and engaging in collaborative interactions, these LoRAs collectively accomplish the overall agent task. 3) To effectively fine-tune the framework, we develop a multi-role data generation pipeline based on publicly available datasets, incorporating role-specific content completion and reliability verification. We conduct extensive experiments and thorough ablation studies on various LLMs and agent benchmarks, demonstrating the effectiveness of the proposed method.
Paperid:535
Authors:Yi-Fan Zhang · Min-Ling Zhang
Abstract:
Commonly used evaluation metrics in multi-label learning all involve base loss functions, and the theoretical guarantees of multi-label learning often rely on the properties of base loss functions. Some recent theoretical works have used the Lipschitz continuity of base loss functions to prove the generalization bounds for multi-label learning, but the impact of the smoothness of base loss functions on the generalization bounds is completely unknown. In an attempt to make up for this gap in the generalization theory of multi-label learning, we develop some novel vector-contraction inequalities for smooth base loss functions and derive tight generalization bounds with no dependency on the number of labels, up to logarithmic terms. We then exploit local Rademacher complexity to develop some novel local vector-contraction inequalities for smooth base loss functions, which induce generalization bounds with a tighter dependency on the number of labels and a faster convergence rate with respect to the number of examples. In addition, we derive tight generalization bounds with no dependency on the number of labels, up to logarithmic terms, for Macro-Averaged AUC by exploiting the Lipschitz continuity and smoothness of base loss functions, respectively. Our state-of-the-art theoretical results provide general theoretical guarantees for the generalization of multi-label learning.
Paperid:536
Authors:Lukas Lao Beyer · Tianhong Li · Xinlei Chen · Sertac Karaman · Kaiming He
Abstract:
Commonly used image tokenizers produce a 2D grid of spatially arranged tokens. In contrast, so-called "1D" image tokenizers represent images as highly compressed one-dimensional sequences of as few as 32 discrete tokens. We find that the high degree of compression achieved by a 1D tokenizer with vector quantization enables image editing and generative capabilities through heuristic manipulation of tokens, demonstrating that even very crude manipulations -- such as copying and replacing tokens between latent representations of images -- enable fine-grained image editing by transferring appearance and semantic attributes. Motivated by the expressivity of the 1D tokenizer's latent space, we construct an image generation pipeline leveraging gradient-based test-time optimization of tokens with plug-and-play loss functions such as reconstruction or CLIP similarity. Our approach is demonstrated for inpainting and text-guided image editing use cases, and can generate diverse and realistic samples without requiring training of any generative model.
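The "crude manipulations" mentioned, copying tokens between latent sequences, amount to a one-line operation. A hypothetical sketch (the function name is ours; in a real pipeline the edited sequence would then be decoded back to pixels with the tokenizer's decoder, not shown here):

```python
def transfer_tokens(src_tokens, tgt_tokens, positions):
    """Crude 1D-latent edit: copy the tokens at the given positions
    from a source image's token sequence into a target's, leaving the
    rest intact. Illustrative of the token-copying edits described."""
    out = list(tgt_tokens)
    for p in positions:
        out[p] = src_tokens[p]
    return out
```

Because the sequence is short (as few as 32 tokens), each copied token can carry a large share of the image's appearance or semantics, which is what makes such blunt edits surprisingly effective.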
Paperid:537
Authors:Ziang Chen · Xiaohan Chen · Jialin Liu · Xinshang Wang · Wotao Yin
Abstract:
Quadratic programming (QP) is the most widely applied category of problems in nonlinear programming. Many applications require real-time/fast solutions, though not necessarily with high precision. Existing methods either involve matrix decomposition or use the preconditioned conjugate gradient method. For relatively large instances, these methods cannot achieve the real-time requirement unless there is an effective preconditioner. Recently, graph neural networks (GNNs) opened new possibilities for QP. Some promising empirical studies of applying GNNs for QP tasks show that GNNs can capture key characteristics of an optimization instance and provide adaptive guidance accordingly to crucial configurations during the solving process, or directly provide an approximate solution. However, the theoretical understanding of GNNs in this context remains limited. Specifically, it is unclear what GNNs can and cannot achieve for QP tasks in theory. This work addresses this gap in the context of linearly constrained QP tasks. In the continuous setting, we prove that message-passing GNNs can universally represent fundamental properties of quadratic programs, including feasibility, optimal objective values, and optimal solutions. In the more challenging mixed-integer setting, while GNNs are not universal approximators, we identify a subclass of QP problems that GNNs can reliably represent.
Paperid:538
Authors:Christopher Scarvelis · David Benhaim · Paul Zhang
Abstract:
Orientation estimation is a fundamental task in 3D shape analysis which consists of estimating a shape's orientation axes: its side-, up-, and front-axes. Using this data, one can rotate a shape into canonical orientation, where its orientation axes are aligned with the coordinate axes. Developing an orientation algorithm that reliably estimates complete orientations of general shapes remains an open problem. We introduce a two-stage orientation pipeline that achieves state-of-the-art performance on up-axis estimation and further demonstrate its efficacy on full-orientation estimation, where one seeks all three orientation axes. Unlike previous work, we train and evaluate our method on all of ShapeNet rather than a subset of classes. We motivate our engineering contributions by theory describing fundamental obstacles to orientation estimation for rotationally-symmetric shapes, and show how our method avoids these obstacles.
Paperid:539
Authors:Jongwoo Ko · Tianyi Chen · Sungnyun Kim · Tianyu Ding · Luming Liang · Ilya Zharkov · Se-Young Yun
Abstract:
Despite the success of distillation in large language models (LLMs), most prior work applies identical loss functions to both teacher- and student-generated data. These strategies overlook the synergy between loss formulations and data types, leading to a suboptimal performance boost in student models. To address this, we propose DistiLLM-2, a contrastive approach that simultaneously increases the likelihood of teacher responses and decreases that of student responses by harnessing this synergy. Our extensive experiments show that DistiLLM-2 not only builds high-performing student models across a wide range of tasks, including instruction-following and code generation, but also supports diverse applications, such as preference alignment and vision-language extensions. These findings highlight the potential of a contrastive approach to enhance the efficacy of LLM distillation by effectively aligning teacher and student models across varied data types.
Paperid:540
Authors:Tom A. Lamb · Adam Davies · Alasdair J Paren · Phil Torr · Francesco Pinto
Abstract:
Despite the success of Instruction Tuning (IT) in training large language models (LLMs) to perform arbitrary user-specified tasks, these models often still leverage spurious or biased features learned from their training data, leading to undesired behaviours when deploying them in new contexts. In this work, we introduce Focus Instruction Tuning (FIT), which trains LLMs to condition their responses by focusing on specific features whilst ignoring others, leading to different behaviours based on what features are specified. Across several experimental settings, we show that focus-tuned models can be adaptively steered by focusing on different features at inference-time: for instance, robustness can be improved by focusing on core task features and ignoring spurious features, and social bias can be mitigated by ignoring demographic categories. Furthermore, FIT can steer behaviour in new contexts, generalising under distribution shift and to new unseen features at inference time, and thereby facilitating more robust, fair, and controllable LLM applications in real-world environments.
Paperid:541
Authors:Edith Cohen · Mihir Singhal · Uri Stemmer
Abstract:
Cardinality sketches are compact data structures that efficiently estimate the number of distinct elements across multiple queries while minimizing storage, communication, and computational costs. However, recent research has shown that these sketches can fail under {\em adaptively chosen queries}, breaking down after approximately $\tilde{O}(k^2)$ queries, where $k$ is the sketch size. In this work, we overcome this \emph{quadratic barrier} by designing robust estimators with fine-grained guarantees. Specifically, our constructions can handle an {\em exponential number of adaptive queries}, provided that each element participates in at most $\tilde{O}(k^2)$ queries. This effectively shifts the quadratic barrier from the total number of queries to the number of queries {\em sharing the same element}, which can be significantly smaller. Beyond cardinality sketches, our approach expands the toolkit for robust algorithm design.
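For background, a classic (non-robust) cardinality sketch of the kind this line of work builds on is the k-minimum-values estimator. The sketch below is standard textbook material, not the paper's robust construction: hash each element to $(0,1]$, keep the $k$ smallest hashes, and estimate the distinct count as $(k-1)/h_k$ where $h_k$ is the $k$-th smallest hash.

```python
import hashlib

def _h(x):
    """Hash an element to a float in (0, 1]."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64

def kmv_estimate(stream, k=64):
    """K-minimum-values cardinality sketch: keep the k smallest hash
    values; if the k-th smallest is h_k, estimate ~ (k-1)/h_k distinct
    elements. Non-robust under adaptive queries, which is exactly the
    failure mode the paper addresses."""
    mins = sorted({_h(x) for x in stream})[:k]
    if len(mins) < k:
        return len(mins)          # fewer than k distinct: count exactly
    return int((k - 1) / mins[-1])
```

With $k$ hash values the relative error is roughly $1/\sqrt{k}$; the adaptive-query failures arise when a querier can learn and exploit the hash function through repeated estimates.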
Paperid:542
Authors:chao ying · Jun Jin · Yi Guo · Xiudi Li · Muxuan Liang · Jiwei Zhao
Abstract:
Collecting gold-standard phenotype data via manual extraction is typically labor-intensive and slow, whereas automated computational phenotypes (ACPs) offer a systematic and much faster alternative. However, simply replacing the gold-standard with ACPs, without acknowledging their differences, could lead to biased results and misleading conclusions. Motivated by the complexity of incorporating ACPs while maintaining the validity of downstream analyses, in this paper, we consider a semi-supervised learning setting that consists of both labeled data (with gold-standard) and unlabeled data (without gold-standard), under the covariate shift framework. We develop doubly robust and semiparametrically efficient estimators that leverage ACPs for general target parameters in the unlabeled and combined populations. In addition, we carefully analyze the efficiency gains achieved by incorporating ACPs, comparing scenarios with and without their inclusion. Notably, we identify that ACPs for the unlabeled data, instead of for the labeled data, drive the enhanced efficiency gains. To validate our theoretical findings, we conduct extensive synthetic experiments and apply our method to a real-world dataset, confirming the practical advantages of our approach.
Paperid:543
Authors:Jiujiang Guo · Mankun Zhao · Wenbin Zhang · Tianyi Xu · Linying Xu · Yu Jian · Yu Mei · Yu Ruiguo
Abstract:
Existing research on temporal knowledge graph completion treats temporal information as supplementary, without modeling the varied characteristics of facts from a temporal perspective. This work summarizes the features of temporalized facts from both diachronic and synchronic perspectives: (1) Diachronicity. Facts often exhibit varying characteristics and trends across different temporal domains; (2) Synchronicity. In specific temporal contexts, various relations between entities influence each other, generating latent semantics. To address these issues, we design a quaternion-based model, TeDS, which divides timestamps into diachronic and synchronic timestamps to support dual temporal perception: (a) Two composite quaternions fusing time and relation information are generated by reorganizing the synchronic timestamp and relation quaternions, and the Hamilton operator realizes their interaction. (b) Each time point is sequentially mapped to an angle and converted into the scalar component of a quaternion using trigonometric functions to build the diachronic timestamp. We then rotate the relation by applying the Hamilton operator between it and the diachronic timestamp. In this way, TeDS achieves deep integration of relations and time while accommodating both perspectives. Empirically, TeDS significantly outperforms SOTA models on six benchmarks.
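The Hamilton operator referred to here is the standard (non-commutative) quaternion product; a minimal reference implementation (the embedding and scoring details are the paper's and are not reproduced):

```python
def hamilton(q, p):
    """Hamilton product of quaternions (a, b, c, d) = a + bi + cj + dk.
    Non-commutative, which is what makes quaternion 'rotations' of
    relation embeddings order-sensitive."""
    a1, b1, c1, d1 = q
    a2, b2, c2, d2 = p
    return (a1*a2 - b1*b2 - c1*c2 - d1*d2,
            a1*b2 + b1*a2 + c1*d2 - d1*c2,
            a1*c2 - b1*d2 + c1*a2 + d1*b2,
            a1*d2 + b1*c2 - c1*b2 + d1*a2)
```

The identities i*j = k but j*i = -k show the non-commutativity that the model exploits when composing relation and timestamp quaternions.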
Paperid:544
Authors:Gaurav Menghani · Ravi Kumar · Sanjiv Kumar
Abstract:
One of the core pillars of efficient deep learning methods is architectural improvements, such as residual/skip connections, which have led to significantly better model convergence and quality. Since their introduction, residual connections have become ubiquitous not only in convolutional neural networks but also in transformer-based architectures, the backbone of LLMs. In this paper, we introduce the Learned Augmented Residual Layer (LAuReL)---a novel generalization of the canonical residual connection---designed to serve as an in-situ replacement while outperforming it in both model quality and footprint metrics. Our experiments show that LAuReL can enhance model quality for both vision and language models while adding fewer parameters and incurring less latency and memory overhead than naively increasing parameter count. For example, on the ImageNet-1K task, LAuReL achieves the same model quality improvements as naively adding an extra layer while using 2.6x fewer parameters. Similarly, when pre-training 1B and 4B parameter LLMs, LAuReL improves performance on a variety of challenging downstream evaluation tasks by 2.54% to 20.05%, while adding only 0.012% and 0.1% additional parameters, respectively.
Paperid:545
Authors:Yejiang Wang · Yuhai Zhao · Zhengkui Wang · Wen Shan · Ling Li · Qian Li · Miaomiao Huang · Meixia Wang · Shirui Pan · Xingwei Wang
Abstract:
Graphs, fundamental in modeling various research subjects, consist of nodes linked by edges. However, they typically function as components within larger structures in real-world scenarios, such as in protein-protein interactions where each protein is a graph in a larger network. This study delves into the Graph-of-Net (GON), a structure that extends the concept of traditional graphs by representing each node as a graph itself. It provides a multi-level perspective on the relationships between objects, encapsulating both the detailed structure of individual nodes and the broader network of dependencies. To learn node representations within the GON, we propose a position-aware neural network for Graph-of-Net that processes both intra-graph and inter-graph connections and incorporates additional data like node labels. Our model employs dual encoders and graph constructors to build and refine a constraint network, where nodes are adaptively arranged based on their positions, as determined by the network's constraint system. Our model demonstrates significant improvements over baselines in empirical evaluations on various datasets.
Paperid:546
Authors:Ruiyuan Huang · Zengfeng Huang
Abstract:
Motivated by applications in online bidding and sleeping bandits, we examine the problem of contextual bandits with cross learning, where the learner observes the loss associated with the action across all possible contexts, not just the current round’s context. Our focus is on a setting where losses are chosen adversarially, and contexts are sampled i.i.d. from a specific distribution. This problem was first studied by Balseiro et al. (2019), who proposed an algorithm that achieves near-optimal regret under the assumption that the context distribution is known in advance. However, this assumption is often unrealistic. To address this issue, Schneider & Zimmert (2023) recently proposed a new algorithm that achieves nearly optimal expected regret. It is well-known that expected regret can be significantly weaker than high-probability bounds. In this paper, we present a novel, in-depth analysis of their algorithm and demonstrate that it actually achieves near-optimal regret with high probability. There are steps in the original analysis by Schneider & Zimmert (2023) that lead only to an expected bound by nature. In our analysis, we introduce several new insights. Specifically, we make extensive use of the weak dependency structure between different epochs, which was overlooked in previous analyses. Additionally, standard martingale inequalities are not directly applicable, so we refine martingale inequalities to complete our analysis.
Paperid:547
Authors:Kian Kenyon-Dean · Zitong Jerry Wang · John Urbanik · Konstantin Donhauser · Jason Hartford · Saber Saberian · Nil Sahin · Ihab Bendidi · Safiye Celik · Juan Vera · Marta Fay · Imran Haque · Oren Kraus
Abstract:
Deriving insights from experimentally generated datasets requires methods that can account for random and systematic measurement errors and remove them in order to accurately represent the underlying effects of the conditions being tested. Here we present a framework for pretraining on large-scale microscopy datasets that includes three steps: (1) curating a set of diverse and self-consistent training samples, (2) scaling training of an appropriate foundation model architecture on this dataset, (3) evaluating intermediate layers of the trained model to identify the best representation for downstream tasks. Using this strategy, we present the largest foundation model for cell microscopy data to our knowledge, a new 1.9 billion-parameter ViT-G/8 MAE trained on over 8 billion microscopy image crops. Compared to a previously published ViT-L/8 MAE, our new model achieves a 60% improvement in linear separability of genetic perturbations and obtains the best overall performance on whole-genome relationship recall, batch correction replicate consistency, and compound-gene activity prediction benchmarks.
Paperid:548
Authors:Jae-Hong Lee
Abstract:
Test-time adaptation (TTA) addresses the machine learning challenge of adapting models to unlabeled test data from shifting distributions in dynamic environments. A key issue in this online setting arises from using unsupervised learning techniques, which introduce explicit gradient noise that degrades model weights. To address weight degradation, we propose a Bayesian weight enhancement framework, which generalizes existing weight-based TTA methods that effectively mitigate the issue. Our framework enables robust adaptation to distribution shifts by accounting for diverse weights through modeling weight distributions. Building on our framework, we identify a key limitation in existing methods: their neglect of the time-varying covariance, which reflects the influence of the gradient noise. To address this gap, we propose a novel steady-state adaptation (SSA) algorithm that balances covariance dynamics during adaptation. SSA is derived through the solution of a stochastic differential equation for the TTA process and online inference. The resulting algorithm incorporates a covariance-aware learning rate adjustment mechanism. Through extensive experiments, we demonstrate that SSA consistently improves state-of-the-art methods in various TTA scenarios, datasets, and model architectures, establishing its effectiveness in terms of stability and adaptability.
Paperid:549
Authors:Boyang Sun · Yu Yao · Xinshuai Dong · Zongfang Liu · Tongliang Liu · Yumou Qiu · Kun Zhang
Abstract:
The conditional independence (CI) test is a fundamental concept in statistics. In many real-world scenarios, some variables may be difficult to measure accurately, often leading to data being represented as discretized values. Applying CI tests directly to discretized data, however, can lead to incorrect conclusions about the independence of latent variables. To address this, recent advancements have sought to infer the correct CI relationship between the latent variables by binarizing the observed data. However, this process results in a loss of information, which degrades the test's performance, particularly with small sample sizes. Motivated by this limitation, this paper introduces a new sample-efficient CI test that does not rely on the binarization process. We find that the relationship can be established by addressing an \textit{over-identifying} restriction problem with the \textit{Generalized Method of Moments} (GMM). Based on this finding, we have designed a new test statistic, and its asymptotic distribution has been derived. Empirical results across various datasets show that our method consistently outperforms existing ones.
Paperid:550
Authors:Subhash Kantamneni · Josh Engels · Senthooran Rajamanoharan · Max Tegmark · Neel Nanda
Abstract:
Sparse autoencoders (SAEs) are a popular method for interpreting concepts represented in large language model (LLM) activations. However, there is a lack of evidence regarding the validity of their interpretations due to the lack of a ground truth for the concepts used by an LLM, and a growing number of works have presented problems with current SAEs. One alternative source of evidence would be demonstrating that SAEs improve performance on downstream tasks beyond existing baselines. We test this by applying SAEs to the real-world task of LLM activation probing in four regimes: data scarcity, class imbalance, label noise, and covariate shift. Due to the difficulty of detecting concepts in these challenging settings, we hypothesize that SAEs’ basis of interpretable, concept-level latents should provide a useful inductive bias. However, although SAEs occasionally perform better than baselines on individual datasets, we are unable to design ensemble methods combining SAEs with baselines that consistently outperform ensemble methods solely using baselines. Additionally, although SAEs initially appear promising for identifying spurious correlations, detecting poor dataset quality, and training multi-token probes, we are able to achieve similar results with simple non-SAE baselines as well. Though we cannot discount SAEs’ utility on other tasks, our findings highlight the shortcomings of current SAEs and the need to rigorously evaluate interpretability methods on downstream tasks with strong baselines.
Paperid:551
Authors:Guixiang Wang · Jianjun Li
Abstract:
Motion transfer transfers the pose in a driving video to the object of a source image, so that the object of the source image moves. Although great progress has been made recently in unsupervised motion transfer, many unsupervised methods still struggle to accurately model large displacement motions when large motion differences occur between the source and driving images. To solve this problem, we propose an unsupervised anytime-interpolation-based large displacement motion transfer method, which can generate a series of anytime interpolated images between the source and driving images. By decomposing a large displacement motion into many small displacement motions, the difficulty of large displacement motion estimation is reduced. In the process, we design a selector that can choose the optimal interpolated image from the generated interpolated images for downstream tasks. Since there are no real images as labels in the interpolation process, we propose a bidirectional training strategy. Some constraints are added to the optimal interpolated image to generate a reasonable interpolated image. To encourage the network to generate high-quality images, a pre-trained Vision Transformer model is used to design constraint losses. Finally, experiments show that, compared with the large displacement motion between the source and driving images, the small displacement motion between the interpolated and driving images makes motion transfer easier to realize. Compared with existing state-of-the-art methods, our method achieves significant improvements in motion-related metrics.
Paperid:552
Authors:Javan Tahir · Surya Ganguli · Grant Rotskoff
Abstract:
With the emergence of large-scale pre-trained neural networks, methods to adapt such foundation models to data-limited downstream tasks have become a necessity. Fine-tuning, preference optimization, and transfer learning have all been successfully employed for these purposes when the target task closely resembles the source task, but a precise theoretical understanding of task similarity is still lacking. We adopt a feature-centric viewpoint on transfer learning and establish a number of theoretical results that demonstrate that when the target task is well represented by the feature space of the pre-trained model, transfer learning outperforms training from scratch. We study deep linear networks as a minimal model of transfer learning in which we can analytically characterize the transferability phase diagram as a function of the target dataset size and the feature space overlap. For this model, we establish rigorously that when the feature space overlap between the source and target tasks is sufficiently strong, both linear transfer and fine-tuning improve performance, especially in the low data limit. These results build on an emerging understanding of feature learning dynamics in deep linear networks, and we demonstrate numerically that the rigorous results we derive for the linear case also apply to nonlinear networks.
Paperid:553
Authors:David Reber · Sean Richardson · Todd Nief · Cristina Garbacea · Victor Veitch
Abstract:
Reward models are widely used as proxies for human preferences when aligning or evaluating LLMs. However, reward models are black boxes, and it is often unclear what, exactly, they are actually rewarding. In this paper we develop the Rewrite-based Attribute Treatment Estimator (RATE) as an effective method for measuring the sensitivity of a reward model to high-level attributes of responses, such as sentiment, helpfulness, or complexity. Importantly, RATE measures the causal effect of an attribute on the reward. RATE uses LLMs to rewrite responses to produce imperfect counterfactual examples that can be used to measure causal effects. A key challenge is that these rewrites are imperfect in a manner that can induce substantial bias in the estimated sensitivity of the reward model to the attribute. The core idea of RATE is to adjust for this imperfect-rewrite effect by rewriting twice. We establish the validity of the RATE procedure and show empirically that it is an effective estimator.
Paperid:554
Authors:Yung-Sung Chuang · Benjamin Cohen-Wang · Shannon Shen · Zhaofeng Wu · Hu Xu · Xi Victoria Lin · Shang-Wen Li · James Glass · Scott Yih
Abstract:
We introduce SelfCite, a novel self-supervised approach that aligns LLMs to generate high-quality, fine-grained, sentence-level citations for the statements in their generated responses. Instead of only relying on costly and labor-intensive annotations, SelfCite leverages a reward signal provided by the LLM itself through context ablation: If a citation is necessary, removing the cited text from the context should prevent the same response; if sufficient, retaining the cited text alone should preserve the same response. This reward can guide the inference-time best-of-N sampling strategy to improve citation quality significantly, as well as be used in preference optimization to directly fine-tune the models for generating better citations. The effectiveness of SelfCite is demonstrated by increasing citation F1 up to 5.3 points on the LongBench-Cite benchmark across five long-form question answering tasks.
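The necessity/sufficiency logic of context ablation can be sketched with a toy scorer in place of an LLM log-probability. Everything here is a stand-in: `toy_logprob`, the example strings, and the way the two signals are combined are illustrative assumptions, not the paper's actual reward.

```python
def toy_logprob(response, context):
    """Toy stand-in for an LLM log-probability: the fraction of response
    words that appear in the context. Not a real language model."""
    ctx = set(context.split())
    words = response.split()
    return sum(w in ctx for w in words) / len(words)

context = "the moon orbits the earth and the earth orbits the sun"
cited = "the earth orbits the sun"
response = "earth orbits sun"

full = toy_logprob(response, context)                          # cite in context
without_cited = toy_logprob(response, context.replace(cited, ""))  # cite removed
cited_only = toy_logprob(response, cited)                      # cite alone

# Necessity: removing the cited text should hurt the response's likelihood.
# Sufficiency: the cited text alone should keep the likelihood high.
necessity = full - without_cited
sufficiency = cited_only - full
reward = necessity + sufficiency
print(reward > 0)  # a useful citation earns a positive reward here
```

In the real method these scores come from the LLM's own probabilities over its generated response, which is what makes the reward self-supervised.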
Paperid:555
Authors:Zhitong Xu · Da Long · Yiming Xu · Guang Yang · Shandian Zhe · Houman Owhadi
Abstract:
We introduce a novel kernel learning framework toward efficiently solving nonlinear partial differential equations (PDEs). In contrast to the state-of-the-art kernel solver that embeds differential operators within kernels, posing challenges with a large number of collocation points, our approach eliminates these operators from the kernel. We model the solution using a standard kernel interpolation form and differentiate the interpolant to compute the derivatives. Our framework obviates the need for complex Gram matrix construction between solutions and their derivatives, allowing for a straightforward implementation and scalable computation. As an instance, we allocate the collocation points on a grid and adopt a product kernel, which yields a Kronecker product structure in the interpolation. This structure enables us to avoid computing the full Gram matrix, reducing costs and scaling efficiently to a large number of collocation points. We provide a proof of the convergence and rate analysis of our method under appropriate regularity assumptions. In numerical experiments, we demonstrate the advantages of our method in solving several benchmark PDEs.
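The Kronecker structure the abstract mentions can be verified directly: with grid collocation points and a product kernel, the full Gram matrix factors as the Kronecker product of the per-axis 1D Gram matrices. The RBF kernel and toy grid sizes below are illustrative choices, not the paper's setup.

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """1D squared-exponential kernel matrix between point sets a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

# Collocation points on a 2D grid with a product kernel:
# k((x1, y1), (x2, y2)) = kx(x1, x2) * ky(y1, y2).
xs = np.linspace(0, 1, 4)
ys = np.linspace(0, 1, 3)
Kx, Ky = rbf(xs, xs), rbf(ys, ys)

# Direct construction of the full Gram matrix over all 12 grid points
# (x-major ordering, matching the Kronecker convention) ...
grid = np.array([(x, y) for x in xs for y in ys])
K_full = np.array([
    [rbf(np.array([a[0]]), np.array([b[0]]))[0, 0] *
     rbf(np.array([a[1]]), np.array([b[1]]))[0, 0]
     for b in grid]
    for a in grid
])

# ... equals the Kronecker product of the small 1D factors, which is what
# lets a solver avoid ever forming or factorizing the full matrix.
print(np.allclose(K_full, np.kron(Kx, Ky)))
```

Operations with `kron(Kx, Ky)` (matrix-vector products, solves via per-factor eigendecompositions) cost roughly the sum, not the product, of the per-axis sizes, which is the source of the claimed scalability.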
Paperid:556
Authors:Khai Nguyen · Hai Nguyen · Tuan Pham · Nhat Ho
Abstract:
We introduce the sliced optimal transport dataset distance (s-OTDD), a model-agnostic, embedding-agnostic approach for dataset comparison that requires no training, is robust to variations in the number of classes, and can handle disjoint label sets. The core innovation is Moment Transform Projection (MTP), which maps a label, represented as a distribution over features, to a real number. Using MTP, we derive a data point projection that transforms datasets into one-dimensional distributions. The s-OTDD is defined as the expected Wasserstein distance between the projected distributions, with respect to random projection parameters. Leveraging the closed-form solution of one-dimensional optimal transport, s-OTDD achieves (near-)linear computational complexity in the number of data points and feature dimensions and is independent of the number of classes. With its geometrically meaningful projection, s-OTDD strongly correlates with the optimal transport dataset distance while being more efficient than existing dataset discrepancy measures. Moreover, it correlates well with the performance gap in transfer learning and classification accuracy in data augmentation.
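The closed-form 1D Wasserstein distance that makes this approach cheap amounts to matching sorted samples. The projection below is a plain random linear map used as a toy stand-in for MTP (which additionally encodes labels), so only the slicing mechanics are shown, not the paper's actual projection.

```python
import numpy as np

rng = np.random.default_rng(0)

def wasserstein_1d(u, v, p=2):
    """Closed-form p-Wasserstein distance between two 1D empirical
    distributions with equal sample counts: match sorted samples."""
    u, v = np.sort(u), np.sort(v)
    return np.mean(np.abs(u - v) ** p) ** (1 / p)

def project(X, theta):
    # Toy stand-in for one random data-point projection; the real MTP also
    # maps labels (distributions over features) to reals.
    return X @ theta

# Two toy "datasets" of 500 feature vectors, the second shifted by 0.5.
X1 = rng.standard_normal((500, 5))
X2 = rng.standard_normal((500, 5)) + 0.5

theta = rng.standard_normal(5)
theta /= np.linalg.norm(theta)          # one random slicing direction

d = wasserstein_1d(project(X1, theta), project(X2, theta))
print(d > 0)  # shifted datasets are a positive distance apart
```

The sliced distance averages this quantity over many random projection parameters; each slice costs only a sort, giving near-linear complexity in the number of points.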
Paperid:557
Authors:Jiawei Ge · Yuanhao Wang · Wenzhe Li · Chi Jin
Abstract:
This paper examines multiplayer symmetric constant-sum games with more than two players in a competitive setting, such as Mahjong, Poker, and various board and video games. In contrast to two-player zero-sum games, equilibria in multiplayer games are neither unique nor non-exploitable, failing to provide meaningful guarantees when competing against opponents who play different equilibria or non-equilibrium strategies. This gives rise to a series of long-lasting fundamental questions in multiplayer games regarding suitable objectives, solution concepts, and principled algorithms. This paper takes an initial step towards addressing these challenges by focusing on the natural objective of *equal share*—securing an expected payoff of $C/n$ in an $n$-player symmetric game with a total payoff of $C$. We rigorously identify the theoretical conditions under which achieving an equal share is tractable and design a series of efficient algorithms, inspired by no-regret learning, that *provably* attain approximate equal share across various settings. Furthermore, we provide complementary lower bounds that justify the sharpness of our theoretical results. Our experimental results highlight worst-case scenarios where meta-algorithms from prior state-of-the-art systems for multiplayer games fail to secure an equal share, while our algorithm succeeds, demonstrating the effectiveness of our approach.
Paperid:558
Authors:Supratim Shit · Gurmehak chadha · Surendra kumar · Bapi Chatterjee
Abstract:
A coreset, as a summary of the training data, offers an efficient approach for reducing data processing and storage complexity during training. In the emerging vertical federated learning (VFL) setting, where scattered clients store different data features, it directly reduces communication complexity. In this work, we introduce coreset constructions for regularized logistic regression in both centralized and VFL settings. Additionally, we improve the coreset size for regularized linear regression in the VFL setting. We also eliminate the dependency of the coreset size on a property of the data due to the VFL setting. The improvement in the coreset sizes is due to our novel coreset construction algorithms that capture the reduced model complexity due to the added regularization and its subsequent analysis. In experiments, we provide extensive empirical evaluation that backs our theoretical claims. We also compare the performance of our coresets in terms of the distance between the models trained on the complete data and on the coreset.
Paperid:559
Authors:Raja Sunkara · Ardhendu Tripathy
Abstract:
Applications such as engineering design often require us to optimize a black-box function, i.e., a system whose inner processing is not analytically known and whose gradients are not available. Practitioners often have a fixed budget for the number of function evaluations and the performance of an optimization algorithm is measured by its simple regret. In this paper, we study the class of ``Optimistic Optimization'' algorithms for black-box optimization that use a partitioning scheme for the domain. We develop algorithms that learn a good partitioning scheme and use flexible surrogate models such as neural networks in the optimization procedure. For multi-index functions on an $m$-dimensional subspace within $d$ dimensions, our algorithm attains $\tilde{O}(n^{-\beta / d})$ regret, where $\beta = 1 + \frac{d-m}{2m-1}$, as opposed to $\tilde{O}(n^{-1/d})$ for SequOOL, a state-of-the-art optimistic optimization algorithm. Our approach is competitive across a wide range of numerical benchmarks. Additionally, we introduce weight quantization in a large language model as a novel task for black-box optimization. Our approach improves the quality of Activation-aware Weight Quantization (AWQ) of the OPT-1.3B model, achieving an approximate 10\% improvement in performance relative to the best possible unquantized model.
Paperid:560
Authors:Nate Veldt · Thomas Stanley · Benjamin Priest · Trevor Steil · Keita Iwabuchi · T.S. Jayram · Geoffrey Sanders
Abstract:
Finding a minimum spanning tree (MST) for $n$ points in an arbitrary metric space is a fundamental primitive for hierarchical clustering and many other ML tasks, but this takes $\Omega(n^2)$ time to even approximate. We introduce a framework for metric MSTs that first (1) finds a forest of trees using practical heuristics, and then (2) finds a small weight set of edges to connect disjoint components in the forest into a spanning tree. We prove that optimally solving step (2) still takes $\Omega(n^2)$ time, but we provide a subquadratic 2.62-approximation algorithm. In the spirit of learning-augmented algorithms, we then show that if the heuristic forest found in step (1) overlaps with an optimal MST, we can approximate the original MST problem in subquadratic time, where the approximation factor depends on a measure of overlap. In practice, we find nearly optimal spanning trees for a wide range of metrics, while being orders of magnitude faster than exact algorithms.
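Step (2) of the framework can be illustrated with a simple greedy baseline: repeatedly add the cheapest edge joining two different forest components, Kruskal-style. This quadratic toy is only a baseline to show what "connecting the forest" means; it is not the paper's subquadratic 2.62-approximation, and the 1D metric is an illustrative assumption.

```python
from itertools import combinations

def connect_forest(points, components):
    """Connect disjoint forest components into one spanning structure by
    repeatedly adding the cheapest edge between two different components."""
    comp = {}
    for cid, members in enumerate(components):
        for v in members:
            comp[v] = cid

    def dist(a, b):
        return abs(points[a] - points[b])  # toy 1D metric

    added = []
    while len(set(comp.values())) > 1:
        # Cheapest edge whose endpoints lie in different components.
        a, b = min(
            ((u, w) for u, w in combinations(range(len(points)), 2)
             if comp[u] != comp[w]),
            key=lambda e: dist(*e),
        )
        added.append((a, b))
        old, new = comp[b], comp[a]
        for v in comp:                      # merge the two components
            if comp[v] == old:
                comp[v] = new
    return added

# Three heuristic forest components over five 1D points.
pts = [0.0, 0.1, 5.0, 5.2, 9.0]
edges = connect_forest(pts, [[0, 1], [2, 3], [4]])
print(len(edges))  # connecting k components always takes k - 1 edges
```

Connecting $k$ components always requires exactly $k-1$ edges; the hard part, which the paper addresses, is choosing them with small total weight without paying $\Omega(n^2)$ time.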
Paperid:561
Authors:Charita Dellaporta · Patrick O'Hara · Theodoros Damoulas
Abstract:
Decision making under uncertainty is challenging as the data-generating process (DGP) is often unknown. Bayesian inference proceeds by estimating the DGP through posterior beliefs on the model’s parameters. However, minimising the expected risk under these beliefs can lead to suboptimal decisions due to model uncertainty or limited, noisy observations. To address this, we introduce Distributionally Robust Optimisation with Bayesian Ambiguity Sets (DRO-BAS) which hedges against model uncertainty by optimising the worst-case risk over a posterior-informed ambiguity set. We provide two such sets, based on the posterior expectation (DRO-BAS(PE)) or the posterior predictive (DRO-BAS(PP)) and prove that both admit, under conditions, strong dual formulations leading to efficient single-stage stochastic programs which are solved with a sample average approximation. For DRO-BAS(PE), this covers all conjugate exponential family members while for DRO-BAS(PP) this is shown under conditions on the predictive's moment-generating function. Our DRO-BAS formulations outperform existing Bayesian DRO on the Newsvendor problem and achieve faster solve times with comparable robustness on the Portfolio problem.
Paperid:562
Authors:Weiqiu You · Helen Qu · Marco Gatti · Bhuvnesh Jain · Eric Wong
Abstract:
Self-attributing neural networks (SANNs) present a potential path towards interpretable models for high-dimensional problems, but often face significant trade-offs in performance. In this work, we formally prove a lower bound on errors of per-feature SANNs, whereas group-based SANNs can achieve zero error and thus high performance. Motivated by these insights, we propose Sum-of-Parts (SOP), a framework that transforms any differentiable model into a group-based SANN, where feature groups are learned end-to-end without group supervision. SOP achieves state-of-the-art performance for SANNs on vision and language tasks, and we validate that the groups are interpretable on a range of quantitative and semantic metrics. We further validate the utility of SOP explanations in model debugging and cosmological scientific discovery.
Paperid:563
Authors:Liam Hodgkinson · Zhichao Wang · Michael Mahoney
Abstract:
Recent theoretical and empirical successes in deep learning, including the celebrated neural scaling laws, are punctuated by observations that many objects of interest tend to exhibit some form of heavy-tailed or power law behavior. In particular, the prevalence of heavy-tailed spectral densities in the Jacobian, Hessian, and weight matrices of deep neural networks has recently been coined as heavy-tailed mechanistic universality (HT-MU). Empirical evidence suggests a robust correlation between power law exponents and model performance, indicating that HT-MU may be a fundamental aspect of deep learning efficacy. Here, we propose a general family of random feature matrix models---the high-temperature inverse-Wishart ensemble---to explore the attributes which give rise to heavy-tailed behavior in trained neural network models. Under this model, spectral densities with power laws on both tails arise through a combination of three independent factors: (i) complex correlation structures in the data; (ii) reduced temperatures during training; and (iii) reduced eigenvector entropy appearing as implicit bias in the model structure. Implications of our model on other appearances of heavy tails, including the five phases of learning, neural scaling laws, and optimizer trajectories, are also discussed.
Paperid:564
Authors:Dihan Zheng · Bo Huang
Abstract:
Fluorescence microscopy is ubiquitously used in cell biology research to characterize the cellular role of a protein. To help elucidate the relationship between the amino acid sequence of a protein and its cellular function, we introduce CELL-Diff, a unified diffusion model facilitating bidirectional transformations between protein sequences and their corresponding microscopy images. Utilizing reference cell morphology images and a protein sequence, CELL-Diff efficiently generates corresponding protein images. Conversely, given a protein image, the model outputs protein sequences. CELL-Diff integrates continuous and diffusion models within a unified framework and is implemented using a transformer-based network. We train CELL-Diff on the Human Protein Atlas (HPA) dataset and fine-tune it on the OpenCell dataset. Experimental results demonstrate that CELL-Diff outperforms existing methods in generating high-fidelity protein images, making it a practical tool for investigating subcellular protein localization and interactions.
Paperid:565
Authors:Mengmeng Ma · Tang Li · Yunxiang Peng · LIN LU · Volkan Beylergil · Binsheng Zhao · Oguz Akin · Xi Peng
Abstract:
Medical AI models excel at tumor detection and segmentation. However, their latent representations often lack explicit ties to clinical semantics, producing outputs less trusted in clinical practice. Most of the existing models generate either segmentation masks/labels (localizing where without why) or textual justifications (explaining why without where), failing to ground clinical concepts in spatially localized evidence. To bridge this gap, we propose to develop models that can justify the segmentation or detection using clinically relevant terms and point to visual evidence. We address two core challenges: First, we curate a rationale dataset to tackle the lack of paired images, annotations, and textual rationales for training. The dataset includes 180K image-mask-rationale triples with quality evaluated by expert radiologists. Second, we design rationale-informed optimization that disentangles and localizes fine-grained clinical concepts in a self-supervised manner without requiring pixel-level concept annotations. Experiments across medical benchmarks show our model demonstrates superior performance in segmentation, detection, and beyond. Our code is available via an anonymous link.
Paperid:566
Authors:Hang Zhou · Yuezhou Ma · Haixu Wu · Haowen Wang · Mingsheng Long
Abstract:
Deep models have recently emerged as promising tools to solve partial differential equations (PDEs), known as neural PDE solvers. While neural solvers trained from either simulation data or physics-informed loss can solve PDEs reasonably well, they are mainly restricted to a few instances of PDEs, e.g. a certain equation with a limited set of coefficients. This limits their generalization to diverse PDEs, preventing them from being practical surrogate models for classical numerical solvers. In this paper, we present the Universal Neural PDE Solver (Unisolver) capable of solving a wide scope of PDEs by training a novel Transformer model on diverse data and conditioned on diverse PDEs. Instead of purely scaling up data and parameters, Unisolver stems from the theoretical analysis of the PDE-solving process. Inspired by the mathematical structure of PDEs that a PDE solution is fundamentally governed by a series of PDE components such as equation symbols, coefficients, and boundary conditions, we define a complete set of PDE components and flexibly embed them as domain-wise and point-wise deep conditions for Transformer PDE solvers. Integrating physical insights with recent Transformer advances, Unisolver achieves consistent state-of-the-art results on three challenging large-scale benchmarks, showing impressive performance gains and favorable PDE generalizability.
Paperid:567
Authors:Sai Advaith Maddipatla · Nadav Bojan · Meital Bojan · Sanketh Vedula · Paul Schanda · Ailie Marx · Alexander Bronstein
Abstract:
Proteins exist as a dynamic ensemble of multiple conformations, and these motions are often crucial for their functions. However, current structure prediction methods predominantly yield a single conformation, overlooking the conformational heterogeneity revealed by diverse experimental modalities. Here, we present a framework for building experiment-grounded protein structure generative models that infer conformational ensembles consistent with measured experimental data. The key idea is to treat state-of-the-art protein structure predictors (e.g., AlphaFold3) as sequence-conditioned structural priors, and cast ensemble modeling as posterior inference of protein structures given experimental measurements. Through extensive real-data experiments, we demonstrate the generality of our method to incorporate a variety of experimental measurements. In particular, our framework uncovers previously unmodeled conformational heterogeneity from crystallographic densities, generates high-accuracy NMR ensembles orders of magnitude faster than the status quo, and incorporates pairwise cross-link constraints. Notably, we demonstrate that our ensembles outperform AlphaFold3 and sometimes fit experimental data better than structures publicly deposited in the Protein Data Bank (PDB). We believe that this approach will unlock building predictive models that fully embrace experimentally observed conformational diversity.
Paperid:568
Authors:Fangxu Yu · Lai Jiang · Haoqiang Kang · Shibo Hao · Lianhui Qin
Abstract:
The ability to generate diverse solutions to a given problem is a hallmark of human creativity. This divergent reasoning is also crucial for machines, enhancing their robustness and enabling them to assist humans in many applications such as scientific discovery. However, existing approaches to multi-step reasoning with large language models (LLMs) have mostly focused only on reasoning accuracy, without further discovering more diverse valid solutions. For example, supervised fine-tuning can improve LLM reasoning quality, but requires extensive supervised data to capture the full range of possible solutions. Reinforcement learning aims to find limited highest-reward solutions while neglecting the solution diversity. To fill this gap, we propose Flow of Reasoning (FoR), an efficient diversity-seeking LLM finetuning method aimed at improving reasoning quality and diversity with minimal data. FoR formulates multi-step LLM reasoning as a Markovian flow on a DAG-structured reasoning graph. This formulation allows us to incorporate and adapt principled GFlowNet approaches, for finetuning LLMs to sample divergent paths with probabilities proportional to the (unnormalized) reward of target problems. Extensive experiments show that, with limited training examples (e.g., 15 examples), FoR enables the discovery of diverse, creative, high-quality solutions, greatly outperforming a wide range of existing inference and training methods across six challenging reasoning tasks, including BlocksWorld (embodied reasoning), Game24 (math puzzle solving), Rubik's Cube (spatial reasoning), 1D-ARC (abstraction reasoning), GSM8k (math reasoning), and ProntoQA (logical reasoning).
Paperid:569
Authors:Daniil Tiapkin · Daniele Calandriello · Johan Ferret · Sarah Perrin · Nino Vieillard · Alexandre Rame · Mathieu Blondel
Abstract:
Post-training of language models (LMs) increasingly relies on the following two stages: (i) knowledge distillation, where the LM is trained to imitate a larger teacher LM, and (ii) reinforcement learning from human feedback (RLHF), where the LM is aligned by optimizing a reward model. In the second RLHF stage, a well-known challenge is reward hacking, where the LM over-optimizes the reward model, leading to degraded performance on the true objective, in line with Goodhart's law. In this paper, we investigate whether a similar phenomenon, that we call teacher hacking, can occur during knowledge distillation. This could arise because the teacher LM is itself an imperfect approximation of the true distribution. To study this, we propose a controlled experimental setup involving: (i) an oracle LM representing the ground-truth distribution, (ii) a teacher LM distilled from the oracle, and (iii) a student LM distilled from the teacher. Our experiments reveal the following insights. When using a fixed offline dataset for distillation, teacher hacking occurs; moreover, we can detect it by observing when the optimization process deviates from polynomial convergence laws. In contrast, employing online data generation techniques effectively mitigates teacher hacking. More precisely, we identify data diversity as the key factor in preventing hacking. Overall, our findings provide a deeper understanding of the benefits and limitations of distillation for building robust LMs.
Paperid:570
Authors:chengqian gao · Haonan Li · Liu Liu · Zeke Xie · Peilin Zhao · Zhiqiang Xu
Abstract:
The alignment of large language models (LLMs) often assumes that using more clean data yields better outcomes, overlooking the match between model capacity and example difficulty. Challenging this, we propose a new principle: preference data vary in difficulty, and overly difficult examples hinder alignment by exceeding the model's capacity. Through systematic experimentation, we validate this principle with three key findings: (1) preference examples vary in difficulty, as evidenced by consistent learning orders across alignment runs; (2) overly difficult examples significantly degrade performance across four LLMs and two datasets; and (3) the capacity of a model dictates its threshold for handling difficult examples, underscoring a critical relationship between data selection and model capacity. Building on this principle, we introduce Selective DPO, which filters out overly difficult examples. This simple adjustment improves alignment performance by 9-16\% in win rates on the AlpacaEval 2 benchmark compared to the DPO baseline, surpassing a series of DPO variants with different algorithmic adjustments. Together, these results illuminate the importance of aligning data difficulty with model capacity, offering a transformative perspective for improving alignment strategies in LLMs.
Paperid:571
Authors:Huanjian Zhou · Masashi Sugiyama
Abstract:
Sampling from high-dimensional probability distributions is fundamental in machine learning and statistics. As datasets grow larger, computational efficiency becomes increasingly important, particularly in reducing \emph{adaptive complexity}, namely the number of sequential rounds required for sampling algorithms. While recent works have introduced several parallelizable techniques, they often exhibit suboptimal convergence rates and remain significantly weaker than the latest lower bounds for log-concave sampling. To address this, we propose a novel parallel sampling method that improves the adaptive-complexity dependence on dimension $d$, reducing it from $\widetilde{\mathcal{O}}(\log^2 d)$ to $\widetilde{\mathcal{O}}(\log d)$, which is optimal for log-concave sampling under certain adaptive-complexity measures. Our approach builds on parallel simulation techniques from scientific computing.
Paperid:572
Authors:Wenhui Zhu · Peijie Qiu · Xiwen Chen · Zhangsihao Yang · Aristeidis Sotiras · Abolfazl Razi · Yalin Wang
Abstract:
Multiple Instance Learning (MIL) is a popular weakly-supervised method for various applications, with a particular interest in histological whole slide image (WSI) classification. Due to the gigapixel resolution of WSIs, applications of MIL in WSI classification typically necessitate a two-stage training scheme: first, extract features from a pre-trained backbone, and then perform MIL aggregation. However, it is well-known that this suboptimal training scheme suffers from "noisy" feature embeddings from the backbone and inherent weak supervision, hindering MIL from learning rich and generalizable features. Yet the most commonly used technique for mitigating such issues (i.e., dropout) has not been explored in MIL. In this paper, we empirically explore how effective dropout can be in MIL. Interestingly, we observe that dropping the top-k most important instances within a bag leads to better performance and generalization, even under noise attacks. Based on this key observation, we propose a novel MIL-specific dropout method, termed MIL-Dropout, which systematically determines which instances to drop. Experiments on five MIL benchmark datasets and two WSI datasets demonstrate that MIL-Dropout boosts the performance of current MIL methods with negligible computational cost. The code will be published after acceptance.
Paperid:573
Authors:Adam Karvonen · Can Rager · Johnny Lin · Curt Tigges · Joseph Bloom · David Chanin · Callum McDougall · Yeu-Tong Lau · Eoin Farrell · Arthur Conmy · Kola Ayonrinde · Demian Till · Matthew Wearden · Samuel Marks · Neel Nanda
Abstract:
Sparse autoencoders (SAEs) are a popular technique for interpreting language model activations, and there is extensive recent work on improving SAE effectiveness. However, most prior work evaluates progress using unsupervised proxy metrics with unclear practical relevance. We introduce SAEBench, a comprehensive evaluation suite that measures SAE performance across seven diverse metrics, spanning interpretability, feature disentanglement and practical applications like unlearning. To enable systematic comparison, we open-source a suite of over 200 SAEs across seven recently proposed SAE architectures and training algorithms. Our evaluation reveals that gains on proxy metrics do not reliably translate to better practical performance. For instance, while Matryoshka SAEs slightly underperform on existing proxy metrics, they substantially outperform other architectures on feature disentanglement metrics; moreover, this advantage grows with SAE scale. By providing a standardized framework for measuring progress in SAE development, SAEBench enables researchers to study scaling trends and make nuanced comparisons between different SAE architectures and training methodologies. Our interactive interface enables researchers to flexibly visualize relationships between metrics across hundreds of open-source SAEs at www.saebench.xyz.
Paperid:574
Authors:Haruka Kiyohara · Fan Yao · Sarah Dean
Abstract:
In two-sided platforms (e.g., video streaming or e-commerce), viewers and providers engage in interactive dynamics, where a larger provider population yields higher viewer utility and a larger viewer population yields higher provider utility. Despite the importance of such ``population effects'' on long-term platform health, recommendation policies do not generally take the participation dynamics into account. This paper thus studies the dynamics and policy design on two-sided platforms under population effects for the first time. Our control- and game-theoretic findings warn against the use of a myopic-greedy policy and shed light on the importance of provider-side considerations (i.e., effectively distributing exposure among provider groups) to improve social welfare via population growth. We also present a simple algorithm to optimize long-term social welfare by taking the population effects into account, and demonstrate its effectiveness in synthetic and real-data experiments.
Paperid:575
Authors:Kelsey Allen · Carl Doersch · Guangyao Zhou · Mohammed Suhail · Danny Driess · Ignacio Rocco · Yulia Rubanova · Thomas Kipf · Mehdi S. M. Sajjadi · Kevin Murphy · Joao Carreira · Sjoerd van Steenkiste
Abstract:
A current limitation of generative video models is that they produce plausible-looking frames but poor motion, an issue that is not well captured by FVD and other popular methods for evaluating generated videos. Here we go beyond FVD by developing a metric which better measures plausible object interactions and motion. Our novel approach is based on auto-encoding point tracks and yields motion features that can be used to compare distributions of videos (as few as one generated and one ground truth, or as many as two datasets), as well as reconstruction errors for evaluating the motion of single videos. We show that using point tracks instead of pixel reconstruction or action recognition features results in a metric which is markedly more sensitive to temporal distortions in synthetic data, and can predict human evaluations of temporal consistency and realism in generated videos obtained from open-source models better than a wide range of alternatives.
Paperid:576
Authors:Ye DU · Chen Yang · Nanxi Yu · Wanyu LIN · Qian Zhao · Shujun Wang
Abstract:
*De novo* peptide sequencing is a fundamental computational technique for ascertaining amino acid sequences of peptides directly from tandem mass spectrometry data, eliminating the need for reference databases. Cutting-edge models encode the observed mass spectra into latent representations from which peptides are predicted auto-regressively. However, the issue of missing fragmentation, attributable to factors such as suboptimal fragmentation efficiency and instrumental constraints, presents a formidable challenge in practical applications. To tackle this obstacle, we propose a novel computational paradigm called $\underline{\textbf{L}}$atent $\underline{\textbf{I}}$mputation before $\underline{\textbf{P}}$rediction (LIPNovo). LIPNovo is devised to compensate for missing fragmentation information within observed spectra before executing the final peptide prediction. Rather than generating raw missing data, LIPNovo performs imputation in the latent space, guided by the theoretical peak profile of the target peptide sequence. The imputation process is conceptualized as a set-prediction problem, utilizing a set of learnable peak queries to reason about the relationships among observed peaks and directly generate the latent representations of theoretical peaks through optimal bipartite matching. In this way, LIPNovo manages to supplement missing information during inference and thus boosts performance. Despite its simplicity, experiments on three benchmark datasets demonstrate that LIPNovo outperforms state-of-the-art methods by large margins. The code will be made publicly available.
Paperid:577
Authors:Konstantinos Vergopoulos · Mark Müller · Martin Vechev
Abstract:
Code agent development is an extremely active research area, where a reliable performance metric is critical for tracking progress and guiding new developments. This demand is underscored by the meteoric rise in popularity of SWE-Bench, a benchmark that challenges code agents to generate patches addressing GitHub issues given the full repository as context, and then evaluates their correctness by executing the human-written test suite extracted from the repository after the issue's resolution. However, constructing benchmarks like SWE-Bench requires substantial manual effort to set up historically accurate execution environments for testing. Crucially, this severely limits the number of considered repositories, e.g., just 12 for SWE-Bench. A danger of such a selection process is a distributional mismatch, i.e., the measured performance may not be representative of real-world scenarios, potentially misguiding development efforts. In this work, we address this challenge and introduce SetUpAgent, a fully automated system capable of historically accurate dependency setup, test execution, and result parsing. Using SetUpAgent, we generate two new datasets: (i) an extended version of SWE-Bench encompassing hundreds of repositories, and (ii) a benchmark centered on applications rather than libraries. Comparing these datasets to SWE-Bench with respect to their characteristics and code agent performance, we find significant distributional differences, including lower issue description quality and detail level, higher fix complexity, and, most importantly, up to 40% lower agent success rates.
Paperid:578
Authors:Yannis Montreuil · Shu Heng Yeo · Wei Tsang Ooi · Lai Xing Ng · Axel Carlier
Abstract:
The Two-Stage Learning-to-Defer framework has been extensively studied for classification and, more recently, regression tasks. However, many contemporary applications involve both classification and regression in an interdependent manner. In this work, we introduce a novel Two-Stage Learning-to-Defer framework for multi-task learning that jointly addresses these tasks. Our approach leverages a two-stage surrogate loss family, which we prove to be both ($\mathcal{G}, \mathcal{R}$)-consistent and Bayes-consistent, providing strong theoretical guarantees of convergence to the Bayes-optimal rejector. We establish consistency bounds explicitly linked to the cross-entropy surrogate family and the $L_1$-norm of the agents' costs, extending the theoretical minimizability gap analysis to the two-stage setting with multiple experts. We validate our framework on two challenging tasks: object detection, where classification and regression are tightly coupled and existing methods fail, and electronic health record analysis, in which we highlight the suboptimality of current learning-to-defer approaches.
Paperid:579
Authors:Da Xiao · Qingye Meng · Shengping Li · xingyuan yuan
Abstract:
We propose MUltiway Dynamic Dense (MUDD) connections, a simple yet effective method to address the limitations of residual connections and enhance cross-layer information flow in Transformers. Unlike existing dense connection approaches with static and shared connection weights, MUDD generates connection weights dynamically, depending on the hidden states at each sequence position and on each decoupled input stream (the query, key, value or residual) of a Transformer block. MUDD connections can be seamlessly integrated into any Transformer architecture to create MUDDFormer. Extensive experiments show that MUDDFormer significantly outperforms Transformers across various model architectures and scales in language modeling, matching the performance of Transformers trained with ~1.8x-2.4x more compute. Notably, MUDDPythia-2.8B matches Pythia-6.9B in pretraining perplexity and downstream tasks, and even rivals Pythia-12B in five-shot settings, while adding only 0.23% parameters and 0.4% computation. Code and pre-trained models in JAX and PyTorch will be released upon acceptance.
Paperid:580
Authors:Leonardo Hernandez Cano · Maxine Perroni-Scharf · Neil Dhir · Arun Ramamurthy · Armando Solar-Lezama
Abstract:
We present Structured World Modeling for Policy Optimization (SWMPO), a framework for unsupervised learning of neurosymbolic finite state machines (FSMs) that capture environmental structure for policy optimization. Traditional unsupervised world modeling methods rely on unstructured representations, such as neural networks, that do not explicitly represent high-level patterns within the system (e.g., patterns in the dynamics of regions such as \emph{water} and \emph{land}). Instead, SWMPO models the environment as an FSM, where each state corresponds to a specific region with distinct dynamics. This structured representation can then be leveraged for tasks like policy optimization. Previous works that synthesize FSMs for this purpose have been limited to discrete spaces, not continuous ones. In contrast, our proposed FSM synthesis algorithm operates in an unsupervised manner, leveraging low-level features from unprocessed, non-visual data, making it adaptable across various domains. The synthesized FSM models are expressive enough to be used in a model-based reinforcement learning scheme that leverages offline data to efficiently synthesize environment-specific world models. We demonstrate the advantages of SWMPO by benchmarking its environment modeling capabilities in simulated environments.
Paperid:581
Authors:JUNHAO HU · Wenrui Huang · Weidong Wang · Haoyi Wang · tiancheng hu · zhang qin · Hao Feng · Xusheng Chen · Yizhou Shan · Tao Xie
Abstract:
Large Language Models (LLMs) are critical for a wide range of applications, but serving them efficiently becomes increasingly challenging as inputs become more complex. Context caching improves serving performance by exploiting inter-request dependency and reusing key-value (KV) cache across requests, thus improving time-to-first-token (TTFT). However, existing prefix-based context caching requires exact token prefix matches, limiting cache reuse in few-shot learning, multi-document QA, or retrieval-augmented generation, where prefixes may vary while some content, like files, remains unchanged. In this paper, we present PICI, an LLM serving system that introduces position-independent context caching (PIC), enabling modular KV cache reuse regardless of prefix and token chunk position. PICI incorporates a LegoLink algorithm, which leverages static attention sparsity to eliminate the influence of the attention sink phenomenon in each chunk and minimize recomputation for accuracy recovery when using PIC. Our experiments demonstrate that PICI delivers up to 8× improvements in TTFT and 7× in throughput over existing systems, with negligible or no accuracy loss.
Paperid:582
Authors:Joongkyu Lee · Min-hwan Oh
Abstract:
In this paper, we consider combinatorial reinforcement learning with preference feedback, where a learning agent sequentially offers an action (an assortment of multiple items) to a user, whose preference feedback follows a multinomial logistic (MNL) model. This framework allows us to model real-world scenarios, particularly those involving long-term user engagement, such as recommender systems and online advertising. However, this framework faces two main challenges: (1) the unknown value of each item, unlike traditional MNL bandits that only address single-step preference feedback, and (2) the difficulty of ensuring optimism while maintaining tractable assortment selection in the combinatorial action space with unknown values. In this paper, we assume a contextual MNL preference model, where the mean utilities are linear, and the value of each item is approximated by a general function. We propose an algorithm, MNL-VQL, that addresses these challenges, making it both computationally and statistically efficient. As a special case, for linear MDPs (with MNL preference feedback), we establish the first regret lower bound in this framework and show that MNL-VQL achieves near-optimal regret. To the best of our knowledge, this is the first work to provide statistical guarantees in combinatorial RL with preference feedback.
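The MNL choice model at the core of this framework is simple to state in code. The sketch below is a generic MNL with an outside no-choice option of utility zero, a standard convention; the utilities are illustrative numbers, not values from the paper.

```python
import numpy as np

def mnl_choice_probs(utilities):
    """P(user chooses item i from the offered assortment) under the MNL model,
    with an outside 'no-choice' option whose utility is fixed at 0."""
    expu = np.exp(np.asarray(utilities, dtype=float))
    denom = 1.0 + expu.sum()   # the 1.0 is exp(0) for the no-choice option
    return expu / denom

# Three offered items with (illustrative) mean utilities.
p = mnl_choice_probs([1.0, 0.5, -0.2])
print(p, 1.0 - p.sum())  # item probabilities, then the no-choice probability
```

The probability of the no-choice option is the remaining mass, one minus the sum of the item probabilities, which is why the item probabilities sum to less than one.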
Paperid:583
Authors:Ganchao Wei · Li Ma
Abstract:
Flow matching (FM) is a family of training algorithms for fitting continuous normalizing flows (CNFs). Conditional flow matching (CFM) exploits the fact that the marginal vector field of a CNF can be learned by fitting a least-squares regression to the conditional vector field specified given one or both ends of the flow path. In this paper, we extend the CFM algorithm by defining conditional probability paths along "streams", instances of latent stochastic paths that connect source and target data pairs, which are modeled with Gaussian process (GP) distributions. The unique distributional properties of GPs help preserve the "simulation-free" nature of CFM training. We show that this generalization of CFM can effectively reduce the variance in the estimated marginal vector field at a moderate computational cost, thereby improving the quality of the generated samples under common metrics. Additionally, adopting the GP on the streams allows for flexibly linking multiple correlated training data points (e.g., time series). We empirically validate our claims through both simulations and applications to image and neural time series data.
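As a hedged illustration of the underlying CFM idea (plain CFM with a linear-interpolation conditional path and independent coupling, not the authors' GP-stream extension), the conditional vector field can be fit by least-squares regression and then integrated to transport samples; a toy linear-in-features model stands in for the usual neural network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-d source (standard normal) and target (shifted normal) samples.
x0 = rng.normal(0.0, 1.0, size=1000)   # source endpoint samples
x1 = rng.normal(4.0, 0.5, size=1000)   # target endpoint samples

# Conditional path: x_t = (1 - t) * x0 + t * x1,
# with conditional vector field u = x1 - x0 as the regression target.
t = rng.uniform(0.0, 1.0, size=1000)
xt = (1.0 - t) * x0 + t * x1
u = x1 - x0

# Least-squares fit of a linear-in-features field v(x, t) = w . [x, t, 1].
A = np.stack([xt, t, np.ones_like(t)], axis=1)
w, *_ = np.linalg.lstsq(A, u, rcond=None)

# Euler-integrate the learned field to transport fresh source samples.
x = rng.normal(0.0, 1.0, size=1000)
for step in range(100):
    s = step / 100
    x = x + 0.01 * (w[0] * x + w[1] * s + w[2])

print(round(float(x.mean()), 2))  # should land near the target mean of 4
```

The paper's variance-reduction point is about this regression: averaging over stochastic "streams" instead of straight pairwise lines lowers the variance of the fitted marginal field.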
Paperid:584
Authors:Yuxin Dong · Haoran Guo · Tieliang Gong · Wen Wen · Chen Li
Abstract:
Information-theoretic bounds, while achieving significant success in analyzing the generalization of randomized learning algorithms, have been criticized for their slow convergence rates and overestimation. This paper presents novel bounds that bridge the expected empirical and population risks through a binarized variant of the Jensen-Shannon divergence. Leveraging our foundational lemma that characterizes the interaction between an arbitrary and a binary variable, we derive hypothesis-based bounds that enhance existing conditional mutual information bounds by reducing the number of conditioned samples from $2$ to $1$. We additionally establish prediction-based bounds that surpass prior bounds based on evaluated loss mutual information measures. Thereafter, through a new binarization technique for the evaluated loss variables, we obtain exactly tight generalization bounds broadly applicable to general randomized learning algorithms for any bounded loss functions. Our results effectively address key limitations of previous results in analyzing certain stochastic convex optimization problems, without requiring additional stability or compressibility assumptions about the learning algorithm.
Paperid:585
Authors:Xiaojing Du · Jiuyong Li · Debo Cheng · Lin Liu · Wentao Gao · XIONGREN CHEN · Ziqi Xu
Abstract:
Estimating causal effects is crucial for decision-makers in many applications, but it is particularly challenging with observational network data due to peer interactions. Some algorithms have been proposed to estimate causal effects involving network data, particularly peer effects, but they often fail to tell apart diverse peer effects. To address this issue, we propose a general setting which considers both peer direct effects and peer indirect effects, as well as the effect of an individual's own treatment, and provide the identification conditions of these causal effects along with their proofs. To differentiate these effects, we leverage causal mediation analysis and tailor it specifically for network data. Furthermore, given the inherent challenges of accurately estimating effects in networked environments, we propose to incorporate attention mechanisms to capture the varying influences of different neighbors and to explore high-order neighbor effects using multi-layer graph neural networks (GNNs). Additionally, we employ the Hilbert-Schmidt Independence Criterion (HSIC) to further enhance the model's robustness and accuracy. Extensive experiments on two semi-synthetic datasets based on real-world networks, as well as on recommendation system data, confirm the effectiveness of our approach. Our theoretical findings have the potential to improve intervention strategies in networked systems, particularly in social networks and public health.
Paperid:586
Authors:Yue Dai · Liang Liu · Xulong Tang · Youtao Zhang · Jun Yang
Abstract:
Temporal graph neural networks (TGNNs) have achieved significant momentum in many real-world dynamic graph tasks. While most existing TGNN attack methods assume worst-case scenarios where attackers have complete knowledge of the input graph, this assumption may not always hold in real-world situations, where attackers can, at best, access information about existing nodes and edges but not future ones after the attack. However, studying adversarial attacks under these constraints is crucial, as limited future knowledge can reveal TGNN vulnerabilities overlooked in idealized settings. Nevertheless, designing effective attacks in such scenarios is challenging: the evolving graph can weaken their impact and make it hard to affect unseen nodes. To address these challenges, we introduce MemFreezing, a novel adversarial attack framework that delivers long-lasting and spreading disruptions in TGNNs without requiring post-attack knowledge of the graph. MemFreezing strategically injects fake nodes or edges to push node memories into a stable "frozen state," reducing their responsiveness to subsequent graph changes and limiting their ability to convey meaningful information. As the graph evolves, the affected nodes maintain and propagate their frozen state through their neighbors. Experimental results show that MemFreezing persistently degrades TGNN performance across various tasks, offering a more enduring adversarial strategy under limited future knowledge.
Paperid:587
Authors:Rudrajit Das · Inderjit Dhillon · Alessandro Epasto · Adel Javanmard · Jieming Mao · Vahab Mirrokni · Sujay Sanghavi · Peilin Zhong
Abstract:
The performance of a model trained with noisy labels is often improved by simply *retraining* the model with its *own predicted hard labels* (i.e., $1$/$0$ labels). Yet, a detailed theoretical characterization of this phenomenon is lacking. In this paper, we theoretically analyze retraining in a linearly separable setting with randomly corrupted labels given to us and prove that retraining can improve the population accuracy obtained by initially training with the given (noisy) labels. To the best of our knowledge, this is the first such theoretical result. Retraining finds application in improving training with local label differential privacy (DP) which involves training with noisy labels. We empirically show that retraining selectively on the samples for which the predicted label matches the given label significantly improves label DP training at no extra privacy cost; we call this consensus-based retraining. As an example, when training ResNet-18 on CIFAR-100 with $\epsilon=3$ label DP, we obtain more than $6$% improvement in accuracy with consensus-based retraining.
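A minimal sketch of the consensus-based retraining recipe described above, using a plain logistic-regression model on synthetic data (the dataset, model, and 20% flip rate are illustrative assumptions, not the paper's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logreg(X, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression; returns the weight vector."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

# Two Gaussian classes; 20% of the given training labels are randomly flipped.
n = 4000
y_true = rng.integers(0, 2, n)
X = rng.normal(0, 1, (n, 2)) + np.where(y_true[:, None] == 1, 1.0, -1.0)
X = np.hstack([X, np.ones((n, 1))])        # bias feature
flip = rng.random(n) < 0.2
y_noisy = np.where(flip, 1 - y_true, y_true)

# Step 1: train on the given (noisy) labels, then predict hard labels.
w = fit_logreg(X, y_noisy)
y_hat = (X @ w > 0).astype(int)

# Step 2 (consensus-based retraining): keep only the samples where the
# predicted hard label agrees with the given label, and retrain on them.
keep = y_hat == y_noisy
w2 = fit_logreg(X[keep], y_noisy[keep])

clean_frac_all = float((y_noisy == y_true).mean())
clean_frac_kept = float((y_noisy[keep] == y_true[keep]).mean())
print(round(clean_frac_all, 2), round(clean_frac_kept, 2))
```

Filtering on agreement between the predicted hard label and the given label yields a markedly cleaner subset, which is the intuition for why the second round of training helps.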
Paperid:588
Authors:Akira Ito · Masanori Yamada · Atsutoshi Kumagai
Abstract:
Ainsworth et al. empirically demonstrated that linear mode connectivity (LMC) can be achieved between two independently trained neural networks (NNs) by applying an appropriate parameter permutation. LMC is satisfied if a linear path with non-increasing test loss exists between the models, suggesting that NNs trained with stochastic gradient descent (SGD) converge to a single approximately convex low-loss basin under permutation symmetries. However, Ainsworth et al. verified LMC for two models and provided only limited discussion on its extension to multiple models. In this paper, we conduct a more detailed empirical analysis. First, we show that existing permutation search methods designed for two models can fail to transfer multiple models into the same convex low-loss basin. Next, we propose a permutation search method using a straight-through estimator for multiple models (STE-MM). We then experimentally demonstrate that even when multiple models are given, the test loss of the merged model remains nearly the same as the losses of the original models when using STE-MM, and the loss barriers between all permuted model pairs are also small. Additionally, from the perspective of the trace of the Hessian matrix, we show that the loss sharpness around the merged model decreases as the number of models increases with STE-MM, indicating that LMC for multiple models is more likely to hold.
Paperid:589
Authors:Pengyi Li · Jianye Hao · Hongyao Tang · Yifu Yuan · Jinbin Qiao · Zibin Dong · Yan Zheng
Abstract:
Reward functions are crucial for policy learning. Large language models (LLMs), with valuable task-related knowledge, provide an automated solution for high-quality reward design. However, reward function design is often inefficient and inaccurate. To solve these problems, we propose an efficient automated reward design framework, called R*. R* consists of two main processes: reward structure evolution and parameter alignment optimization. To design high-quality reward structures, R* maintains a reward function population and modularizes the functional components. We employ LLMs as the mutation operator and propose module crossover for efficient exploration and exploitation. To design more accurate reward parameters, R* first leverages LLMs to generate multiple critic functions for trajectory comparison and annotation. A voting mechanism is then employed to collect the trajectory segments with highly confident labels. Subsequently, preference learning is applied to refine the reward function parameters based on the labeled data. Experiments on diverse robotic control tasks demonstrate that R* outperforms strong baselines in both reward design efficiency and quality, surpassing human-designed reward functions.
Paperid:590
Authors:Michael Zhang · Zhilin Wang · Jena Hwang · Yi Dong · Olivier Delalleau · Yejin Choi · Eunsol Choi · Xiang Ren · Valentina Pyatkin
Abstract:
We examine diverging preferences in human-labeled preference datasets. We develop a taxonomy of disagreement sources spanning ten categories across four high-level classes, and find that the majority of disagreements are due to factors such as task underspecification or response style. Our findings run counter to standard reward modeling approaches, which are designed under the assumption that annotator disagreement is noise. We then explore how these findings impact two areas of LLM development: reward modeling and evaluation. In our experiments, we demonstrate how standard reward modeling methods, like the Bradley-Terry model, and LLM-as-Judge evaluation methods fail to differentiate whether a given preference judgment is the result of unanimous agreement among annotators or the majority opinion among diverging user preferences. These findings highlight remaining challenges in LLM evaluations, which are greatly influenced by divisive features like response style, and in developing pluralistically aligned LLMs. To address these issues, we develop methods for identifying diverging preferences to mitigate their influence in evaluations and during LLM training.
Paperid:591
Authors:Yilun Kong · Guozheng Ma · Qi Zhao · Haoyu Wang · Li Shen · Xueqian Wang · Dacheng Tao
Abstract:
Although recent advancements in offline multi-task reinforcement learning (MTRL) have harnessed the powerful capabilities of the Transformer architecture, most approaches focus on a limited number of tasks, and scaling to extremely massive task sets remains a formidable challenge. In this paper, we first revisit the key impact of task number on current MTRL methods, and further reveal that naively expanding the parameters proves insufficient to counteract the performance degradation as the number of tasks escalates. Building upon these insights, we propose M3DT, a novel mixture-of-experts (MoE) framework that tackles task scalability by further unlocking the model's parameter scalability. Specifically, we enhance both the architecture and the optimization of the agent: we strengthen the Decision Transformer (DT) backbone with MoE to reduce the task load on parameter subsets, and introduce a three-stage training mechanism to facilitate efficient training with optimal performance. Experimental results show that, by increasing the number of experts, M3DT not only consistently enhances its performance as the model expands on a fixed number of tasks, but also exhibits remarkable task scalability, successfully extending to 160 tasks with superior performance.
Paperid:592
Authors:Nguyen Nhat Minh To · Paul Wilson · Viet Nguyen · Mohamed Harmanani · Michael Cooper · Fahimeh Fooladgar · Purang Abolmaesumi · Parvin Mousavi · Rahul G. Krishnan
Abstract:
Subpopulation shifts, characterized by disparities in subpopulation distributions between training and target datasets, can significantly degrade the performance of machine learning models. Current solutions to subpopulation shifts often modify empirical risk minimization with reweighting strategies to improve generalization across subpopulations. This strategy often relies on assumptions about the number and nature of subpopulations and on annotations of subpopulation membership, which are unavailable for many real-world datasets. We propose a new solution that heuristically explores the feature subspace: we train many classifiers while enforcing diversification to promote the discovery and correct classification of new subpopulations, without requiring prior knowledge of the subpopulations. Given a feature extractor network, we replace its standard linear layer with a mixture of prototypical classifiers, where each member is trained to classify the data while focusing on different features and samples than the other members. We demonstrate that our solution outperforms the prior state-of-the-art in worst-group accuracy on most benchmarks, using empirical evaluations across nine real-world datasets covering diverse domains and subpopulation shift types (code is available at https://anonymous.4open.science/r/prototypical_ensembles-BCB3).
Paperid:593
Authors:Walter McKelvie · Jasper C.H. Lee · Maoyuan Song · Paul Valiant
Abstract:
We consider the basic statistical challenge of designing an ''all-purpose'' mean estimation algorithm that is recommendable across a variety of settings and models. Recent work by \citet{Lee:2022} introduced the first 1-d mean estimator whose error in the standard finite-variance + i.i.d. setting is optimal even in its constant factors; experimental demonstration of its good performance was shown by \citet{Gobet:2022}. Yet, unlike for classic (but not necessarily practical) estimators such as median-of-means and trimmed mean, this new algorithm lacked proven robustness guarantees in other settings, including the settings of adversarial data corruption and heavy-tailed distributions with infinite variance. Such robustness is important for practical use cases. This raises an important research question: is it possible to have a mean estimator that is robust, *without* sacrificing provably optimal performance in the standard i.i.d. setting? In this work, we show that Lee and Valiant's estimator is in fact an ''all-purpose'' mean estimator by proving:
- It is robust to an $\eta$-fraction of data corruption, even in the strong contamination model; it has optimal estimation error $O(\sigma\sqrt{\eta})$ for distributions with variance $\sigma^2$.
- For distributions with finite $z^\text{th}$ moment, for $z \in (1,2)$, it has optimal estimation error, matching the lower bounds of \citet{Devroye:2016} up to constants.
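As a point of reference, one of the classic robust estimators named in the abstract (not the Lee–Valiant estimator the paper analyzes) can be stated in a few lines. A minimal median-of-means sketch: split the data into k blocks, average each block, and take the median of the block means, so a few corrupted points can spoil at most a few blocks.

```python
import numpy as np

def median_of_means(x, k):
    """Classic robust 1-d mean estimator: shuffle, split into k blocks,
    average each block, and return the median of the block means."""
    x = np.random.default_rng(0).permutation(np.asarray(x, dtype=float))
    blocks = np.array_split(x, k)
    return float(np.median([b.mean() for b in blocks]))
```

On a sample of 99 zeros plus one outlier of 10^6, the sample mean is 10^4, while median-of-means with 10 blocks returns 0, since the outlier lands in a single block whose mean is then discarded by the median.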
Paperid:594
Authors:Gregoire Fournier · Sourav Medya
Abstract:
Graph neural networks (GNNs) have been widely used in various domains such as social networks, molecular biology, or recommendation systems. Concurrently, different explanation methods for GNNs have arisen to complement their black-box nature. Explanations of the GNNs’ predictions can be categorized into two types—factual and counterfactual. Given a GNN trained on binary classification into “accept” and “reject” classes, a global counterfactual explanation consists of generating a small set of “accept” graphs relevant to all of the input “reject” graphs. The transformation of a “reject” graph into an “accept” graph is called a recourse. A common recourse explanation is a small set of recourses from which every “reject” graph can be turned into an “accept” graph. Although local counterfactual explanations have been studied extensively, the problem of finding common recourses for global counterfactual explanation remains unexplored, particularly for GNNs. In this paper, we formalize the common recourse explanation problem and design an effective algorithm, COMRECGC, to solve it. We benchmark our algorithm against strong baselines on four different real-world graph datasets and demonstrate the superior performance of COMRECGC against the competitors. We also compare the common recourse explanations to the graph counterfactual explanations, showing that common recourse explanations are either comparable or superior, making them worth considering for applications such as drug discovery or computational biology.
Paperid:595
Authors:Hamza Cherkaoui · Merwan Barlier · Igor Colin
Abstract:
The multi-agent linear bandit setting is a well-known setting for which designing efficient collaboration between agents remains challenging. This paper studies the impact of data sharing among agents on regret minimization. Unlike most existing approaches, our contribution does not rely on any assumptions on the structure of the bandit parameters. Our main result formalizes the trade-off between the bias and uncertainty of the bandit parameter estimation for efficient collaboration. This result is the cornerstone of the Bandit Adaptive Sample Sharing (BASS) algorithm, whose efficiency over the current state-of-the-art is validated through both theoretical analysis and empirical evaluations on synthetic and real-world datasets. Furthermore, we demonstrate that, when agents' parameters display a cluster structure, our algorithm accurately recovers them.
Paperid:596
Authors:Hyeongwon Jang · Changhun Kim · Eunho Yang
Abstract:
Recent explainable AI (XAI) methods for time series primarily estimate pointwise attribution magnitudes, while overlooking the directional (positive/negative) impact on model predictions, leading to suboptimal identification of significant points. With regard to this, our analysis demonstrates that conventional Integrated Gradients (IG) effectively capture critical points that have both positive and negative impacts on model predictions. However, existing evaluation metrics fail to adequately recognize this capability. To address this limitation, we first propose novel evaluation metrics, Cumulative Prediction Difference (CPD) and Cumulative Prediction Preservation (CPP), designed to rigorously assess whether attribution methods accurately identify significant positive and negative points in time series XAI. Under these metrics, we discover that conventional IG outperforms its recent counterparts. However, directly applying IG to time series data may lead to suboptimal outcomes, as the generated paths do not take temporal relationships into account and introduce out-of-distribution samples. To overcome these challenges, we introduce TIMING, which enhances IG by incorporating temporal awareness while maintaining its theoretical properties. Extensive experiments on synthetic and real-world time series benchmarks demonstrate that TIMING outperforms existing time series XAI baselines.
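For context, vanilla Integrated Gradients averages gradients along a straight-line path from a baseline to the input; the abstract's critique is precisely that this path ignores temporal structure. A minimal sketch of the standard method (not the paper's temporally-aware variant), assuming a closed-form gradient function is available:

```python
import numpy as np

def integrated_gradients(f_grad, x, baseline, steps=256):
    """Straight-line Integrated Gradients via a midpoint Riemann sum.
    f_grad(z) must return the gradient of the model output at point z."""
    alphas = (np.arange(steps) + 0.5) / steps           # midpoints of [0, 1]
    path = baseline + alphas[:, None] * (x - baseline)  # points on the path
    grads = np.stack([f_grad(z) for z in path])
    return (x - baseline) * grads.mean(axis=0)          # (x - x') * avg grad
```

For f(z) = Σ z_i² with a zero baseline, the attribution of x_i is x_i², and the attributions sum to f(x) − f(baseline), the completeness property IG is known for.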
Paperid:597
Authors:Saaketh Narayan · Abhay Gupta · Mansheej Paul · Davis Blalock
Abstract:
Large Language Model training with 8-bit floating point (FP8) formats promises significant efficiency improvements, but reduced numerical precision makes training challenging. It is currently possible to train in FP8 only if one is willing to tune various hyperparameters, reduce model scale, or accept the overhead of computing dynamic scale factors. We demonstrate simple, scalable FP8 training that requires no dynamic scaling factors or special hyperparameters, even at large model sizes. Our method, µnit Scaling (µS), enables simple hyperparameter transfer across model widths, matched numerics across training and inference, and other desirable properties. In particular, µnit Scaling is straightforward to implement, consisting of a set of minimal interventions based on a first-principles analysis of common transformer operations. We validate our method by successfully training models from 1B to 13B parameters, performing all hidden linear layer computations in FP8. We achieve quality equal to higher precision baselines while also training up to 33% faster.
Paperid:598
Authors:Xuesong Wang · He Zhao · Edwin V. Bonilla
Abstract:
Neural Processes (NPs) are deep probabilistic models that represent stochastic processes by conditioning their prior distributions on a set of context points. Despite their advantages in uncertainty estimation for complex distributions, NPs enforce parameterization coupling between the conditional prior model and the posterior model. We show that this coupling amounts to prior misspecification and revisit the NP objective to address this issue. More specifically, we propose Rényi Neural Processes (RNP), a method that replaces the standard KL divergence with the Rényi divergence, dampening the effects of the misspecified prior during posterior updates. We validate our approach across multiple benchmarks including regression and image inpainting tasks, and show significant performance improvements of RNPs in real-world problems. Our extensive experiments show consistently better log-likelihoods over state-of-the-art NP models.
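RNP applies the Rényi divergence inside the NP variational objective; as a self-contained illustration of the divergence itself (not the paper's objective), the closed form between two univariate Gaussians shows how the order α interpolates toward the KL divergence used by standard NPs as α → 1:

```python
import numpy as np

def renyi_gauss(mu1, s1, mu2, s2, alpha):
    """Rényi divergence D_alpha(N(mu1, s1^2) || N(mu2, s2^2)), closed form;
    requires (1 - alpha) * s1^2 + alpha * s2^2 > 0 and alpha != 1."""
    var_a = (1 - alpha) * s1**2 + alpha * s2**2
    return (np.log(s2 / s1)
            + np.log(s2**2 / var_a) / (2 * (alpha - 1))
            + alpha * (mu1 - mu2)**2 / (2 * var_a))

def kl_gauss(mu1, s1, mu2, s2):
    """KL divergence between the same pair of Gaussians (the alpha -> 1 limit)."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5
```

The Rényi divergence is non-decreasing in α, so orders α < 1 penalize prior mismatch more gently than the KL, which is the intuition behind dampening a misspecified prior.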
Paperid:599
Authors:Xunkai Li · Zhengyu Wu · Kaichi Yu · Hongchao Qin · Guang Zeng · Rong-Hua Li · Guoren Wang
Abstract:
The directed graph (digraph), as a generalization of undirected graphs, exhibits superior representation capability in modeling complex topology systems and has garnered considerable attention in recent years. Despite the notable efforts made by existing DiGraph Neural Networks (DiGNNs) to leverage directed edges, they still fail to comprehensively delve into the abundant data knowledge concealed in the digraphs. This data-level limitation results in model-level sub-optimal predictive performance and underscores the necessity of further exploring the potential correlations between the directed edges (topology) and node profiles (feature and labels) from a data-centric perspective, thereby empowering model-centric neural networks with stronger encoding capabilities. In this paper, we propose \textbf{E}ntropy-driven \textbf{D}igraph knowl\textbf{E}dge distillatio\textbf{N} (EDEN), which can serve as a data-centric digraph learning paradigm or a model-agnostic hot-plug data-centric Knowledge Distillation (KD) module. The core idea is to achieve data-centric ML, guided by our proposed hierarchical encoding theory for structured data. Specifically, EDEN first utilizes directed structural measurements from a topology perspective to construct a coarse-grained Hierarchical Knowledge Tree (HKT). Subsequently, EDEN quantifies the mutual information of node profiles to refine knowledge flow in the HKT, enabling data-centric KD supervision within model training. As a general framework, EDEN can also naturally extend to undirected scenarios and demonstrate satisfactory performance. In our experiments, EDEN has been widely evaluated on 14 (di)graph datasets (homophily and heterophily) and across 4 downstream tasks. The results demonstrate that EDEN attains SOTA performance and exhibits strong improvement for prevalent (Di)GNNs.
Paperid:600
Authors:Guner Dilsad ER · Sebastian Trimpe · Michael Muehlebach
Abstract:
We consider a distributed learning problem, where agents minimize a global objective function by exchanging information over a network. Our approach has two distinct features: (i) it substantially reduces communication by triggering communication only when necessary, and (ii) it is agnostic to the data distribution among the different agents. We can therefore guarantee convergence even if the local data distributions of the agents are arbitrarily distinct. We analyze the convergence rate of the algorithm both in convex and nonconvex settings and derive accelerated convergence rates in a convex setting. We also characterize the effect of communication failures and demonstrate that our algorithm is robust to them. The article concludes by presenting numerical results from distributed learning tasks on the MNIST and CIFAR-10 datasets. The experiments underline communication savings of 35% or more due to the event-based communication strategy, show resilience towards heterogeneous data distributions, and highlight that our approach outperforms common baselines such as FedAvg, FedProx, SCAFFOLD and FedADMM.
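The core idea of event-based communication can be sketched with a toy, single-agent drift rule: share the iterate only when it has moved far enough since the last broadcast. This is a hypothetical trigger for illustration, not the paper's exact criterion, and the quadratic objective and threshold are assumptions:

```python
def run_event_triggered(grad_fn, x0, lr=0.1, rounds=100, threshold=0.05):
    """Toy event-based scheme: take local gradient steps and broadcast the
    iterate only when it has drifted past `threshold` since the last share."""
    x, last_shared, broadcasts = x0, x0, 0
    for _ in range(rounds):
        x = x - lr * grad_fn(x)
        if abs(x - last_shared) > threshold:  # event trigger fires
            last_shared = x
            broadcasts += 1
    return x, broadcasts
```

On the quadratic objective (z − 3)²/2 starting from 0, the iterate still converges to the minimizer, but the number of broadcasts is bounded by the total path length divided by the threshold, far fewer than the number of optimization rounds.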
Paperid:601
Authors:Taehyun Cho · Seokhun Ju · Seungyub Han · Dohyeong Kim · Kyungjae Lee · Jungwoo Lee
Abstract:
To design rewards that align with human goals, Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent technique for learning reward functions from human preferences and optimizing models using reinforcement learning algorithms. However, existing RLHF methods often misinterpret trajectories as being generated by an optimal policy, causing inaccurate likelihood estimation and suboptimal learning. To address this, we propose Policy-labeled Preference Learning (PPL) within the Direct Preference Optimization (DPO) framework, which resolves these likelihood mismatch problems by modeling human preferences with regret, reflecting the efficiency of executed policies. Additionally, we introduce a contrastive KL regularization term derived from regret-based principles to enhance sequential contrastive learning. Experiments in high-dimensional continuous control environments demonstrate PPL's significant improvements in offline RLHF performance and its effectiveness in online settings.
Paperid:602
Authors:Chansophea Wathanak In · Yi Li · David Woodruff · Xuan Wu
Abstract:
Robustness to outliers is important in machine learning. Many classical problems, including subspace embedding, clustering, and low-rank approximation, lack scalable, outlier-resilient algorithms. This paper considers machine learning problems of the form $\min_{x\in \mathbb{R}^d} F(x)$, where $F(x)=\sum_{i=1}^n F_i(x)$, and their robust counterparts $\min_{x\in\mathbb{R}^d} F^{(m)}(x)$, where $F^{(m)}(x)$ denotes the sum of all but the $m$ largest $F_i(x)$ values. We develop a general framework for constructing $\epsilon$-coresets for such robust problems, where an $\epsilon$-coreset is a weighted subset of functions $\{F_1(x),\dots,F_n(x)\}$ that provides a $(1+\epsilon)$-approximation to $F(x)$. Specifically, if the original problem $F$ has total sensitivity $T$ and admits a vanilla $\epsilon$-coreset of size $S$, our algorithm constructs an $\epsilon$-coreset of size $\tilde{O}(\frac{mT}{\epsilon})+S$ for the robust objective $F^{(m)}$. This coreset size can be shown to be near-tight for $\ell_2$ subspace embedding. Our coreset algorithm has scalable running time and leads to new or improved algorithms for the robust optimization problems. Empirical evaluations demonstrate that our coresets outperform uniform sampling on real-world data sets.
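The robust objective F^(m) defined above is simple to evaluate directly (the paper's contribution is the coreset construction for approximating it at scale, not this evaluation). A two-line sketch:

```python
import numpy as np

def robust_objective(losses, m):
    """F^(m)(x): the sum of all but the m largest per-function losses
    F_i(x), i.e. the m worst points are treated as outliers and dropped."""
    losses = np.sort(np.asarray(losses, dtype=float))
    return float(losses[:len(losses) - m].sum())
```

With per-point losses [1, 2, 3, 100], dropping the single largest value (m = 1) yields 6, while m = 0 recovers the vanilla objective F(x) = 106.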
Paperid:603
Authors:Yuyang Wang · Anurag Ranjan · Joshua M Susskind · Miguel Angel Bautista Martin
Abstract:
Flow Matching models have emerged as a powerful method for generative modeling on domains like images or videos, and even on irregular or unstructured data like 3D point clouds or protein structures. These models are commonly trained in two stages: first, a data compressor is trained, and in a subsequent training stage a flow matching generative model is trained in the latent space of the data compressor. This two-stage paradigm sets obstacles for unifying models across data domains, as hand-crafted compressor architectures are used for different data modalities. To this end, we introduce INRFlow, a domain-agnostic approach to learning flow matching transformers directly in ambient space. Drawing inspiration from INRs, we introduce a conditionally independent point-wise training objective that enables INRFlow to make predictions continuously in coordinate space. Our empirical results demonstrate that INRFlow effectively handles different data modalities such as images, 3D point clouds and protein structure data, achieving strong performance in all domains and outperforming comparable approaches. INRFlow is a promising step towards domain-agnostic flow matching generative models that can be trivially adopted in different data domains.
Paperid:604
Authors:Sabri Eyuboglu · Avanika Narayan · Dan Biderman · Avner May · Scott Linderman · James Zou · Christopher Re
Abstract:
We investigate an emerging setup in which a small, on-device language model (LM) with access to local data collaborates with a frontier, cloud-hosted LM to solve real-world tasks involving financial, medical, and scientific reasoning over long documents. Can a local-remote collaboration reduce cloud inference costs while preserving quality? First, we consider a naïve communication protocol where the local and remote models simply chat back and forth. Because only the local model reads the full context, this protocol achieves a 30.4× reduction in remote costs, but fails to recover the performance of the frontier model. We identify two key limitations of this protocol: the local model struggles to (1) follow the remote model's multi-step instructions and (2) reason over long contexts. Motivated by these observations, we study an extension of this protocol, coined Minions, in which the remote model decomposes the task into easier subtasks over shorter chunks of the document, which are executed in parallel locally. Minions reduces costs by 5.7× on average while recovering 97.9% of the performance of the remote model alone. Our analysis reveals several key design choices that influence the trade-off between cost and performance in local-remote systems.
Paperid:605
Authors:Tommaso Mencattini · Adrian Robert Minut · Donato Crisostomi · Andrea Santilli · Emanuele Rodola
Abstract:
Evolutionary model merging enables the creation of high-performing multi-task models but remains computationally prohibitive for consumer hardware. We introduce MERGE$^3$, an efficient framework that makes evolutionary merging feasible on a single GPU by reducing fitness computation costs by 50× while preserving performance. MERGE$^3$ achieves this by **E**xtracting a reduced dataset for evaluation, **E**stimating model abilities using Item Response Theory (IRT), and **E**volving optimal merges via IRT-based performance estimators. Our method enables state-of-the-art multilingual and cross-lingual merging, transferring knowledge across languages with significantly lower computational overhead. We provide theoretical guarantees and an open-source library, democratizing high-quality model merging.
Paperid:606
Authors:Saksham Rastogi · Pratyush Maini · Danish Pruthi
Abstract:
Given how large parts of the publicly available text are crawled to pretrain large language models (LLMs), creators increasingly worry about the inclusion of their proprietary data for model training without attribution or licensing. Their concerns are also shared by benchmark curators whose test sets might be compromised. In this paper, we present STAMP, a framework for detecting dataset membership—i.e., determining the inclusion of a dataset in the pretraining corpora of LLMs. Given an original piece of content, our proposal involves generating multiple watermarked rephrases such that a distinct watermark is embedded in each rephrasing. One version is released publicly while others are kept private. Subsequently, creators can compare model likelihoods between public and private versions using paired statistical tests to prove membership. We show that our framework can successfully detect contamination across four benchmarks which appear only once in the training data and constitute less than 0.001% of the total tokens, outperforming several contamination detection and dataset inference baselines. We verify that our approach preserves both the semantic meaning and the utility of benchmarks in comparing different models. We apply STAMP to two real-world scenarios to confirm the inclusion of paper abstracts and blog articles in the pretraining corpora.
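The paired-test idea can be illustrated generically; STAMP's exact statistic may differ, and the inputs here (per-document log-likelihoods for the public and private rephrases) are hypothetical. Under the null of no membership, the public/private labels are exchangeable within each pair, which a sign-flip permutation test exploits:

```python
import numpy as np

def paired_permutation_pvalue(public_ll, private_ll, n_perm=10000, seed=0):
    """One-sided sign-flip permutation test on paired log-likelihoods.
    A small p-value means the model prefers the public (released) versions,
    which is evidence they appeared in its pretraining corpus."""
    rng = np.random.default_rng(seed)
    d = np.asarray(public_ll) - np.asarray(private_ll)
    observed = d.mean()
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = (signs * d).mean(axis=1)  # null distribution of the mean difference
    return float((null >= observed).mean())
```

When every public version is consistently more likely than its private counterpart, the p-value is tiny; when the differences are symmetric around zero, the test does not reject.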
Paperid:607
Authors:Huadai Liu · Tianyi Luo · Qikai Jiang · Kaicheng Luo · Peiwen Sun · Jialei Wang · Rongjie Huang · Qian Chen · Wen Wang · Xiangtai Li · ShiLiang Zhang · Zhijie Yan · Zhou Zhao · Wei Xue
Abstract:
Traditional video-to-audio generation techniques primarily focus on field-of-view (FoV) video and non-spatial audio, often missing the spatial cues necessary for accurately representing sound sources in 3D environments. To address this limitation, we introduce a novel task, \textbf{360V2SA}, to generate spatial audio from 360-degree videos, specifically producing First-order Ambisonics (FOA) audio—a standard format for representing 3D spatial audio that captures sound directionality and enables realistic 3D audio reproduction. We first create \textbf{Sphere360}, a novel dataset tailored for this task that is curated from real-world data. We also design an efficient semi-automated pipeline for collecting and cleaning paired video-audio data. To generate spatial audio from 360-degree video, we propose a novel framework \textbf{AudioSpace}, which leverages self-supervised pre-training using both spatial audio data (in FOA format) and large-scale non-spatial data. Furthermore, AudioSpace features a dual-branch framework that utilizes both panoramic and FoV video inputs to capture comprehensive local and global information from 360-degree videos. Experimental results demonstrate that AudioSpace achieves state-of-the-art performance across both objective and subjective metrics on Sphere360. Code and datasets will be released. The demo page is available at~\url{https://AudioSpace-360V2SA.github.io}.
Paperid:608
Authors:Qi Xu · Junyang Zhu · Dongdong Zhou · Hao Chen · Yang Liu · Jiangrong Shen · Qiang Zhang
Abstract:
Deep neural networks (DNNs) excel in computer vision tasks, especially few-shot learning (FSL), which is increasingly important for generalizing from limited examples. However, DNNs are computationally expensive and face scalability issues in the real world. Spiking Neural Networks (SNNs), with their event-driven nature and low energy consumption, are particularly efficient in processing sparse and dynamic data, though they still encounter difficulties in capturing complex spatiotemporal features and performing accurate cross-class comparisons. To further enhance the performance and efficiency of SNNs in few-shot learning, we propose a few-shot learning framework based on SNNs, which combines a self-feature extractor module and a cross-feature contrastive module to refine feature representation and reduce power consumption. We apply the combination of temporal efficient training loss and InfoNCE loss to optimize the temporal dynamics of spike trains and enhance the discriminative power. Experimental results show that the proposed FSL-SNN significantly improves the classification performance on the neuromorphic dataset N-Omniglot, and also achieves performance competitive with ANNs on static datasets such as CUB and miniImageNet with low power consumption.
Paperid:609
Authors:Zhiheng Zhang · Haoxiang Wang · Haoxuan Li · Zhouchen Lin
Abstract:
The estimation of the treatment effect is currently receiving a great deal of attention. Efforts have been made to recover asymptotic properties of the treatment effect from large samples. In practice, however, we often do not have access to the potential outcome of all samples; instead, we can only sample a fraction of the entire population for causal effect detection. In this paper, we consider the problem of treatment effect estimation in the sample-constrained case. We propose a method to select samples for treatment effect estimation with linear sample complexity with respect to $d$.
Paperid:610
Authors:Guangtao Zheng · Wenqian Ye · Aidong Zhang
Abstract:
Deep neural networks often develop spurious bias, i.e., reliance on correlations between non-essential features and classes for predictions. For example, a model may identify objects based on frequently co-occurring backgrounds rather than intrinsic features, resulting in degraded performance on data lacking these correlations. Existing mitigation approaches typically depend on external annotations of spurious correlations, which may be difficult to obtain and may not be relevant to the spurious bias in a model. In this paper, we take a step towards self-guided mitigation of spurious bias by proposing NeuronTune, a post hoc method that directly intervenes in a model’s internal decision process. Our method probes in a model's latent embedding space to identify and regulate neurons that lead to spurious prediction behaviors. We theoretically justify our approach and show that it brings the model closer to an unbiased one. Unlike previous methods, NeuronTune operates without requiring spurious correlation annotations, making it a practical and effective tool for improving model robustness. Experiments across different architectures and data modalities demonstrate that our method significantly mitigates spurious bias in a self-guided way.
Paperid:611
Authors:Nils Palumbo · Ravi Mangal · Zifan Wang · Saranya Vijayakumar · Corina Pasareanu · Somesh Jha
Abstract:
Mechanistic interpretability aims to reverse engineer the computation performed by a neural network in terms of its internal components. Although there is a growing body of research on mechanistic interpretation of neural networks, the notion of a mechanistic interpretation itself is often ad hoc. Inspired by the notion of abstract interpretation from the program analysis literature, which aims to develop approximate semantics for programs, we give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis in a compositional manner. We demonstrate the applicability of these axioms for validating mechanistic interpretations on an existing, well-known interpretability study as well as on a new case study involving a Transformer-based model trained to solve the well-known 2-SAT problem.
Paperid:612
Authors:Etienne Boursier · Nicolas Flammarion
Abstract:
Understanding generalization of overparametrized models remains a fundamental challenge in machine learning. The literature mostly studies generalization from an interpolation point of view, taking convergence towards a global minimum of the training loss for granted. This interpolation paradigm does not seem valid for complex tasks such as in-context learning or diffusion. It has instead been empirically observed that the trained models go from global minima to spurious local minima of the training loss as the number of training samples becomes larger than some level we call the optimization threshold. This paper explores theoretically this phenomenon in the context of two-layer ReLU networks. We demonstrate that, despite overparametrization, networks might converge towards simpler solutions rather than interpolating training data, which leads to a drastic improvement on the test loss. Our analysis relies on the so-called early alignment phase, during which neurons align toward specific directions. This directional alignment leads to a simplicity bias, wherein the network approximates the ground truth model without converging to the global minimum of the training loss. Our results suggest this bias, resulting in an optimization threshold from which interpolation is not reached anymore, is beneficial and enhances the generalization of trained models.
Paperid:613
Authors:Yuri Gardinazzi · Giada Panerai · Karthik Viswanathan · Alessio Ansuini · Alberto Cazzaniga · Matteo Biagetti
Abstract:
Understanding the decision-making processes of large language models is critical given their widespread applications. To achieve this, we aim to connect a formal mathematical framework—zigzag persistence from topological data analysis—with practical and easily applicable algorithms. Zigzag persistence is particularly effective for characterizing data as it dynamically transforms across model layers. Within this framework, we introduce topological descriptors that measure how topological features, $p$-dimensional holes, persist and evolve throughout the layers. Unlike methods that assess each layer individually and then aggregate the results, our approach directly tracks the full evolutionary path of these features. This offers a statistical perspective on how prompts are rearranged and their relative positions changed in the representation space, providing insights into the system’s operation as an integrated whole. To demonstrate the expressivity and applicability of our framework, we highlight how sensitive these descriptors are to different models and a variety of datasets. As a showcase application to a downstream task, we use zigzag persistence to establish a criterion for layer pruning, achieving results comparable to state-of-the-art methods while preserving the system-level perspective.
Paperid:614
Authors:Raanan Yehezkel Rohekar · Yaniv Gurwicz · Sungduk Yu · Estelle Aflalo Guez · Vasudev Lal
Abstract:
Do generative pre-trained transformer (GPT) models, trained only to predict the next token, implicitly learn a world model from which a sequence is generated one token at a time? We examine this question by deriving a causal interpretation of the attention mechanism in GPT, suggesting a causal world model that arises from this interpretation. Furthermore, we propose that GPT models, at inference time, can be utilized for zero-shot causal structure learning for input sequences, and present a confidence score. Empirical evaluation is conducted in a controlled environment using the setup and rules of the Othello and Chess strategy games. A GPT, pre-trained on real-world games played with the intention of winning, is tested on out-of-distribution synthetic data consisting of sequences of random legal moves. We find that the GPT model is likely to generate legal next moves for out-of-distribution sequences for which a causal structure is encoded in the attention mechanism with high confidence. In cases where the GPT model generates illegal moves, it also fails to capture any causal structure.
Paperid:615
Authors:Yulun Wu · Doron Bergman
Abstract:
We present an Adversarially Pretrained Transformer (APT) that is able to perform zero-shot meta-learning on tabular prediction tasks without using any real-world dataset to pre-train the model, building on the recent development of Prior-Data Fitted Networks (PFNs) and TabPFN. Specifically, APT is pre-trained with adversarial synthetic data agents, which continually shift their underlying data-generating distribution and deliberately challenge the model with different synthetic datasets. In addition, we propose a mixture block model architecture that is able to handle classification tasks with an arbitrary number of classes, addressing the class size limitation -- a crucial weakness of prior tabular zero-shot learning algorithms. In experiments, we show that our framework matches state-of-the-art performance on small tabular classification tasks without filtering on dataset characteristics such as number of classes and number of missing values, while maintaining an average runtime under one second. On common benchmark dataset suites in both classification and regression, we show that adversarial pre-training was able to enhance TabPFN's performance. In our analysis, we demonstrate that the adversarial synthetic data agents were able to generate a more diverse collection of data compared to the ordinary random generator in TabPFN. In addition, we demonstrate that our mixture block neural design has improved generalizability and greatly accelerated pre-training.
Paperid:616
Authors:Yijun Dong · Yicheng Li · Yunai Li · Jason Lee · Qi Lei
Abstract:
Weak-to-strong (W2S) generalization is a type of fine-tuning (FT) where a strong (large) student model is trained on pseudo-labels generated by a weak teacher. Surprisingly, W2S FT often outperforms the weak teacher. We seek to understand this phenomenon by leveraging the observation that FT often occurs in intrinsically low-dimensional space. Leveraging the low intrinsic dimensionality of FT, we analyze W2S in the ridgeless regression setting from a variance reduction perspective. For a strong student - weak teacher pair with sufficiently expressive low-dimensional feature subspaces $\mathcal{V}_s, \mathcal{V}_w$, we provide an exact characterization of the variance that dominates the generalization error of W2S. This unveils a surprising virtue of discrepancy between the strong and weak models in W2S: the variance of the weak teacher is inherited by the strong student in $\mathcal{V}_s \cap \mathcal{V}_w$, while reduced by a factor of $\dim(\mathcal{V}_s)/N$ in the subspace of discrepancy $\mathcal{V}_w \setminus \mathcal{V}_s$ with $N$ pseudo-labels for W2S. Further, our analysis casts light on the sample complexities and the scaling of performance gap recovery in W2S. The analysis is supported with experiments on both synthetic regression problems and real image classification tasks.
Paperid:617
Authors:Quentin Bertrand · Juan Duque · Emilio Calvano · Gauthier Gidel
Abstract:
The deployment of machine learning systems in the market economy has triggered academic and institutional fears over potential tacit collusion between fully automated agents. Multiple recent economics studies have empirically shown the emergence of collusive strategies from agents guided by machine learning algorithms. In this work, we prove that multi-agent $Q$-learners playing the iterated prisoner's dilemma can learn to collude. The complexity of the cooperative multi-agent setting yields multiple fixed-point policies for $Q$-learning: the main technical contribution of this work is to characterize the convergence towards a specific cooperative policy. More precisely, in the iterated prisoner's dilemma, we show that with optimistic $Q$-values, any self-play $Q$-learner can provably learn a cooperative policy called Pavlov, also known as the win-stay, lose-shift policy, which strongly differs from the vanilla Pareto-dominated always-defect policy.
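The Pavlov (win-stay, lose-shift) policy referenced above is easy to simulate directly. This sketch shows the fixed policy itself in self-play, not the $Q$-learning dynamics the paper analyzes; the payoff values are the standard prisoner's dilemma choices (an assumption, not taken from the paper):

```python
def pavlov(last_own, last_payoff, good=3):
    """Win-stay, lose-shift: repeat the last move if the payoff was good
    (>= `good`), otherwise switch to the other move."""
    if last_payoff >= good:
        return last_own
    return "C" if last_own == "D" else "D"

def play_ipd(rounds=10, start=("D", "D")):
    # Standard PD payoffs: R=3 (both cooperate), T=5 / S=0 (one defects),
    # P=1 (both defect); rewarding outcomes are CC for both, or a successful
    # defection for the defector.
    payoff = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
              ("D", "C"): (5, 0), ("D", "D"): (1, 1)}
    a, b = start
    history = [start]
    for _ in range(rounds):
        pa, pb = payoff[(a, b)]
        a, b = pavlov(a, pa), pavlov(b, pb)
        history.append((a, b))
    return history
```

Starting from mutual defection, both agents receive the low payoff, shift to cooperation, then stay there forever: exactly the collusive outcome that distinguishes Pavlov from the Pareto-dominated always-defect policy.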
Paperid:618
Authors:Yongqiang Yao · Jingru Tan · Feizhao Zhang · Jiahao Hu · Yazhe Niu · Bo Li · JinXin · Ruihao Gong · Pengfei Liu · Dahua Lin · Ningyi Xu
Abstract:
Vision-language instruct-tuning models have recently made significant progress due to their more comprehensive understanding of the world. In this work, we discover that large-scale 3D parallel training on those models leads to an imbalanced computation load across different devices. The vision and language parts are inherently heterogeneous: their data distribution and model architecture differ significantly, which affects distributed training efficiency. To address this issue, we rebalance the computational load from data, model, and memory perspectives, achieving more balanced computation across devices. Specifically, for the data, instances are grouped into new balanced mini-batches within and across devices. A search-based method is employed for the model to achieve a more balanced partitioning. For memory optimization, we adaptively adjust the re-computation strategy for each partition to utilize the available memory fully. These three perspectives are not independent but are closely connected, forming an omniverse balanced training framework. Extensive experiments are conducted to validate the effectiveness of our method. Compared with the open-source training code of InternVL-Chat, training time is reduced greatly, achieving about 1.8x speed-up. Our method's efficacy and generalizability are further validated across various models and datasets. Codes will be released at https://anonymous.4open.science/r/omnibal_example-E4D7.
Paperid:619
Authors:Jiaming Ma · Binwu Wang · Pengkun Wang · Zhengyang Zhou · Xu Wang · Yang Wang
Abstract:
Recently, spatiotemporal graph convolutional networks have achieved dominant performance in spatiotemporal prediction tasks. However, most models relying on node-to-node messaging interaction exhibit sensitivity to spatiotemporal shifts, encountering out-of-distribution (OOD) challenges. To address these issues, we introduce the Spatio-Temporal OOD Processor (STOP), which employs a centralized messaging mechanism alongside a message perturbation mechanism to facilitate robust spatiotemporal interactions. Specifically, the centralized messaging mechanism integrates Context-Aware Units for coarse-grained spatiotemporal feature interactions with nodes, effectively blocking traditional node-to-node messages. We also implement a message perturbation mechanism to disrupt this messaging process, compelling the model to extract generalizable contextual features from generated variant environments. Finally, we customize a spatiotemporal distributionally robust optimization approach that exposes the model to challenging environments, thereby further enhancing its generalization capabilities. Compared with 14 baselines across six datasets, STOP achieves up to 17.01% improvement in generalization performance and 18.44% improvement in inductive learning performance.
Paperid:620
Authors:Eric Frankel · Kshitij Kulkarni · Dmitriy Drusvyatskiy · Sewoong Oh · Lillian Ratliff
Abstract:
Decision-makers often adaptively influence downstream competitive agents' behavior to minimize their cost, yet in doing so face critical challenges: $(i)$ decision-makers might not \emph{a priori} know the agents' objectives; $(ii)$ agents might \emph{learn} their response, introducing stochasticity into the decision-making process; and $(iii)$ there may be additional non-strategic environmental stochasticity. Characterizing convergence of this complex system is contingent on how the decision-maker accounts for the induced drift and additional noise from the induced learning agent behavior and decision-dependence from environmental stochasticity. To understand how the learning agents' behavior is influenced by the decision-maker's actions, we first consider a decision-maker that deploys an arbitrary sequence of actions which induce a sequence of games and corresponding equilibria. We characterize how the drift and noise in the agents' stochastic algorithms decouple from the optimization error. Leveraging this decoupling and accompanying efficiency estimates, we design decision-maker algorithms that control the induced drift relative to the agent noise. This enables efficient tracking of game-theoretic equilibrium concepts. To this end, we establish a hierarchy of reasonable interaction models that allow progressively more gradient information, and give convergence guarantees.
Paperid:621
Authors:Fan Shi · Bin Li · Xiangyang Xue
Abstract:
Abstract visual reasoning (AVR) enables humans to discover and quickly generalize abstract rules to new scenarios. In the field of artificial intelligence, designing intelligent models with human-like AVR abilities has been a long-standing research topic. Numerous deep AVR solvers have recently achieved remarkable success across various AVR tasks. However, these solvers usually require task-specific designs or training parameters for different reasoning tasks. In this design paradigm, solving a new task often means retraining the model, and sometimes retuning the training parameters or model architectures, thereby increasing the cost of solving problems. In contrast to task-specific approaches, this paper proposes UCGS, a novel framework for abstract visual reasoning that aims to address multiple types of reasoning problems in a unified manner. We prove that some well-known AVR problems (e.g., Raven's Progressive Matrices, Visual Analogy Problems, and Odd-One-Out) can be solved by estimating the conditional probability distribution of the target images within problem panels. By training a unified conditional generative model, we can solve various AVR problems without retraining. The experiments show that with a single round of multi-task training, UCGS can successfully solve various AVR tasks. Further experiments reveal that UCGS exhibits the ability of zero-shot reasoning, enabling it to solve unseen types of AVR problems in the testing phase.
Paperid:622
Authors:Haeseong Yang · Jourdain Lamperski · Oleg Prokopyev
Abstract:
We consider the *max-min eigenvalue augmentation* problem: given $n \times n$ symmetric positive semidefinite matrices $M,A_1,\ldots, A_m$ and a positive integer $k < m$, the goal is to choose a subset $I \subset \{1,\ldots,m\}$ of cardinality at most $k$ that maximizes the minimum eigenvalue of the matrix $M + \sum_{i \in I} A_i$. The problem captures both the *Bayesian E-optimal design* and *maximum algebraic connectivity augmentation* problems. In contrast to the existing work, we do not assume that the *augmentation matrices* are rank-one matrices, and we focus on the setting in which $k < n$. We show that a *simple* randomized rounding method provides a constant-factor approximation if the *optimal increase* is sufficiently large, specifically, if $\mathrm{OPT} - \lambda_{\mathrm{min}}(M) = \Omega(R \ln k)$, where $\mathrm{OPT}$ is the optimal value, and $R$ is the maximum trace of an augmentation matrix. To establish the guarantee, we derive a matrix concentration inequality that is of independent interest. The inequality can be interpreted as an *intrinsic dimension* analog of the matrix Chernoff inequality for the minimum eigenvalue of a sum of independent random positive semidefinite matrices; such an inequality has already been established for the maximum eigenvalue, but not for the minimum eigenvalue.
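To make the objective concrete, here is an illustrative NumPy sketch (not the paper's algorithm or guarantee): evaluating $\lambda_{\mathrm{min}}(M + \sum_{i \in I} A_i)$ on toy PSD matrices, with a naive randomized-rounding baseline. The sizes and the uniform inclusion probability are assumptions for illustration only.

```python
import numpy as np

# Toy instance of max-min eigenvalue augmentation (sizes are assumptions).
rng = np.random.default_rng(0)
n, m, k = 4, 6, 2

def random_psd(n):
    B = rng.standard_normal((n, n))
    return B @ B.T  # PSD by construction

M = random_psd(n)
A = [random_psd(n) for _ in range(m)]

def objective(I):
    """Minimum eigenvalue of M + sum_{i in I} A_i (eigvalsh sorts ascending)."""
    S = M + sum(A[i] for i in I)
    return np.linalg.eigvalsh(S)[0]

# Naive randomized rounding: include each index with probability k/m, keep the
# best of a few trials. A stand-in baseline, not the paper's rounding scheme.
best_I, best_val = frozenset(), objective([])
for _ in range(50):
    I = [i for i in range(m) if rng.random() < k / m][:k]
    v = objective(I)
    if v > best_val:
        best_I, best_val = frozenset(I), v
```

Since each $A_i$ is PSD, adding matrices can only increase (or keep) the minimum eigenvalue, so any rounded subset scores at least $\lambda_{\mathrm{min}}(M)$.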
Paperid:623
Authors:Zhengzhao Pan · Xiaogang Zhang · Hua Chen
Abstract:
Unrestricted adversarial examples (UAEs) have posed greater threats to deep neural networks (DNNs) than perturbation-based adversarial examples (AEs) because they can make extensive changes to images without being restricted by a fixed-norm perturbation budget. Although current diffusion-based methods can generate more natural UAEs than other unrestricted attack methods, the overall effectiveness of such methods is restricted since they are designed for specific attack conditions. Additionally, the naturalness of UAEs still has room for improvement, as these methods primarily focus on leveraging diffusion models as strong priors to enhance the generation process. This paper proposes a flexible framework named Diffusion-based Adversarial Maximum a Posterior (DiffAdvMAP) to generate more natural UAEs for various scenarios. DiffAdvMAP approaches the generation of UAEs by sampling images from posterior distributions, which is achieved by approximating the posterior distribution of UAEs using the prior distribution of real data learned by the diffusion model. This process enhances the naturalness of the UAEs. By incorporating an adversarial constraint to ensure the effectiveness of the attack, DiffAdvMAP exhibits excellent attack ability and defense robustness. A reconstruction constraint is designed to enhance its flexibility, which allows DiffAdvMAP to be tailored to various attack scenarios. Experimental results on ImageNet show that we achieve a better trade-off between image quality, flexibility, and transferability than baseline unrestricted adversarial attack methods.
Paperid:624
Authors:Shuo Han · German Espinosa · Junda Huang · Malcolm MacIver · Daniel A. Dombeck · Bradly Stadie
Abstract:
Recent advances in reinforcement learning (RL) have demonstrated impressive capabilities in complex decisionmaking tasks. This progress raises a natural question: how do these artificial systems compare to biological agents, which have been shaped by millions of years of evolution? To help answer this question, we undertake a comparative study of biological mice and RL agents in a predator-avoidance maze environment. Through this analysis, we identify a striking disparity: RL agents consistently demonstrate a lack of self-preservation instinct, readily risking ``death'' for marginal efficiency gains. These risk-taking strategies are in contrast to biological agents, which exhibit sophisticated risk-assessment and avoidance behaviors. Towards bridging this gap between the biological and artificial, we propose two novel mechanisms that encourage more naturalistic risk-avoidance behaviors in RL agents. Our approach leads to the emergence of naturalistic behaviors, including strategic environment assessment, cautious path planning, and predator avoidance patterns that closely mirror those observed in biological systems. Empirical results show that the new RL agents trained with our framework demonstrate behavioral patterns significantly closer to biological mice while maintaining task performance.
Paperid:625
Authors:André Duarte · Xuandong Zhao · Arlindo Oliveira · Lei Li
Abstract:
How can we verify whether copyrighted content was used to train a large vision-language model (VLM) without direct access to its training data? Motivated by the hypothesis that a VLM is able to recognize images from its training corpus, we propose DIS-CO, a novel approach to infer the inclusion of copyrighted content during the model's development. By repeatedly querying a VLM with specific frames from targeted copyrighted material, DIS-CO extracts the content's identity through free-form text completions. To assess its effectiveness, we introduce MovieTection, a benchmark comprising 14,000 frames paired with detailed captions, drawn from films released both before and after a model’s training cutoff. Our results show that DIS-CO significantly improves detection performance, nearly doubling the average AUC of the best prior method on models with logits available. Our findings also highlight a broader concern: all tested models appear to have been exposed to some extent to copyrighted content. We provide the code in the supplementary materials.
Paperid:626
Authors:Yaolong Yu · Fan Yao · Sinno Jialin Pan
Abstract:
We employ a game-theoretic framework to study the impact of a specific strategic behavior among creators---group behavior---on recommendation systems. In this setting, creators within a group collaborate to maximize their collective utility. We show that group behavior has a limited effect on the game equilibrium when the group size is small. However, when the group size is large, group behavior can significantly alter content distribution and user welfare. Specifically, in a top-$K$ recommendation system with exposure-based rewards, we demonstrate that user welfare can suffer a significant loss due to group strategies, and user welfare does not necessarily increase with larger values of $K$ or more random matching, contrasting sharply with the individual creator case. Furthermore, we investigate user welfare guarantees through the lens of the Price of Anarchy (PoA). In the general case, we establish a negative result on the bound of PoA with exposure rewards, proving that it can be arbitrarily large. We then investigate a user engagement rewarding mechanism, which mitigates the issues caused by large group behavior, showing that $\text{PoA}\leq K+1$ in the general case and $\text{PoA}\leq 2$ in the binary case. Empirical results from simulations further support the effectiveness of the user engagement rewarding mechanism.
Paperid:627
Authors:Tuan Truong · Quyen Tran · Ngọc Quân Phạm · Dinh Phung · Nhat Ho · Trung Le
Abstract:
We introduce Flat Hilbert Bayesian Inference (FHBI), an algorithm designed to enhance generalization in Bayesian inference. Our approach involves an iterative two-step procedure with an adversarial functional perturbation step and a functional descent step within the reproducing kernel Hilbert spaces. This methodology is supported by a theoretical analysis that extends previous findings on generalization ability from finite-dimensional Euclidean spaces to infinite-dimensional functional spaces. To evaluate the effectiveness of FHBI, we conduct comprehensive comparisons against nine baseline methods on the VTAB-1K benchmark, which encompasses 19 diverse datasets across various domains with diverse semantics. Empirical results demonstrate that FHBI consistently outperforms the baselines by notable margins, highlighting its practical efficacy. Our code is available at https://anonymous.4open.science/r/Flat-Hilbert-Variational-Inference-008F/.
Paperid:628
Authors:Peiyuan Liu · Beiliang Wu · Yifan Hu · Naiqi Li · Tao Dai · Jigang Bao · Shutao Xia
Abstract:
Non-stationarity poses significant challenges for multivariate time series forecasting due to the inherent short-term fluctuations and long-term trends that can lead to spurious regressions or obscure essential long-term relationships. Most existing methods either eliminate or retain non-stationarity without adequately addressing its distinct impacts on short-term and long-term modeling. Eliminating non-stationarity is essential for avoiding spurious regressions and capturing local dependencies in short-term modeling, while preserving it is crucial for revealing long-term cointegration across variates. In this paper, we propose TimeBridge, a novel framework designed to bridge the gap between non-stationarity and dependency modeling in long-term time series forecasting. By segmenting input series into smaller patches, TimeBridge applies Integrated Attention to mitigate short-term non-stationarity and capture stable dependencies within each variate, while Cointegrated Attention preserves non-stationarity to model long-term cointegration across variates. Extensive experiments show that TimeBridge consistently achieves state-of-the-art performance in both short-term and long-term forecasting. Additionally, TimeBridge demonstrates exceptional performance in financial forecasting on the CSI 500 and S&P 500 indices, further validating its robustness and effectiveness. Code is available in the supplementary material.
Paperid:629
Authors:James Rowbottom · Georg Maierhofer · Teo Deveney · Eike Müller · Alberto Paganini · Katharina Schratz · Pietro Lió · Carola-Bibiane Schönlieb · Chris Budd
Abstract:
We present a novel and effective approach to achieve optimal mesh relocation in finite element methods (FEMs). The cost and accuracy of FEMs are critically dependent on the choice of mesh points. Mesh relocation (r-adaptivity) seeks to optimise the mesh geometry to obtain the best solution accuracy at a given computational budget. Classical r-adaptivity relies on the solution of a separate nonlinear ``meshing'' PDE to determine mesh point locations. This incurs significant cost at remeshing, and relies on estimates that relate interpolation- and FEM-error. Recent machine learning approaches have focused on the construction of fast surrogates for such classical methods. Instead, our new approach trains a graph neural network (GNN) to determine mesh point locations by directly minimising the FE solution error, computed with the Firedrake PDE system, to achieve higher solution accuracy. Our GNN architecture closely aligns the mesh solution space to that of classical meshing methodologies, thus replacing classical estimates for optimality with a learnable strategy. This allows for rapid and robust training and results in an extremely efficient and effective GNN approach to online r-adaptivity. Our method outperforms both classical, and prior ML, approaches to r-adaptive meshing. In particular, it achieves lower FE solution error, whilst retaining the significant speed-up over classical methods observed in prior ML work.
Paperid:630
Authors:Jian Chen · Zhehao Li · Xiaojie Mao
Abstract:
We study the problem of classification with selectively labeled data, whose distribution may differ from the full population due to historical decision-making. We exploit the fact that in many applications historical decisions were made by multiple decision-makers, each with different decision rules. We analyze this setup under a principled instrumental variable (IV) framework and rigorously study the identification of classification risk. We establish conditions for the exact identification of classification risk and derive tight partial identification bounds when exact identification fails. We further propose a unified cost-sensitive learning (UCL) approach to learn classifiers robust to selection bias in both identification settings. Finally, we theoretically and numerically validate the efficacy of our proposed method.
Paperid:631
Authors:Anish Acharya · Sujay Sanghavi · Alex Dimakis · Inderjit Dhillon
Abstract:
Data pruning, the combinatorial task of selecting a small and informative subset from a large dataset, is crucial for mitigating the enormous computational costs associated with training data-hungry modern deep learning models at scale. Since large-scale data collections are invariably noisy, developing data pruning strategies that remain robust even in the presence of corruption is critical in practice. In response, we propose Geometric Median Matching -- an iterative greedy approach -- that {\em yields a $k$-subset such that the mean of the subset approximates the geometric median of the (potentially) noisy dataset}. Theoretically, we show that Geometric Median Matching enjoys an improved $O(1/k)$ scaling over the $O(1/\sqrt{k})$ scaling of uniform sampling, while achieving the optimal breakdown point of 1/2 even under arbitrary corruption. Extensive experiments across classification and generation benchmarks indicate that Geometric Median Matching consistently outperforms the prior state of the art; the gains become more pronounced at high rates of corruption and aggressive pruning rates, making it a strong baseline for robust data pruning.
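A sketch of the two ingredients behind Geometric Median Matching, as we read the abstract: a geometric median of the (corrupted) dataset, here computed with Weiszfeld iterations, and a greedy $k$-subset whose mean tracks that median. This is an illustrative reconstruction under our own assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
X[:10] += 50.0  # gross corruption; the geometric median is robust to it

def geometric_median(X, iters=100, eps=1e-8):
    """Weiszfeld's algorithm: reweighted mean with inverse-distance weights."""
    y = X.mean(axis=0)
    for _ in range(iters):
        d = np.linalg.norm(X - y, axis=1).clip(min=eps)
        y = (X / d[:, None]).sum(axis=0) / (1.0 / d).sum()
    return y

gm = geometric_median(X)

def greedy_match(X, target, k):
    """Greedily pick points so the running mean approaches `target`."""
    chosen, total = [], np.zeros_like(target)
    mask = np.ones(len(X), dtype=bool)
    for t in range(1, k + 1):
        # candidate mean after adding each point; pick the closest to target
        gains = np.linalg.norm((total + X) / t - target, axis=1)
        gains[~mask] = np.inf
        i = int(np.argmin(gains))
        chosen.append(i); mask[i] = False; total += X[i]
    return chosen

subset = greedy_match(X, gm, k=20)
```

The subset mean lands far closer to the geometric median than the full-data mean, which is dragged away by the corrupted points.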
Paperid:632
Authors:Linqi (Alex) Zhou · Stefano Ermon · Jiaming Song
Abstract:
Diffusion models and Flow Matching generate high-quality samples but are slow at inference, and distilling them into few-step models often leads to instability and extensive tuning. To resolve these trade-offs, we propose Moment Matching Self-Distillation (MMSD), a new class of generative models for one- or few-step sampling with a single-stage training procedure. Unlike distillation, MMSD does not require pre-training initialization and optimization of two networks; and unlike Consistency Models, MMSD guarantees distribution-level convergence and remains stable under various hyperparameters and standard model architectures. MMSD surpasses diffusion models on ImageNet-256x256 with 2.13 FID using only 8 inference steps and achieves state-of-the-art 2-step FID of 2.05 on CIFAR-10 for a model trained from scratch.
Paperid:633
Authors:Alexander Bukharin · Yixiao Li · Pengcheng He · Tuo Zhao
Abstract:
Reward design is a fundamental, yet challenging, aspect of reinforcement learning (RL). Researchers typically utilize feedback signals from the environment to handcraft a reward function, but this process is not always effective due to the varying scale and intricate dependencies of the feedback signals. This paper shows that, by exploiting certain structures, one can ease the reward design process. Specifically, we propose a hierarchical reward design framework, HERON, for two scenarios: (I) the feedback signals naturally present a hierarchy; (II) the reward is sparse, but less important surrogate feedback is available to help policy learning. Both scenarios allow us to design a hierarchical decision tree, induced by the importance ranking of the feedback signals, to compare RL trajectories. With such preference data, we can then train a reward model for policy learning. We apply HERON to several RL applications, and we find that our framework can not only train high-performing agents on a variety of difficult tasks, but also provide additional benefits such as improved sample efficiency and robustness.
Paperid:634
Authors:Long-Kai Huang · Rongyi Zhu · Bing He · Jianhua Yao
Abstract:
Protein Language Models (PLMs), pretrained on extensive evolutionary data from natural proteins, have emerged as indispensable tools for protein design. While powerful, PLMs often struggle to produce proteins with precisely specified functionalities or properties due to inherent challenges in controlling their outputs. In this work, we investigate the potential of Activation Steering - a technique originally developed for controlling text generation in Large Language Models (LLMs) - to direct PLMs toward generating protein sequences with targeted properties. We introduce a simple yet effective method that employs activation editing to steer PLM outputs, and extend this approach to protein optimization through a novel editing site identification module. Through comprehensive experiments on both auto-encoder and autoregressive PLMs focusing on lysozyme-like sequence generation and optimization, we demonstrate that our methods can be seamlessly integrated into existing PLMs without requiring additional training. This work establishes a new paradigm for precise protein engineering using foundation models.
Paperid:635
Authors:Lena Stempfle · Anton Matsson · Newton Mwai · Fredrik Johansson
Abstract:
Handling missing values at test time is challenging for machine learning models, especially when aiming for both high accuracy and interpretability. Established approaches often add bias through imputation or excessive model complexity via missingness indicators. Moreover, either method can obscure interpretability, making it harder to understand how the model utilizes the observed variables in predictions. We propose missingness-avoiding (MA) machine learning, a general framework for training models to rarely require the values of missing (or imputed) features at test time. We create tailored MA learning algorithms for decision trees, tree ensembles, and sparse linear models by incorporating classifier-specific regularization terms in their learning objectives. The tree-based models leverage contextual missingness by reducing reliance on missing values based on the observed context. Experiments on real-world datasets demonstrate that MA-DT, MA-LASSO, MA-RF, and MA-GBT effectively reduce the reliance on features with missing values while maintaining predictive performance competitive with their unregularized counterparts. This shows that our framework gives practitioners a powerful tool to maintain interpretability in predictions with test-time missing values.
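As we read the abstract, the sparse-linear instantiation penalizes reliance on frequently missing features. Here is a hypothetical sketch using a missingness-weighted lasso penalty solved by proximal gradient descent; the penalty form, rates, and data are our illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

# Toy regression where only features 0 and 1 carry signal; features 2-4 are
# often missing at test time (missingness rates are assumptions).
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(n)
miss_rate = np.array([0.0, 0.0, 0.6, 0.6, 0.6])

lam, step = 0.5, 1e-2
w = np.zeros(d)
for _ in range(2000):
    grad = X.T @ (X @ w - y) / n          # least-squares gradient
    w -= step * grad
    # proximal step: per-feature soft-threshold scaled by missingness rate,
    # so frequently-missing features are pushed toward zero weight
    thresh = step * lam * miss_rate
    w = np.sign(w) * np.maximum(np.abs(w) - thresh, 0.0)
```

The fitted model keeps the always-observed features near their true coefficients while zeroing out the frequently missing ones, so predictions rarely need a missing value.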
Paperid:636
Authors:Min Chen · Guansong Pang · Wenjun Wang · Cheng Yan
Abstract:
Spatial-temporal forecasting (STF) plays a pivotal role in urban planning and computing. Spatial-Temporal Graph Neural Networks (STGNNs) excel in modeling spatial-temporal dynamics, thus being robust against noise perturbation. However, they often suffer from relatively poor computational efficiency. Simplifying the architectures can speed up these methods, but it also weakens the robustness with respect to noise interference. In this study, we aim to investigate the problem: can simple neural networks such as Multi-Layer Perceptrons (MLPs) achieve robust spatial-temporal forecasting yet still be efficient? To this end, we first disclose the dual noise effect behind the spatial-temporal data noise, and propose a theoretically grounded principle termed the Robust Spatial-Temporal Information Bottleneck (RSTIB) principle, which preserves wide potential for enhancing the robustness of different types of models. We then meticulously design an implementation, termed RSTIB-MLP, along with a new training regime incorporating a knowledge distillation module, to enhance the robustness of MLPs for STF while maintaining their efficiency. Comprehensive experimental results show that an excellent trade-off between robustness and efficiency can be achieved by RSTIB-MLP compared to state-of-the-art STGNNs and MLP models.
Paperid:637
Authors:Yeonju Ro · Zhenyu Zhang · Souvik Kundu · Zhangyang “Atlas” Wang · Aditya Akella
Abstract:
Large language models (LLMs) excel at capturing global token dependencies via self-attention but face prohibitive compute and memory costs on lengthy inputs. While sub-quadratic methods (e.g., linear attention) can reduce these costs, they often degrade accuracy due to overemphasizing recent tokens. In this work, we first propose dual-state linear attention (DSLA), a novel design that maintains two specialized hidden states—one for preserving historical context and one for tracking recency—thereby mitigating the short-range bias typical of linear-attention architectures. To further balance efficiency and accuracy under dynamic workload conditions, we introduce DSLA-Serve, an online adaptive distillation framework that progressively replaces Transformer layers with DSLA layers at inference time, guided by a sensitivity-based layer ordering. DSLA-Serve uses a chained fine-tuning strategy to ensure that each newly converted DSLA layer remains consistent with previously replaced layers, preserving the overall quality. Extensive evaluations on commonsense reasoning, long-context QA, and text summarization demonstrate that DSLA-Serve yields 2.3× faster inference than Llama2-7B and 3.2× faster than the hybrid Zamba-7B, while retaining comparable performance across downstream tasks. Our ablation studies show that DSLA’s dual states capture both global and local dependencies, addressing the historical-token underrepresentation seen in prior linear attentions.
Paperid:638
Authors:Xi Chen · Yateng Tang · Jiarong Xu · Jiawei Zhang · Siwei Zhang · Sijia Peng · Xuehao Zheng · Yun Xiong
Abstract:
Effectively modeling time information and incorporating it into applications or models involving chronologically occurring events is crucial. Real-world scenarios often involve diverse and complex time patterns, which pose significant challenges for time encoding methods. While previous methods focus on capturing time patterns, many rely on specific inductive biases, such as using trigonometric functions to model periodicity. This narrow focus on single-pattern modeling makes them less effective in handling the diversity and complexity of real-world time patterns. In this paper, we investigate how to improve existing, commonly used time encoding methods and introduce Learnable Transformation-based Generalized Time Encoding (LeTE). We propose using deep function learning techniques to parameterize nonlinear transformations in time encoding, making them learnable and capable of modeling generalized time patterns, including diverse and complex temporal dynamics. By enabling learnable transformations, LeTE encompasses previous methods as specific cases and allows seamless integration into a wide range of tasks. Through extensive experiments across diverse domains, we demonstrate the versatility and effectiveness of LeTE.
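A minimal sketch contrasting the fixed trigonometric inductive bias mentioned above with a learnable nonlinear transformation. The tiny random-weight network stands in for trained parameters and is an assumption for illustration, not the paper's architecture.

```python
import numpy as np

def trig_encode(t, dim=8):
    """Classic periodic encoding: sin/cos at geometrically spaced frequencies."""
    freqs = 1.0 / (10.0 ** np.arange(dim // 2))
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])

# A learnable transformation replaces the fixed sin/cos nonlinearity; here a
# tiny MLP with random (untrained) weights stands in for learned parameters.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 1))
W2 = rng.standard_normal((8, 16))

def learned_encode(t):
    return W2 @ np.tanh(W1 @ np.array([t]))

e_fixed = trig_encode(3.5)     # hard-wired periodic bias
e_learn = learned_encode(3.5)  # shape-compatible, but the nonlinearity is free
```

The fixed encoder can only represent periodic structure, whereas the parameterized map can, after training, adapt to aperiodic or mixed time patterns while producing encodings of the same dimensionality.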
Paperid:639
Authors:Pengxiang Zhao · Xiaoming Yuan
Abstract:
Large Language Models (LLMs) face significant deployment challenges due to their substantial resource requirements. While low-bit quantized weights can reduce memory usage and improve inference efficiency, current hardware lacks native support for mixed-precision General Matrix Multiplication (mpGEMM), resulting in inefficient dequantization-based implementations. Moreover, uniform quantization methods often fail to capture weight distributions adequately, leading to performance degradation. We propose GANQ (GPU-Adaptive Non-Uniform Quantization), a layer-wise post-training non-uniform quantization framework optimized for hardware-efficient lookup-table-based mpGEMM. GANQ achieves superior quantization performance by utilizing a training-free, GPU-adaptive optimization algorithm to efficiently reduce layer-wise quantization errors. Extensive experiments demonstrate GANQ's ability to reduce the perplexity gap from the FP16 baseline compared to state-of-the-art methods for both 3-bit and 4-bit quantization. Furthermore, when deployed on a single NVIDIA RTX 4090 GPU, GANQ's quantized models achieve up to 2.57$\times$ speedup over the baseline, advancing memory and inference efficiency in LLM deployment.
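A minimal sketch of the lookup-table pattern that non-uniform quantization relies on, as described above: each low-bit code indexes a per-row table of learned levels. The shapes, the 3-bit width, and the dequantize-then-GEMM fallback are illustrative assumptions, not GANQ's kernels.

```python
import numpy as np

# Toy lookup-table dequantization for non-uniform quantization.
rng = np.random.default_rng(0)
rows, cols, bits = 4, 8, 3

codes = rng.integers(0, 2**bits, size=(rows, cols))        # quantized weights
tables = np.sort(rng.standard_normal((rows, 2**bits)), 1)  # non-uniform levels

def dequantize(codes, tables):
    """Gather each row's level by its code: W[i, j] = tables[i, codes[i, j]]."""
    return np.take_along_axis(tables, codes, axis=1)

W = dequantize(codes, tables)
x = rng.standard_normal(cols)
y = W @ x  # dequantize-then-GEMM fallback; native mpGEMM would fuse these steps
```

Because the levels are free per row rather than evenly spaced, the table can follow a skewed or heavy-tailed weight distribution that uniform quantization grids cannot.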
Paperid:640
Authors:Pranav Nair · Puranjay Datta · Jeff Dean · Prateek Jain · Aditya Kusupati
Abstract:
Quantizing model weights is critical for reducing the communication and inference costs of large models. However, quantizing models, especially to low precisions like int4 or int2, requires a trade-off in model quality; int2, in particular, is known to severely degrade model quality. Consequently, practitioners are often forced to maintain multiple models with different quantization levels or serve a single model that best satisfies the quality-latency trade-off. On the other hand, integer data types, such as int8, inherently possess a nested (Matryoshka) structure where smaller bit-width integers, like int4 or int2, are nested within the most significant bits. Leveraging this insight, in this paper, we propose Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique that alleviates the aforementioned challenge. This technique allows us to train and maintain a single quantized model but serve it with the precision demanded by the deployment. Furthermore, due to the co-training and co-distillation regularization provided by MatQuant, the int2 precision models extracted by MatQuant can be up to 10% more accurate than standard int2 quantization (using techniques like QAT or OmniQuant). This represents significant progress in model quantization, demonstrated by the fact that, with the same recipe, an int2 FFN-quantized Gemma-2 9B model is more accurate than an int8 FFN-quantized Gemma-2 2B model.
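The nested (Matryoshka) structure of integer types is easy to illustrate: the most significant bits of an 8-bit code are themselves a valid lower-precision code. A sketch with made-up values, assuming unsigned codes for simplicity:

```python
# Sketch of the nested integer structure that MatQuant builds on: slicing the
# most significant bits of an int8 code yields the nested int4/int2 code.
def slice_msb(q8: int, bits: int) -> int:
    """Keep the top `bits` bits of an unsigned 8-bit code."""
    assert 0 <= q8 < 256 and 1 <= bits <= 8
    return q8 >> (8 - bits)

w8 = 0b10110110        # an 8-bit quantized weight (182)
w4 = slice_msb(w8, 4)  # nested int4 code: 0b1011
w2 = slice_msb(w8, 2)  # nested int2 code: 0b10
```

A single stored int8 model can thus be served at 8, 4, or 2 bits by bit-slicing alone, which is the nesting MatQuant's co-training exploits.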
Paperid:641
Authors:Jonas Schweisthal · Dennis Frauen · Maresa Schröder · Konstantin Hess · Niki Kilbertus · Stefan Feuerriegel
Abstract:
Reliable estimation of treatment effects from observational data is important in many disciplines such as medicine. However, estimation is challenging when unconfoundedness, a standard assumption in the causal inference literature, is violated. In this work, we leverage arbitrary (potentially high-dimensional) instruments to estimate bounds on the conditional average treatment effect (CATE). Our contributions are three-fold: (1) We propose a novel approach for partial identification through a mapping of instruments to a discrete representation space so that we yield valid bounds on the CATE. This is crucial for reliable decision-making in real-world applications. (2) We derive a two-step procedure that learns tight bounds using a tailored neural partitioning of the latent instrument space. As a result, we avoid instability issues due to numerical approximations or adversarial training. Furthermore, our procedure aims to reduce the estimation variance in finite-sample settings to yield more reliable estimates. (3) We show theoretically that our procedure obtains valid bounds while reducing estimation variance. We further perform extensive experiments to demonstrate the effectiveness across various settings. Overall, our procedure offers a novel path for practitioners to make use of potentially high-dimensional instruments (e.g., as in Mendelian randomization).
Paperid:642
Authors:Max Wilcoxson · Qiyang Li · Kevin Frans · Sergey Levine
Abstract:
Unsupervised pretraining has been transformative in many supervised domains. However, applying such ideas to reinforcement learning (RL) presents a unique challenge in that finetuning does not involve mimicking task-specific data, but rather exploring and locating the solution through iterative self-improvement. In this work, we study how unlabeled offline trajectory data can be leveraged to learn efficient exploration strategies. While prior data can be used to pretrain a set of low-level skills, or as additional off-policy data for online RL, it has been unclear how to combine these ideas effectively for online exploration. Our method SUPE (Skills from Unlabeled Prior data for Exploration) demonstrates that a careful combination of these ideas compounds their benefits. Our method first extracts low-level skills using a variational autoencoder (VAE), and then pseudo-labels unlabeled trajectories with optimistic rewards and high-level action labels, transforming prior data into high-level, task-relevant examples that encourage novelty-seeking behavior. Finally, SUPE uses these transformed examples as additional off-policy data for online RL to learn a high-level policy that composes pretrained low-level skills to explore efficiently. In our experiments, SUPE consistently outperforms prior strategies across a suite of 42 long-horizon, sparse-reward tasks.
Paperid:643
Authors:Ziming Hong · Runnan Chen · Zengmao Wang · Bo Han · Bo Du · Tongliang Liu
Abstract:
Data-free knowledge distillation (DFKD) transfers knowledge from a teacher to a student without access to the real in-distribution (ID) data. Its common solution is to use a generator to synthesize fake data and use them as a substitute for real ID data. However, existing works typically assume teachers are trustworthy, leaving the robustness and security of DFKD from untrusted teachers largely unexplored. In this work, we conduct the first investigation into distilling non-transferable learning (NTL) teachers using DFKD, where the transferability from an ID domain to an out-of-distribution (OOD) domain is prohibited. We find that NTL teachers fool DFKD by diverting the generator's attention from the useful ID knowledge to the misleading OOD knowledge. This hinders ID knowledge transfer but prioritizes OOD knowledge transfer. To mitigate this issue, we propose Adversarial Trap Escaping (ATEsc) to benefit DFKD by identifying and filtering out OOD-like synthetic samples. Specifically, inspired by the evidence that NTL teachers show stronger adversarial robustness on OOD samples than ID samples, we split synthetic samples into two groups according to their robustness. The fragile group is treated as ID-like data and used for normal knowledge distillation, while the robust group is seen as OOD-like data and utilized for forgetting OOD knowledge. Extensive experiments demonstrate the effectiveness of ATEsc for improving DFKD against NTL teachers.
Paperid:644
Authors:Hu Zican · Wei Liu · Xiaoye Qu · Xiangyu Yue · Chunlin Chen · Zhi Wang · Yu Cheng
Abstract:
While showing sophisticated reasoning abilities, large language models (LLMs) still struggle with long-horizon decision-making tasks due to deficient exploration and long-term credit assignment, especially in sparse-reward scenarios. Inspired by the divide-and-conquer principle, we propose an innovative framework, GLIDER (Grounding Language Models as Efficient Decision-Making Agents via Offline Hierarchical Reinforcement Learning), that introduces a parameter-efficient and generally applicable hierarchy to LLM policies. We develop a scheme where the low-level controller is supervised with abstract, step-by-step plans that are learned and instructed by the high-level policy. This design decomposes complicated problems into a series of coherent chain-of-thought reasoning sub-tasks, providing flexible temporal abstraction to significantly enhance exploration and learning for long-horizon tasks. Furthermore, GLIDER facilitates fast online adaptation to non-stationary environments owing to the strong transferability of its task-agnostic low-level skills. Experiments on ScienceWorld and ALFWorld benchmarks show that GLIDER achieves consistent performance gains, along with enhanced generalization capabilities.
Paperid:645
Authors:Haoran He · Yuxiao Ye · Emmanuel Bengio · Qingpeng Cai · Ling Pan
Abstract:
The Generative Flow Network (GFlowNet) is a probabilistic framework in which an agent learns a stochastic policy and flow functions to sample objects with probability proportional to an unnormalized reward function. GFlowNets share a strong connection with reinforcement learning (RL) that typically aims to maximize reward. A number of recent works explored connections between GFlowNets and maximum entropy (MaxEnt) RL, which incorporates entropy regularization into the standard RL objective. However, the relationship between GFlowNets and standard RL remains largely unexplored, despite the inherent similarities in their sequential decision-making nature. While GFlowNets can discover diverse solutions through specialized flow-matching objectives, connecting them to standard RL can simplify their implementation through well-established RL principles and also improve RL's capabilities in diverse solution discovery (a critical requirement in many real-world applications); bridging this gap can further unlock the potential of both fields. In this paper, we bridge this gap by revealing a fundamental connection between GFlowNets and one of the most basic components of RL -- policy evaluation. Surprisingly, we find that the value function obtained from evaluating a uniform policy is closely associated with the flow functions in GFlowNets. Building upon these insights, we introduce a rectified random policy evaluation (RPE) algorithm, which achieves the same reward-matching effect as GFlowNets based on simply evaluating a fixed random policy, offering a new perspective. Empirical results across extensive benchmarks demonstrate that RPE achieves competitive results compared to previous approaches, shedding light on the previously overlooked connection between (non-MaxEnt) RL and GFlowNets.
Paperid:646
Authors:Jiacheng Zhang · Benjamin Rubinstein · Jingfeng ZHANG · Feng Liu
Abstract:
Statistical adversarial data detection (SADD) detects whether an upcoming batch contains adversarial examples (AEs) by measuring the distributional discrepancies between clean examples (CEs) and AEs. In this paper, we reveal the potential strength of SADD-based methods by theoretically showing that minimizing distributional discrepancy can help reduce the expected loss on AEs. Nevertheless, despite these advantages, SADD-based methods have a potential limitation: they discard inputs detected as AEs, leading to the loss of clean information within those inputs. To address this limitation, we propose a two-pronged adversarial defense method, named Distributional-Discrepancy-based Adversarial Defense (DDAD). In the training phase, DDAD first optimizes the test power of the maximum mean discrepancy (MMD) to derive MMD-OPT, and then trains a denoiser by minimizing the MMD-OPT between CEs and AEs. In the inference phase, DDAD first leverages MMD-OPT to differentiate CEs and AEs, and then applies a two-pronged process: (1) directly feeding the detected CEs into the classifier, and (2) removing noise from the detected AEs by the distributional-discrepancy-based denoiser. Extensive experiments show that DDAD outperforms current state-of-the-art (SOTA) defense methods by notably improving clean and robust accuracy on CIFAR-10 and ImageNet-1K against adaptive white-box attacks. The code is available at: https://anonymous.4open.science/r/DDAD-DB60.
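For intuition about the distributional-discrepancy measure at the core of DDAD, a biased squared-MMD estimate with an RBF kernel can be computed as follows (a minimal generic sketch; DDAD additionally optimizes the test power of the MMD to obtain MMD-OPT, which is not shown here, and the Gaussian batches stand in for clean and adversarial examples):

```python
import numpy as np

def mmd2_biased(X, Y, sigma=1.0):
    """Biased squared-MMD estimate with RBF kernel k(a,b)=exp(-||a-b||^2/(2*sigma^2))."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(64, 2))    # stand-in for clean examples (CEs)
shifted = rng.normal(3.0, 1.0, size=(64, 2))  # stand-in for perturbed examples (AEs)
same = rng.normal(0.0, 1.0, size=(64, 2))     # another clean batch
```

A batch drawn from a shifted distribution yields a much larger discrepancy against the clean batch than a second clean batch does, which is what makes thresholding on the MMD usable as a detector.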
Paperid:647
Authors:Zhiyao Ren · Siyuan Liang · Aishan Liu · Dacheng Tao
Abstract:
In-context learning (ICL) has demonstrated remarkable success in large language models (LLMs) due to its adaptability and parameter-free nature. However, it also introduces a critical vulnerability to backdoor attacks, where adversaries can manipulate LLM behaviors by simply poisoning a few ICL demonstrations. In this paper, we propose, for the first time, the dual-learning hypothesis, which posits that LLMs simultaneously learn both the task-relevant latent concepts and backdoor latent concepts within poisoned demonstrations, jointly influencing the probability of model outputs. Through theoretical analysis, we derive an upper bound for ICL backdoor effects, revealing that the vulnerability is dominated by the concept preference ratio between the task and the backdoor. Motivated by these findings, we propose ICLShield, a defense mechanism that dynamically adjusts the concept preference ratio. Our method encourages LLMs to select clean demonstrations during the ICL phase by leveraging confidence and similarity scores, effectively mitigating susceptibility to backdoor attacks. Extensive experiments across multiple LLMs and tasks demonstrate that our method achieves state-of-the-art defense effectiveness, significantly outperforming existing approaches (+26.02\% on average). Furthermore, our method exhibits exceptional adaptability and defensive performance even for closed-source models (e.g., GPT-4). Our code can be found at Anonymous Github.
Paperid:648
Authors:Duo Liu · Zhiquan Tan · Linglan Zhao · Zhongqiang Zhang · Xiangzhong Fang · Weiran Huang
Abstract:
Generalized Category Discovery (GCD) aims to identify unlabeled samples by leveraging the base knowledge from labeled ones, where the unlabeled set consists of both base and novel classes. Since clustering methods are time-consuming at inference, parametric-based approaches have become more popular. However, recent parametric-based methods suffer from inferior base discrimination due to the unreliable self-supervision. To address this issue, we propose a Reciprocal Learning Framework (RLF) that introduces an auxiliary branch devoted to base classification. During training, the main branch filters the pseudo-base samples to the auxiliary branch. In response, the auxiliary branch provides more reliable soft labels for the main branch, leading to a virtuous cycle. Furthermore, we introduce Class-wise Distribution Regularization (CDR) to mitigate the learning bias towards base classes. CDR essentially increases the prediction confidence of the unlabeled data and boosts the novel class performance. Combined with both components, our proposed method, RLCD, achieves superior performance in all classes with negligible extra computation. Comprehensive experiments across seven GCD datasets validate its superiority. Our codes will be available upon acceptance.
Paperid:649
Authors:Kexuan Shi · Hai Chen · Leheng Zhang · Shuhang Gu
Abstract:
Implicit Neural Representations (INRs), as a versatile representation paradigm, have achieved success in various computer vision tasks. Due to the spectral bias of the vanilla multilayer perceptrons (MLPs), existing methods focus on designing MLPs with sophisticated architectures or repurposing training techniques for highly accurate INRs. In this paper, we delve into the linear dynamics model of MLPs and theoretically identify the empirical Neural Tangent Kernel (eNTK) matrix as a reliable link between spectral bias and training dynamics. Based on this insight, we propose a practical Inductive Gradient Adjustment (IGA) method, which could purposefully improve the spectral bias via inductive generalization of the eNTK-based gradient transformation matrix. Theoretical and empirical analyses validate the impact of IGA on spectral bias. Further, we evaluate our method on different INR tasks with various INR architectures and compare it to existing training techniques. The superior and consistent improvements clearly validate the advantage of our IGA. Armed with our gradient adjustment method, better INRs with more enhanced texture details and sharpened edges can be learned from data by tailored impacts on spectral bias.
Paperid:650
Authors:Haojie Duanmu · Xiuhong Li · Zhihang Yuan · Size Zheng · Jiangfei Duan · Xingcheng ZHANG · Dahua Lin
Abstract:
Mixture-of-Experts (MoE) models face deployment challenges due to their large parameter counts and computational demands. We explore quantization for MoE models and highlight two key insights: 1) linear blocks exhibit varying quantization sensitivity, and 2) divergent expert activation frequencies create heterogeneous computational characteristics. Based on these observations, we introduce MxMoE, a mixed-precision optimization framework for MoE models that considers both algorithmic and system perspectives. MxMoE navigates the design space defined by parameter sensitivity, expert activation dynamics, and hardware resources to derive efficient mixed-precision configurations. Additionally, MxMoE automatically generates optimized mixed-precision Group-GEMM kernels, enabling parallel execution of GEMMs with different precisions. Evaluations show that MxMoE outperforms existing methods, achieving 2.4 lower Wikitext-2 perplexity than GPTQ at 2.25-bit and delivering up to 3.4$\times$ speedup over full precision at 5-bit weight-activation quantization.
Paperid:651
Authors:Qi He · Peiran Yu · Ziyi Chen · Heng Huang
Abstract:
Shuffling-type gradient methods are favored in practice for their simplicity and rapid empirical performance. Despite extensive development of convergence guarantees under various assumptions in recent years, most require the Lipschitz smoothness condition, which is often not met in common machine learning models. We highlight this issue with specific counterexamples. To address this gap, we revisit the convergence rates of shuffling-type gradient methods without assuming Lipschitz smoothness. Using our stepsize strategy, the shuffling-type gradient algorithm not only converges under weaker assumptions but also matches the current best-known convergence rates, thereby broadening its applicability. We prove the convergence rates for nonconvex, strongly convex, and non-strongly convex cases, each under both random reshuffling and arbitrary shuffling schemes, under a general bounded variance condition. Numerical experiments further validate the performance of our shuffling-type gradient algorithm, underscoring its practical efficacy.
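A minimal sketch of a shuffling-type (random reshuffling) gradient method on a toy finite-sum objective may help fix the setting (illustration only; the paper's stepsize strategy and its non-Lipschitz assumptions are not reproduced here):

```python
import numpy as np

def reshuffled_sgd(grad_i, x0, n, epochs, lr, rng):
    """Random reshuffling: each epoch visits every component gradient exactly once."""
    x = x0
    for _ in range(epochs):
        for i in rng.permutation(n):  # fresh random ordering each epoch
            x = x - lr * grad_i(i, x)
    return x

# Toy finite-sum objective f(x) = (1/n) * sum_i (x - a_i)^2 / 2, minimized at mean(a).
a = np.array([1.0, 2.0, 3.0, 6.0])
grad_i = lambda i, x: x - a[i]  # gradient of the i-th component
x_star = reshuffled_sgd(grad_i, 0.0, len(a), epochs=200, lr=0.05,
                        rng=np.random.default_rng(1))
```

Unlike with-replacement SGD, each epoch is a single pass over a permutation of the components, which is the sampling scheme the convergence rates above are stated for.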
Paperid:652
Authors:Edward Pearce-Crump
Abstract:
Group equivariant neural networks have proven effective in modelling a wide range of tasks where the data lives in a classical geometric space and exhibits well-defined group symmetries. However, these networks are not suitable for learning from data that lives in a non-commutative geometry, described formally by non-commutative $\mathcal{C}^{\ast}$-algebras, since the $\mathcal{C}^{\ast}$-algebra of continuous functions on a compact matrix group is commutative. To address this limitation, we derive the existence of a new type of equivariant neural network, called compact matrix quantum group equivariant neural networks, which encode symmetries that are described by compact matrix quantum groups. We characterise the weight matrices that appear in these neural networks for the easy compact matrix quantum groups, which are defined by set partitions. As a result, we obtain new characterisations of equivariant weight matrices for some compact matrix groups that have not appeared previously in the machine learning literature.
Paperid:653
Authors:Mingjun Wang · Yihan Wen · Bin Sun · Jianan Mu · Juan Li · Xiaoyi Wang · Jing Ye · Bei Yu · Huawei Li
Abstract:
Accurate and efficient timing prediction at the register-transfer level (RTL) remains a fundamental challenge in electronic design automation (EDA), particularly in balancing accuracy with computational efficiency. While static timing analysis (STA) provides high-fidelity results through comprehensive physical parameters, its computational overhead makes it impractical for rapid design iterations. Conversely, existing RTL-level approaches sacrifice accuracy due to limited physical information. We propose RTLDistil, a novel cross-stage knowledge distillation framework that bridges this gap by transferring precise physical characteristics from a layout-aware teacher model (Teacher GNN) to an efficient RTL-level student model (Student GNN), both implemented as graph neural networks (GNNs). RTLDistil efficiently predicts key timing metrics, such as arrival time (AT), and employs a multi-granularity distillation strategy that captures timing-critical features at node, subgraph, and global levels. Experimental results demonstrate that RTLDistil achieves a significant reduction in RTL-level timing prediction error compared to state-of-the-art prediction models. This framework enables accurate early-stage timing prediction, advancing EDA's ``left-shift'' paradigm while maintaining computational efficiency.
Paperid:654
Authors:Lucas Torroba Hennigen · Hunter Lang · Han Guo · Yoon Kim
Abstract:
We study memory-efficient optimization of neural networks with linear gradient transformations, where the gradients are linearly mapped to a lower dimensional space than the full parameter space, thus saving memory required for gradient accumulation and optimizer state persistence. The model parameters are updated by first performing an optimization step in the lower dimensional space and then going back into the original parameter space via the linear map's transpose. We show that optimizing the model in this transformed space is equivalent to reparameterizing the original model through a linear adapter that additively modifies the model parameters, and then only optimizing the adapter's parameters. When the transformation is Kronecker-factored, this establishes an equivalence between GaLore and one-sided LoRA. We show that this duality between gradient transformations and adapter-based reparameterizations unifies existing approaches to memory-efficient training and suggests new techniques for improving training efficiency and memory use.
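The stated equivalence between a transformed-gradient step and plain SGD on a linear adapter can be checked numerically for vanilla SGD (a toy sketch of the duality as described in the abstract; the map `S` and the quadratic objective are arbitrary choices for illustration, and the paper's Kronecker-factored case is not shown):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, lr = 6, 2, 0.1
S = rng.normal(size=(r, d))       # fixed linear map: full space -> low-dim space
w0 = rng.normal(size=d)
grad = lambda w: 2.0 * (w - 1.0)  # gradient of ||w - 1||^2

# View (a): map the gradient down with S, step in r dims, map back with S^T.
w = w0.copy()
for _ in range(3):
    w = w - lr * (S.T @ (S @ grad(w)))

# View (b): reparameterize w = w0 + S^T a and run plain SGD on the adapter a.
a = np.zeros(r)
for _ in range(3):
    a = a - lr * (S @ grad(w0 + S.T @ a))
w_adapter = w0 + S.T @ a
```

Both loops follow the identical recursion $w_{t+1} = w_t - \eta\, S^\top S\, \nabla f(w_t)$, so the two views produce the same iterates.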
Paperid:655
Authors:Jiayuan Zhang · Xuefeng Liu · Jianwei Niu · Shaojie Tang · Haotian Yang · Xinghao Wu
Abstract:
The out-of-distribution (OOD) generalization problem in federated learning (FL) has recently attracted significant research interest. A common approach, derived from centralized learning, is to extract causal features which exhibit causal relationships with the label. However, in FL, the global feature extractor typically captures only invariant causal features shared across clients and thus discards many other causal features that are potentially useful for OOD generalization. To address this problem, we propose FedUni, a simple yet effective architecture trained to extract all possible causal features from any input. FedUni consists of a comprehensive feature extractor, designed to identify a union of all causal feature types in the input, followed by a feature compressor, which discards potential fake causal features. With this architecture, FedUni can benefit from collaborative training in FL while avoiding the cost of model aggregation (i.e., extracting only invariant features). In addition, to further enhance the feature extractor's ability to capture causal features, FedUni adds a causal intervention module on the client side, which employs a counterfactual generator to generate counterfactual examples that simulate distribution shifts. Extensive experiments and theoretical analysis demonstrate that our method significantly improves OOD generalization performance.
Paperid:656
Authors:Xuyang Chen · Lin Zhao
Abstract:
Actor-critic algorithms have been instrumental in boosting the performance of numerous challenging applications involving continuous control, such as highly robust and agile robot motion control. However, their theoretical understanding remains largely underdeveloped. Existing analyses mostly focus on finite state-action spaces and on simplified variants of actor-critic, such as double-loop updates with i.i.d. sampling, which are often impractical for real-world applications. We consider the canonical and widely adopted single-timescale updates with Markovian sampling in continuous state-action space. Specifically, we establish finite-time convergence by introducing a novel Lyapunov analysis framework, which provides a unified convergence characterization of both the actor and the critic. Our approach is less conservative than previous methods and offers new insights into the coupled dynamics of actor-critic updates.
Paperid:657
Authors:Batuhan Karaman · ishmam zabir · Alon Benhaim · Vishrav Chaudhary · Mert Sabuncu · Xia Song
Abstract:
Achieving both high safety and high usefulness simultaneously in large language models has become a critical challenge in recent years. Models often exhibit unsafe behavior or adopt an overly cautious approach leading to frequent overrefusal of benign prompts, which reduces their usefulness. A major factor underlying these behaviors is how the models are finetuned and aligned, particularly the nature and extent of the data used. In this work, we examine how overgenerating finetuning data with advanced teacher models (e.g., GPT-4o), covering both general-purpose and toxic prompts, affects safety and usefulness in instruction-following language models. Additionally, we present POROver, an alignment strategy designed for models that are highly safe but prone to overrefusal. POROver employs preference optimization algorithms and leverages completions from an advanced teacher model to reduce overrefusals while maintaining safety. Our results show that overgenerating completions for general-purpose prompts significantly boosts safety with only a minimal impact on usefulness. Specifically, the F1 score calculated between safety and usefulness increases from 74.4% to 91.8% because of a substantial rise in safety. Moreover, overgeneration for toxic prompts raises usefulness from 11.1% to 57.6% while preserving safety. Finally, applying POROver increases usefulness further, from 57.6% to 82.1%, while keeping safety at comparable levels.
Paperid:658
Authors:Chen-Xiao Gao · Chenyang Wu · Mingjun Cao · Chenjun Xiao · Yang Yu · Zongzhang Zhang
Abstract:
The primary focus of offline reinforcement learning (RL) is to manage the risk of hazardous exploitation of out-of-distribution actions. An effective approach to achieve this goal is through behavior regularization, which augments conventional RL objectives by incorporating constraints that enforce the policy to remain close to the behavior policy. Nevertheless, existing literature on behavior-regularized RL primarily focuses on explicit policy parameterizations, such as Gaussian policies. Consequently, it remains unclear how to extend this framework to more advanced policy parameterizations, such as diffusion models. In this paper, we introduce BDPO, a principled behavior-regularized RL framework tailored for diffusion-based policies, thereby combining the expressive power of diffusion policies and the robustness provided by regularization. The key ingredient of our method is to calculate the Kullback-Leibler (KL) regularization analytically as the accumulated discrepancies in reverse-time transition kernels along the diffusion trajectory. By integrating the regularization, we develop an efficient two-time-scale actor-critic RL algorithm that produces the optimal policy while respecting the behavior constraint. Comprehensive evaluations conducted on synthetic 2D tasks and continuous control tasks from the D4RL benchmark validate its effectiveness and superior performance.
Paperid:659
Authors:Zhicheng Jiang · Qiao Sun · Hanhong Zhao · Kaiming He
Abstract:
It is widely believed that noise conditioning is indispensable for denoising diffusion models to work successfully. This work challenges this belief. Motivated by research on blind image denoising, we investigate a variety of denoising-based generative models in the absence of noise conditioning. To our surprise, most models exhibit graceful degradation, and in some cases, they even perform better without noise conditioning. We provide a mathematical analysis of the error introduced by removing noise conditioning and demonstrate that our analysis aligns with empirical observations. We further introduce a noise-unconditional model that achieves a competitive FID of 2.23 on CIFAR-10, significantly narrowing the gap to leading noise-conditional models. We hope our findings will inspire the community to revisit the foundations and formulations of denoising generative models.
Paperid:660
Authors:zhou changshi · Feng Luan · hujiarui · Shaoqiang Meng · Zhipeng Wang · Yanchao Dong · Yanmin Zhou · Bin He
Abstract:
Garment manipulation is a significant challenge for robots due to the complex dynamics and potential self-occlusion of garments. Most existing methods of efficient garment unfolding overlook the crucial role of standardization of flattened garments, which could significantly simplify downstream tasks like folding, ironing, and packing. This paper presents APS-Net, a novel approach to garment manipulation that combines unfolding and standardization in a unified framework. APS-Net employs a dual-arm, multi-primitive policy with dynamic fling to quickly unfold crumpled garments and pick-and-place (p&p) for precise alignment. The purpose of garment standardization during unfolding involves not only maximizing surface coverage but also aligning the garment’s shape and orientation to predefined requirements. To guide effective robot learning, we introduce a novel factorized reward function for standardization, which incorporates garment coverage (Cov), keypoint distance (KD), and intersection-over-union (IoU) metrics. Additionally, we introduce a spatial action mask and an Action Optimized Module to improve unfolding efficiency by selecting actions and operation points effectively. In simulation, APS-Net outperforms state-of-the-art methods for long sleeves, achieving 3.9% better coverage, 5.2% higher IoU, and a 0.14 decrease in KD (7.09% relative reduction). Real-world folding tasks further demonstrate that standardization simplifies the folding process. Project page: https://hellohaia.github.io/APS/
Paperid:661
Authors:Zebin You · Jingyang Ou · Xiaolu Zhang · Jun Hu · JUN ZHOU · Chongxuan Li
Abstract:
Although masked image generation models and masked diffusion models are designed with different motivations and objectives, we observe that they can be unified within a single framework. Building upon this insight, we carefully explore the design space of training and sampling, identifying key factors that contribute to both performance and efficiency. Based on the improvements observed during this exploration, we develop our model, referred to as **eMIGM**. Empirically, eMIGM demonstrates strong performance on ImageNet generation, as measured by Fréchet Inception Distance (FID). In particular, on ImageNet $256\times256$, with a similar number of function evaluations (NFEs) and model parameters, eMIGM outperforms the seminal VAR. Moreover, as NFE and model parameters increase, eMIGM achieves performance comparable to the state-of-the-art continuous diffusion models while requiring less than 40\% of the NFE. Additionally, on ImageNet $512\times512$, with only about 60\% of the NFE, eMIGM outperforms the state-of-the-art continuous diffusion models.
Paperid:662
Authors:Yingpeng Tang · Chao Ren · Xiaoli Tang · Sheng-Jun Huang · Lizhen Cui · Han Yu
Abstract:
Federated Active Learning (FAL) aims to learn an effective global model, while minimizing label queries. Owing to privacy requirements, it is challenging to design effective active data selection schemes due to the lack of cross-client query information. In this paper, we bridge this important gap by proposing the \underline{F}ederated \underline{A}ctive data selection by \underline{LE}verage score sampling (FALE) method. It is designed for regression tasks in the presence of non-i.i.d. client data to enable the server to select data globally in a privacy-preserving manner. Based on FedSVD, FALE aims to estimate the utility of unlabeled data and perform data selection via leverage score sampling. Besides, a secure model learning framework is designed for federated regression tasks to exploit supervision. FALE can operate without requiring an initial labeled set and select the instances in a single pass, significantly reducing communication overhead. Theoretical analysis establishes the query complexity for FALE to achieve constant factor approximation and relative error approximation. Extensive experiments on 11 benchmark datasets demonstrate significant improvements of FALE over existing state-of-the-art methods.
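For intuition, the classical statistical leverage scores that underlie leverage score sampling can be computed from a thin SVD of a design matrix (a generic centralized sketch; FALE's privacy-preserving, FedSVD-based estimation across clients is not shown):

```python
import numpy as np

def leverage_scores(X):
    """Row leverage scores: squared row norms of U from the thin SVD of X."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return (U ** 2).sum(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[0] *= 20.0                   # make one row highly influential
scores = leverage_scores(X)
probs = scores / scores.sum()  # leverage score sampling distribution over rows
chosen = rng.choice(len(X), size=10, replace=True, p=probs)
```

The scores sum to the rank of `X`, and influential rows (here the scaled row 0) receive proportionally higher probability of being selected for labeling.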
Paperid:663
Authors:Jiawei Zhang · Shuang Yang · Bo Li
Abstract:
Large Language Model (LLM) agents equipped with external tools have become increasingly powerful for handling complex tasks such as web shopping, automated email replies, and financial trading. However, these advancements also amplify the risks of adversarial attacks, particularly when LLM agents can access sensitive external functionalities. Moreover, because LLM agents engage in extensive reasoning or planning before executing final actions, manipulating them into performing targeted malicious actions or invoking specific tools remains a significant challenge. Consequently, directly embedding adversarial strings in malicious instructions or injecting malicious prompts into tool interactions has become less effective against modern LLM agents. In this work, we present UDora, a unified red teaming framework designed for LLM Agents that dynamically leverages the agent's own reasoning processes to compel it toward malicious behavior. Specifically, UDora first samples the model's reasoning for the given task, then automatically identifies multiple optimal positions within these reasoning traces to insert targeted perturbations. Subsequently, it uses the modified reasoning as the objective to optimize the adversarial strings. By iteratively applying this process, the LLM agent will then be induced to undertake designated malicious actions or to invoke specific malicious tools. Our approach demonstrates superior effectiveness compared to existing methods across three LLM agent datasets.
Paperid:664
Authors:Justin S. Kang · Landon Butler · Abhineet Agarwal · Yigit Efe Erginbas · Ramtin Pedarsani · Bin Yu · Kannan Ramchandran
Abstract:
Large language models (LLMs) have revolutionized machine learning due to their ability to capture complex interactions between input features. Popular post-hoc explanation methods like SHAP provide *marginal* feature attributions, while their extensions to interaction importances only scale to small input lengths ($\approx 20$). We propose *Spectral Explainer* (SPEX), a model-agnostic interaction attribution algorithm that efficiently scales to large input lengths ($\approx 1000$). SPEX exploits underlying natural sparsity among interactions—common in real-world data—and applies a sparse Fourier transform using a channel decoding algorithm to efficiently identify important interactions. We perform experiments across three difficult long-context datasets that require LLMs to utilize interactions between inputs to complete the task. For large inputs, SPEX outperforms marginal attribution methods by up to 20\% in terms of faithfully reconstructing LLM outputs. Further, SPEX successfully identifies key features and interactions that strongly influence model output. For one of our datasets, *HotpotQA*, SPEX provides interactions that align with human annotations. Finally, we use our model-agnostic approach to generate explanations to demonstrate abstract reasoning in closed-source LLMs (*GPT-4o mini*) and compositional reasoning in vision-language models.
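A tiny illustration of the Fourier view of interactions that SPEX builds on: for a toy function of n=4 binary features driven by a single pairwise interaction, a dense Walsh-Hadamard transform recovers exactly one nonzero coefficient (SPEX itself uses sparse Fourier transforms and channel decoding to scale far beyond this brute-force enumeration; the function here is a made-up stand-in for a model output):

```python
import numpy as np
from itertools import product

def walsh_hadamard(f_vals):
    """Normalized fast Walsh-Hadamard transform of a function tabulated on {+1,-1}^n."""
    a = np.asarray(f_vals, dtype=float).copy()
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            for j in range(i, i + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a / len(a)

n = 4
inputs = np.array(list(product([1, -1], repeat=n)))  # row k: feature 0 varies slowest
f = 2.0 * inputs[:, 0] * inputs[:, 1]  # sparse: one pairwise interaction, weight 2
coeffs = walsh_hadamard(f)
# The single nonzero coefficient sits at index 0b1100, i.e., the set {feature 0, feature 1}.
```

When the spectrum is sparse like this, far fewer than 2^n function evaluations suffice in principle, which is the structure sparse-transform methods exploit.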
Paperid:665
Authors:Jihwan Jeong · Xiaoyu Wang · Jingmin Wang · Scott Sanner · Pascal Poupart
Abstract:
Offline reinforcement learning (RL) is crucial when online exploration is costly or unsafe but often struggles with high epistemic uncertainty due to limited data. Existing methods rely on fixed conservative policies, restricting adaptivity and generalization. To address this, we propose Reflect-then-Plan (RefPlan), a novel doubly Bayesian offline model-based (MB) planning approach. RefPlan unifies uncertainty modeling and MB planning by recasting planning as Bayesian posterior estimation. At deployment, it updates a belief over environment dynamics using real-time observations, incorporating uncertainty into MB planning via marginalization. Empirical results on standard benchmarks show that RefPlan significantly improves the performance of conservative offline RL policies. In particular, RefPlan maintains robust performance under high epistemic uncertainty and limited data, while demonstrating resilience to changing environment dynamics, improving the flexibility, generalizability, and robustness of offline-learned policies.
Paperid:666
Authors:Weihan Li · Yule Wang · Chengrui Li · Anqi Wu
Abstract:
Understanding and modeling brain communications that capture dynamic interactions across multiple regions is fundamental to modern systems neuroscience, yet current methods struggle to find time-varying region-level communications or scale to large neural datasets with long recording durations. We present a novel framework using Markovian Gaussian Processes to learn brain communications with time-varying temporal delays from multi-region neural recordings, named Adaptive Delay Model (ADM). Our method combines Gaussian Processes with State Space Models and employs parallel scan inference algorithms, enabling efficient scaling to large datasets while identifying concurrent communication patterns that evolve over time. This time-varying approach captures how brain region interactions shift dynamically during cognitive processes. Validated on synthetic and multi-region neural recording datasets, our approach discovers both the directionality and temporal dynamics of neural communication. This work advances our understanding of distributed neural computation and provides a scalable tool for analyzing dynamic brain networks.
Paperid:667
Authors:Yuanwei Zhang · Fengmiao Bian · Xiaoqun Zhang · Jian-Feng Cai
Abstract:
Tensors play a crucial role in numerous scientific and engineering fields. This paper addresses the low-multilinear-rank tensor completion problem, a fundamental task in tensor-related applications. By exploiting the manifold structure inherent to the fixed-multilinear-rank tensor set, we introduce a simple yet highly effective preconditioned Riemannian metric and propose the Preconditioned Riemannian Gradient Descent (PRGD) algorithm. Compared to the standard Riemannian Gradient Descent (RGD), PRGD achieves faster convergence while maintaining the same order of per-iteration computational complexity. Theoretically, we provide the recovery guarantee for PRGD under near-optimal sampling complexity. Numerical results highlight the efficiency of PRGD, outperforming state-of-the-art methods on both synthetic data and real-world video inpainting tasks.
Paperid:668
Authors:Qiuxia Lin · Glory Rongyu CHEN · Kerui Gu · Angela Yao
Abstract:
This work highlights a semantics misalignment in 3D Human Pose Estimation. For the task of test-time adaptation, the misalignment manifests as over-smoothed and unguided predictions. The smoothing settles predictions towards some average pose. Furthermore, when there are occlusions or truncations, the adaptation becomes fully unguided. To this end, we pioneer the integration of a semantics-aware motion prior for the test-time adaptation of 3D pose estimation. We leverage video understanding and a well-structured motion-text space to adapt the model motion prediction to adhere to video semantics during test time. Additionally, we incorporate a missing 2D pose completion based on the motion-text similarity. The pose completion strengthens the motion prior's guidance for occlusions and truncations. Our method significantly improves state-of-the-art 3D human pose estimation TTA techniques, with a more than 12% decrease in PA-MPJPE on 3DPW and 3DHP.
Paperid:669
Authors:Zhuohao Yu · Weizheng Gu · Yidong Wang · Xingru Jiang · Zhengran Zeng · Jindong Wang · Wei Ye · Shikun Zhang
Abstract:
Large Language Models excel at code generation yet struggle with complex programming tasks that demand sophisticated reasoning. To bridge this gap, traditional process supervision relies on learned reward models requiring costly training data and suffering from reward misalignment, while outcome supervision fails for complex tasks needing coordinated intermediate steps. We introduce Outcome-Refining Process Supervision (ORPS), which unifies process and outcome supervision by leveraging executable verification: a tree-structured search framework generates strategic alternatives, profiles execution metrics, and scores candidates via self-critique mechanisms that integrate runtime feedback with reasoning. Experiments across 5 models and 3 benchmarks show consistent gains, with 26.9\% higher correctness and 42.2\% improved code efficiency. The results demonstrate that ORPS enables LLMs to overcome local optima in code generation, suggesting a promising direction for combining verifiable outcomes with structured reasoning to tackle complex challenges.
Paperid:670
Authors:Zhixian Xie · Haode Zhang · Yizhe Feng · Wanxin Jin
Abstract:
Reward design for reinforcement learning and optimal control agents is challenging. Preference-based alignment addresses this by enabling agents to learn rewards from ranked trajectory pairs provided by humans. However, existing methods often suffer from poor robustness to unknown false human preferences. In this work, we propose a robust and efficient reward alignment method based on a novel and geometrically interpretable perspective: hypothesis space batched cutting. Our method iteratively refines the reward hypothesis space through “cuts” based on batches of human preferences. Within each batch, human preferences, queried based on disagreement, are grouped using a voting function to determine the appropriate cut, ensuring a bounded human query complexity. To handle unknown erroneous preferences, we introduce a conservative cutting method within each batch, preventing erroneous human preferences from making overly aggressive cuts to the hypothesis space. This guarantees provable robustness against false preferences. We evaluate our method in a model predictive control setting across diverse tasks, including DM-Control, dexterous in-hand manipulation, and locomotion. The results demonstrate that our framework achieves comparable or superior performance to state-of-the-art methods in error-free settings while significantly outperforming existing methods when handling a high percentage of erroneous human preferences.
Paperid:671
Authors:Ruiqi Feng · Tailin Wu · Chenglei Yu · Wenhao Deng · Peiyan Hu
Abstract:
Flow matching has shown state-of-the-art performance in various generative tasks, ranging from image generation to decision-making, where guided generation is pivotal. However, the guidance of flow matching is more general than and thus substantially different from that of its predecessor, diffusion models. Therefore, the challenge in guidance for general flow matching remains largely underexplored. In this paper, we propose the first framework of general guidance for flow matching. From this framework, we derive a family of guidance techniques that can be applied to general flow matching. These include a new training-free asymptotically exact guidance, novel training losses for training-based guidance, and two classes of approximate guidance that cover classical gradient guidance methods as special cases. We theoretically investigate these different methods to give a practical guideline for choosing suitable methods in different scenarios. Experiments on synthetic datasets, image inverse problems, and offline reinforcement learning demonstrate the effectiveness of our proposed guidance methods and verify the correctness of our flow matching guidance framework. Code to reproduce the experiments can be found at https://anonymous.4open.science/r/GuidedFlow-BB15.
Paperid:672
Authors:Yiting He · Zhishuai Liu · Weixin Wang · Pan Xu
Abstract:
In off-dynamics reinforcement learning (RL), distribution shifts between training and deployment environments necessitate robust policy learning under uncertain dynamic changes, which can be formulated as a robust Markov decision process (RMDP). Unlike existing literature that often assumes access to pre-collected datasets with a good coverage of the deployment environment, we focus on the setting where only online interaction with the training environment is allowed. Online RMDPs require advanced exploration strategies to gather data that effectively estimates robust policies for deployment environments. This poses unique challenges due to distributional shifts compared to standard MDPs. We study two types of RMDPs addressing dynamic uncertainty differently: Constrained Robust Markov Decision Processes (CRMDPs) that optimize worst-case outcomes within a predefined set of dynamics and Regularized Robust Markov Decision Processes (RRMDPs) that use a soft penalty via regularization in their optimization. We introduce the supremal visitation ratio as a key metric for evaluating an agent's exploration efficiency in online RMDPs and show that when this ratio is unbounded, online learning becomes exponentially hard. We develop computationally efficient algorithms that achieve sublinear regret for both CRMDPs and RRMDPs and establish regret lower bounds demonstrating that our algorithm enjoys the optimal dependence on the supremal visitation ratio and the number of interaction episodes. We further conduct comprehensive numerical experiments in diverse environments to validate the robustness of our method.
Paperid:673
Authors:Dongwon Kim · Donghee Kim · Sung Kuk Shyn · Kwangsu Kim
Abstract:
Federated learning (FL) often struggles with generalization due to heterogeneous client data. Local models are prone to overfitting their local data distributions, and even transferable features can be distorted during aggregation. To address these challenges, we propose FedCONST, an approach that adaptively modulates update magnitudes based on the global model’s parameter strength. This prevents overemphasizing well-learned parameters while reinforcing underdeveloped ones. Specifically, FedCONST employs linear convex constraints to ensure training stability and preserve locally learned generalization capabilities during aggregation. A Gradient Signal-to-Noise Ratio (GSNR) analysis further validates FedCONST's effectiveness in enhancing feature transferability and robustness. As a result, FedCONST effectively aligns local and global objectives, mitigating overfitting and promoting stronger generalization across diverse FL environments, achieving state-of-the-art performance.
Paperid:674
Authors:Dulhan Jayalath · Gilad Landau · Brendan Shillingford · Mark Woolrich · ʻŌiwi Parker Jones
Abstract:
The past few years have seen remarkable progress in the decoding of speech from brain activity, primarily driven by large single-subject datasets. However, due to individual variation, such as anatomy, and differences in task design and scanning hardware, leveraging data across subjects and datasets remains challenging. In turn, the field has not benefited from the growing number of open neural data repositories to exploit large-scale deep learning. To address this, we develop neuroscience-informed self-supervised objectives, together with an architecture, for learning from heterogeneous brain recordings. Scaling to nearly 400 hours of MEG data and 900 subjects, our approach shows generalisation across participants, datasets, tasks, and even to novel subjects. It achieves improvements of 15-27% over state-of-the-art models and matches surgical decoding performance with non-invasive data. These advances unlock the potential for scaling speech decoding models beyond the current frontier.
Paperid:675
Authors:Zihan Tan · Suyuan Huang · Guancheng Wan · Wenke Huang · He Li · Mang Ye
Abstract:
Federated Graph Learning (FGL) combines the privacy-preserving capabilities of federated learning (FL) with the strong graph modeling capability of Graph Neural Networks (GNNs). Current research addresses subgraph-FL based on pure structures, neglecting the propagation of graph signals on the spatial and spectral domains of the structure. From a spatial perspective, subgraph-FL introduces edge disconnections between clients, leading to disruptions in label signals and a degradation in the class knowledge of the global GNN. From the spectral perspective, spectral heterogeneity causes inconsistencies in signal frequencies across clients, which makes local GNNs overfit the local signal propagation schemes. As a result, spectral client drifts occur, undermining global generalizability. To tackle these challenges, we propose a global knowledge repository to mitigate label signal disruption and a frequency alignment to address spectral client drifts. The combination of $\textbf{S}$patial and $\textbf{S}$pectral strategies forms our framework $S^2$FGL. Extensive experiments demonstrate the superiority of $S^2$FGL.
Paperid:676
Authors:Zhang Chengqiang · Yanghui Rao · Qing Li · Detian Zhang · Chunjiang Zhu
Abstract:
The isomorphism problem is a key challenge in both graph and hypergraph domains, crucial for applications like protein design, chemical pathways, and community detection. Hypergraph isomorphism, which models high-order relationships in real-world scenarios, remains underexplored compared to graph isomorphism. Current algorithms for hypergraphs, like the 1-dimensional generalized Weisfeiler-Lehman test (1-GWL), lag behind advancements in graph isomorphism tests, limiting most hypergraph neural networks to 1-GWL's expressive power. To address this, we propose the high-dimensional GWL (k-GWL), generalizing k-WL from graphs to hypergraphs. We prove that k-GWL reduces to k-WL for simple graphs, and thus develop a unified isomorphism method for both graphs and hypergraphs. We also establish a clear and complete understanding of the GWL expressivity hierarchy, showing that (k+1)-GWL is more expressive than k-GWL, with illustrative examples. Based on k-GWL, we develop a hypergraph neural network model named k-HNN with the improved expressive power of k-GWL, which achieves superior performance on real-world datasets, including a 6\% accuracy improvement on the Steam-Player dataset over the runner-up.
Paperid:677
Authors:Ian Magnusson · Tai Nguyen · David Heineman · Jena Hwang · Luca Soldaini · Akshita Bhagia · Jiacheng Liu · Dirk Groeneveld · Oyvind Tafjord · Noah Smith · Pang Wei Koh · Ben Bogin · Jesse Dodge
Abstract:
Because large language models are expensive to pretrain on different corpora, the ability to use smaller-scale experiments to decide on data is crucial for reducing costs. This relies on data that gives better performance at small scale also doing so at larger scale. We test which methods best detect early signal with controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering, up to 100B tokens and model sizes up to 1B parameters. We release models, data, and evaluations in our Data Differences over Scale (DataDos) Suite as the most extensive openly available sweep of data decisions over scales and random seeds. We find predictions based on experiments at single, rather than multiple, scales are most efficient. Even 150M models trained with < 2% of the compute of 1B targets correctly predict 80% of comparisons. As most corpora scale with similar steepness, few scaling trends “crossover” even spanning an order of magnitude of compute. However, extrapolating beyond our observations, we see that more trends cross as the span of compute grows. We also identify that, among 10 multiple-choice benchmarks, MMLU and ARC are highly predictable from just 150M-parameter experiments, while others like CSQA require greater compute.
Paperid:678
Authors:JianFeng Cai · Zhuozhi XIAN · Jiaxi Ying
Abstract:
We explore the single-spiked covariance model within the context of sparse principal component analysis (PCA), which aims to recover a sparse unit vector from noisy samples. From an information-theoretic perspective, $O(k \log p)$ observations are sufficient to recover a $k$-sparse $p$-dimensional vector $\mathbf{v}$. However, existing polynomial-time methods require at least $O(k^2)$ samples for successful recovery, highlighting a significant gap in sample efficiency. To bridge this gap, we introduce a novel thresholding-based algorithm that requires only $\Omega(k \log p)$ samples, provided the signal strength $\lambda = \Omega(||\mathbf{v}||_\infty^{-1})$. We also propose a two-stage nonconvex algorithm that further enhances estimation performance. This approach integrates our thresholding algorithm with truncated power iteration, achieving the minimax optimal rate of statistical error under the desired sample complexity. Numerical experiments validate the superior performance of our algorithms in terms of estimation accuracy and computational efficiency.
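To make the two-stage idea above concrete, here is a hedged sketch that combines a diagonal-thresholding initialization with truncated power iteration on a single-spiked sample covariance. The specific initialization rule, iteration count, and data model below are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def truncate(v, k):
    """Keep the k largest-magnitude entries of v, zero the rest, renormalize."""
    w = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    w[idx] = v[idx]
    return w / np.linalg.norm(w)

def sparse_pca(S, k, iters=100):
    """Truncated power iteration on a sample covariance S, initialized from
    the k largest diagonal entries (a simple thresholding-style start)."""
    v = np.zeros(S.shape[0])
    v[np.argsort(np.diag(S))[-k:]] = 1.0
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = truncate(S @ v, k)
    return v

# Single-spiked data: each sample is isotropic noise plus a spike along v_true.
rng = np.random.default_rng(0)
p, k, n = 100, 5, 2000
v_true = np.zeros(p)
v_true[:k] = 1 / np.sqrt(k)
X = rng.normal(size=(n, p)) + 2.0 * rng.normal(size=(n, 1)) * v_true
S = X.T @ X / n
v_hat = sparse_pca(S, k)
print(abs(v_true @ v_hat))  # close to 1 when support recovery succeeds
```

In this toy regime the spike is strong and the sample size generous; the paper's contribution concerns much tighter sample-complexity regimes than this sketch exercises.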
Paperid:679
Authors:Zeyu Tang · Zhenhao Chen · Loka Li · Xiangchen Song · Yunlong Deng · Yifan Shen · Guangyi Chen · Peter Spirtes · Kun Zhang
Abstract:
The autoregressive approach to text generation in large language models (LLMs), while widely used, is inherently suboptimal due to the lack of a built-in mechanism to perform refinement and/or correction of the generated content. In this paper, we consider optimality in terms of the joint probability over the generated response, when jointly considering all tokens at the same time. We theoretically characterize the potential deviation of the autoregressively generated response from its globally optimal counterpart that is of the same length. Our analysis suggests that we need to be cautious when noticeable uncertainty arises during text generation, which may signal the sub-optimality of the generation history. To address the pitfall of autoregressive text generation, we propose an approach that incorporates a sliding reflection window and a pausing criterion, such that refinement and generation can be carried out interchangeably as the decoding proceeds. Our selective refinement framework strikes a balance between efficiency and optimality, and our extensive experimental results demonstrate the effectiveness of our approach.
Paperid:680
Authors:Junze Deng · Qinhang Wu · Peizhong Ju · Sen Lin · Yingbin LIANG · Ness Shroff
Abstract:
Rehearsal-based methods have shown superior performance in addressing catastrophic forgetting in continual learning (CL) by storing and training on a subset of past data alongside new data in the current task. While such a concurrent rehearsal strategy is widely used, it remains unclear if this approach is always optimal. Inspired by human learning, where sequentially revisiting tasks helps mitigate forgetting, we explore whether sequential rehearsal can offer greater benefits for CL compared to standard concurrent rehearsal. To address this question, we conduct a theoretical analysis of rehearsal-based CL in overparameterized linear models, comparing two strategies: 1) Concurrent Rehearsal, where past and new data are trained together, and 2) Sequential Rehearsal, where new data is trained first, followed by revisiting past data sequentially. By explicitly characterizing forgetting and generalization error, we show that sequential rehearsal performs better when tasks are less similar. These insights further motivate a novel Hybrid Rehearsal method, which trains similar tasks concurrently and revisits dissimilar tasks sequentially. We characterize its forgetting and generalization performance, and our experiments with deep neural networks further confirm that the hybrid approach outperforms standard concurrent rehearsal. This work provides the first comprehensive theoretical analysis of rehearsal-based CL.
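The two rehearsal strategies contrasted above can be sketched procedurally in a toy linear-regression setting. Everything below (model, optimizer, noiseless full-rank data) is an illustrative assumption; in particular, this underparameterized toy does not reproduce the overparameterized forgetting regime the paper analyzes, it only shows the order of operations.

```python
import numpy as np

def gd(w, X, y, lr=0.05, epochs=2000):
    """Full-batch gradient descent on squared loss for a linear model."""
    for _ in range(epochs):
        w = w - lr * X.T @ (X @ w - y) / len(y)
    return w

rng = np.random.default_rng(1)
d = 20
w_star = rng.normal(size=d)
X_old, X_new = rng.normal(size=(100, d)), rng.normal(size=(100, d))
y_old, y_new = X_old @ w_star, X_new @ w_star

# Concurrent rehearsal: stored past data and new data are trained together.
w_con = gd(np.zeros(d), np.vstack([X_old, X_new]), np.concatenate([y_old, y_new]))

# Sequential rehearsal: train on the new task first, then revisit the past data.
w_seq = gd(gd(np.zeros(d), X_new, y_new), X_old, y_old)

# Forgetting proxy: error on the old task after all training is done.
print(np.mean((X_old @ w_con - y_old) ** 2), np.mean((X_old @ w_seq - y_old) ** 2))
```

With noiseless, full-rank data both schedules recover the shared solution; the paper's analysis characterizes when they differ.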
Paperid:681
Authors:Zhixiong Zhuang · Hui-Po Wang · Maria-Irina Nicolae · Mario Fritz
Abstract:
Model stealing poses a significant security risk in machine learning by enabling attackers to replicate a black-box model without access to its training data, thus jeopardizing intellectual property and exposing sensitive information. Recent methods that use pre-trained diffusion models for data synthesis improve efficiency and performance but rely heavily on manually crafted prompts, limiting automation and scalability, especially for attackers with little expertise. To assess the risks posed by open-source pre-trained models, we propose a more realistic threat model that eliminates the need for prompt design skills or knowledge of class names. In this context, we introduce Stealix, the first approach to perform model stealing without predefined prompts. Stealix uses two open-source pre-trained models to infer the victim model’s data distribution, and iteratively refines prompts through a genetic algorithm, progressively improving the precision and diversity of synthetic images. Our experimental results demonstrate that Stealix significantly outperforms other methods, even those with access to class names or fine-grained prompts, while operating under the same query budget. These findings highlight the scalability of our approach and suggest that the risks posed by pre-trained generative models in model stealing may be greater than previously recognized.
Paperid:682
Authors:William English · Dominic Simon · Sumit Jha · Rickard Ewetz
Abstract:
Translating natural language (NL) into a formal language such as temporal logic (TL) is integral for human communication with robots and autonomous systems. State-of-the-art approaches decompose the task into an atomic proposition (AP) grounding phase and a translation phase. However, existing methods struggle with accurate grounding, the existence of co-references, and learning from limited data. In this paper, we propose a framework for NL to TL translation called Grammar Forced Translation (GraFT). The framework is based on the observation that previous work solves both the grounding and translation steps by letting a language model iteratively predict tokens from its full vocabulary. In contrast, GraFT reduces the complexity of both tasks by restricting the set of valid output tokens from the full vocabulary to only a handful in each step. The solution space reduction is obtained by exploiting the unique properties of each problem. We also provide a theoretical justification for why the solution space reduction leads to more efficient learning. We evaluate the effectiveness of GraFT using the CW, GLTL, and Navi benchmarks. Compared with state-of-the-art translation approaches, it can be observed that GraFT improves the end-to-end translation accuracy by 5.49% and out-of-domain translation accuracy by 14.06% on average.
Paperid:683
Authors:Ibrahim Alabdulmohsin · Andreas Steiner
Abstract:
Language exhibits a fractal structure in its information-theoretic complexity (i.e., bits per token), with self-similarity across scales and long-range dependence (LRD). In this work, we investigate whether large language models (LLMs) can replicate such fractal characteristics and identify conditions, such as temperature setting and prompting method, under which they may fail. Moreover, we find that the fractal parameters observed in natural language are contained within a narrow range, whereas those of LLMs' output vary widely, suggesting that fractal parameters might prove helpful in detecting a non-trivial portion of LLM-generated texts. Notably, these findings, and many others reported in this work, are robust to the choice of architecture, e.g., Gemini 1.0 Pro, Mistral-7B and Gemma-2B. We also release a dataset comprising over 240,000 articles generated by various LLMs (both pretrained and instruction-tuned) with different decoding temperatures and prompting methods, along with their corresponding human-generated texts. We hope that this work highlights the complex interplay between fractal properties, prompting, and statistical mimicry in LLMs, offering insights for generating, evaluating and detecting synthetic texts.
Paperid:684
Authors:Patara Trirat · Wonyong Jeong · Sung Ju Hwang
Abstract:
Automated machine learning (AutoML) accelerates AI development by automating tasks in the development pipeline, such as optimal model search and hyperparameter tuning. Existing AutoML systems often require technical expertise to set up complex tools, which is in general time-consuming and requires a large amount of human effort. Therefore, recent works have started exploiting large language models (LLM) to lessen such burden and increase the usability of AutoML frameworks via a natural language interface, allowing non-expert users to build their data-driven solutions. These methods, however, are usually designed only for a particular process in the AI development pipeline and do not efficiently use the inherent capacity of the LLMs. This paper proposes AutoML-Agent, a novel multi-agent framework tailored for full-pipeline AutoML, i.e., from data retrieval to model deployment. AutoML-Agent takes the user's task descriptions, facilitates collaboration between specialized LLM agents, and delivers deployment-ready models. Unlike existing work, instead of devising a single plan, we introduce a retrieval-augmented planning strategy to enhance exploration in the search for more optimal plans. We also decompose each plan into sub-tasks (e.g., data preprocessing and neural network design), each of which is solved by a specialized agent built via prompting and executed in parallel, making the search process more efficient. Moreover, we propose a multi-stage verification to verify executed results and guide the code generation LLM in implementing successful solutions. Extensive experiments on seven downstream tasks using fourteen datasets show that AutoML-Agent achieves a higher success rate in automating the full AutoML process, yielding systems with good performance across diverse domains.
Paperid:685
Authors:Dilfira Kudrat · Zongxia Xie · Yanru Sun · Tianyu Jia · Qinghua Hu
Abstract:
Time-series forecasting has gained significant attention in machine learning due to its crucial role in various domains. However, most existing forecasting models rely heavily on point-wise loss functions like Mean Square Error, which treat each time step independently and neglect the structural dependencies inherent in time series data, making it challenging to capture complex temporal patterns accurately. To address these challenges, we propose a novel Patch-wise Structural (PS) loss, designed to enhance structural alignment by comparing time series at the patch level. Through leveraging local statistical properties, such as correlation, variance, and mean, PS loss captures nuanced structural discrepancies overlooked by traditional point-wise losses. Furthermore, it integrates seamlessly with point-wise loss, simultaneously addressing local structural inconsistencies and individual time-step errors. PS loss establishes a novel benchmark for accurately modeling complex time series data and provides a new perspective on time series loss function design. Extensive experiments demonstrate that PS loss significantly improves the performance of state-of-the-art models across diverse real-world datasets. The data and code are publicly available at: \url{https://anonymous.4open.science/r/PS_Loss/}.
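A minimal sketch in the spirit of the abstract above: a patch-wise structural term comparing per-patch mean, variance, and correlation, added to a point-wise term. The patch length, weighting, and exact combination of terms here are assumptions for illustration, not the paper's formulation (the linked repository holds the actual code).

```python
import numpy as np

def ps_loss(pred, target, patch_len=8, alpha=1.0, eps=1e-8):
    """Illustrative patch-wise structural loss: point-wise MSE plus per-patch
    mean, variance, and correlation discrepancies between pred and target."""
    B, T = pred.shape
    n = T // patch_len  # split each series into non-overlapping patches
    p = pred[:, : n * patch_len].reshape(B, n, patch_len)
    t = target[:, : n * patch_len].reshape(B, n, patch_len)

    mean_term = np.abs(p.mean(-1) - t.mean(-1)).mean()
    var_term = np.abs(p.var(-1) - t.var(-1)).mean()
    # Per-patch Pearson-style correlation between prediction and target.
    pc = p - p.mean(-1, keepdims=True)
    tc = t - t.mean(-1, keepdims=True)
    corr = (pc * tc).sum(-1) / (np.linalg.norm(pc, axis=-1) * np.linalg.norm(tc, axis=-1) + eps)
    corr_term = (1.0 - corr).mean()

    pointwise = ((pred - target) ** 2).mean()
    return pointwise + alpha * (mean_term + var_term + corr_term)

rng = np.random.default_rng(0)
pred = rng.normal(size=(4, 96))
print(ps_loss(pred, pred))  # structural terms vanish for identical series
```

The structural terms penalize predictions that match point-wise on average but miss local level, scale, or shape, which a plain MSE cannot distinguish.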
Paperid:686
Authors:Yue Jiang · Yile Chen · Xiucheng Li · Qin Chao · SHUAI LIU · Gao Cong
Abstract:
Fundamental to time series forecasting methodologies is accurately modeling complex interdependencies and shared patterns within time series data. Recent advancements, such as Spatio-Temporal Graph Neural Networks (STGNNs) and Time Series Foundation Models, have shown promising results by effectively capturing intricate spatial and temporal dependencies across diverse real-world datasets. However, these models typically require substantial training data and often struggle when the available data is limited. In response, we propose a framework named Few-shot Spatio-Temporal Large Language Models (FSTLLM), aimed at enhancing model robustness and predictive performance in scenarios with limited training data. In particular, FSTLLM leverages the contextual knowledge embedded in Large Language Models (LLMs) and provides reasonable and accurate predictions. In addition, existing forecasting models can be seamlessly integrated into FSTLLM to improve their predictive capabilities. Experimental results on two real-world datasets demonstrate the adaptability and consistently superior performance of FSTLLM over major baseline models by a significant margin.
Paperid:687
Authors:Haozheng Luo · Chenghao Qiu · Maojiang Su · Zhihan Zhou · Jerry Yao-Chieh Hu · Zoe Mehta · Guo Ye · Han Liu
Abstract:
To address the challenge of scarce computational resources in genomic modeling, we introduce GERM, a genomic foundation model optimized for accessibility and adaptability. GERM improves upon models like DNABERT-2 by eliminating outliers that hinder low-rank adaptation and post-training quantization, enhancing both efficiency and robustness. We replace the vanilla attention layer with an outlier-free mechanism inspired by associative memory models. By removing outliers during both pre-training and fine-tuning, this approach accelerates adaptation, reduces computational costs, and enhances quantization robustness within acceptable loss margins. Additionally, we propose GERM-T, a strategy that employs small-step continual learning within the outlier-free framework, leveraging original checkpoints to avoid retraining from scratch. Empirically, GERM improves fine-tuning performance by 37.98% and quantization by 64.34% over the baseline model. It also reduces average kurtosis by 92.14% and maximum infinity norm by 82.77%. Compared to leading methods, GERM consistently delivers superior performance, offering a practical solution for genomic modeling in resource-constrained settings.
Paperid:688
Authors:Adrian Rodriguez-Munoz · Manel Baradad · Phillip Isola · Antonio Torralba
Abstract:
We train representation models with procedural data only, and apply them on visual similarity, classification, and semantic segmentation tasks without further training by using visual memory, an explicit database of reference image embeddings. Unlike prior work on visual memory, our approach achieves full compartmentalization with respect to all real-world images while retaining strong performance. Compared to a model trained on Places, our procedural model performs within 1\% on NIGHTS visual similarity, outperforms by 8\% and 15\% on CUB200 and Flowers102 fine-grained classification, and is within 10\% on ImageNet-1K classification. It also demonstrates strong zero-shot segmentation, achieving an $R^2$ on COCO within 10\% of the models trained on real data. Finally, we analyze procedural versus real data models, showing that parts of the same object have dissimilar representations in procedural models, resulting in incorrect searches in memory and explaining the remaining performance gap.
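The visual-memory mechanism described above amounts to retrieval from an explicit database of reference embeddings rather than a learned classifier head. A hedged nearest-neighbor sketch follows; the embeddings, similarity metric, neighbor count, and vote rule are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def classify_with_memory(query_emb, memory_embs, memory_labels, k=5):
    """Label a query embedding by majority vote over its k nearest neighbors
    (cosine similarity) in an explicit memory of reference embeddings."""
    q = query_emb / np.linalg.norm(query_emb)
    M = memory_embs / np.linalg.norm(memory_embs, axis=1, keepdims=True)
    sims = M @ q                       # cosine similarity to every memory entry
    top = np.argsort(sims)[-k:]        # indices of the k most similar entries
    return int(np.bincount(memory_labels[top]).argmax())

# Toy memory: two well-separated classes in a 16-dimensional embedding space.
rng = np.random.default_rng(0)
mem = np.vstack([rng.normal(0, 0.1, (50, 16)) + 1.0,
                 rng.normal(0, 0.1, (50, 16)) - 1.0])
labels = np.array([0] * 50 + [1] * 50)
print(classify_with_memory(np.ones(16), mem, labels))   # 0
print(classify_with_memory(-np.ones(16), mem, labels))  # 1
```

Because the memory is an explicit database, removing a reference image removes its influence entirely, which is the compartmentalization property the abstract emphasizes.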
Paperid:689
Authors:Bohan Lyu · Yadi Cao · Duncan Watson-Parris · Leon Bergen · Taylor Berg-Kirkpatrick · Rose Yu
Abstract:
Large Language Models (LLMs) demonstrate promising capabilities in solving simple scientific problems but, even with domain-specific fine-tuning, often produce hallucinations for complex ones. While integrating LLMs with tools can mitigate this reliability issue, models fine-tuned only on tool usage often over-rely on them, incurring unnecessary costs from resource-intensive scientific tools even for simpler problems. Inspired by how human experts assess the complexity of a problem before choosing a solution, we propose a novel two-component fine-tuning method, Adapting While Learning (AWL). In the first component, World Knowledge Learning (WKL), LLMs internalize scientific knowledge by learning from tool-generated solutions. In the second component, Tool Usage Adaptation (TUA), we classify questions as easy or hard based on the WKL-trained model’s accuracy, and train it to maintain direct reasoning for simple problems while switching to tools for challenging ones. We validate our method on 6 scientific benchmark datasets in climate science, epidemiology, and mathematics. Compared to the base 8B model, our trained models achieve 28.27\% higher answer accuracy and 13.76\% better tool usage accuracy, even surpassing state-of-the-art models including GPT-4o and Claude-3.5 on 4 custom-created datasets.
Paperid:690
Authors:Shangzhe Li · Zhiao Huang · Hao Su
Abstract:
Imitation learning (IL) enables agents to acquire skills directly from expert demonstrations, providing a compelling alternative to reinforcement learning. However, prior online IL approaches struggle with complex tasks characterized by high-dimensional inputs and complex dynamics. In this work, we propose a novel approach to online imitation learning that leverages reward-free world models. Our method learns environmental dynamics entirely in latent spaces without reconstruction, enabling efficient and accurate modeling. We adopt the inverse soft-Q learning objective, reformulating the optimization process in the Q-policy space to mitigate the instability associated with traditional optimization in the reward-policy space. By employing a learned latent dynamics model and planning for control, our approach consistently achieves stable, expert-level performance in tasks with high-dimensional observation or action spaces and intricate dynamics. We evaluate our method on a diverse set of benchmarks, including DMControl, MyoSuite, and ManiSkill2, demonstrating superior empirical performance compared to existing approaches.
Paperid:691
Authors:Ananth Raman · Vinod Raman
Abstract:
We continue to study the learning-theoretic foundations of generation by extending the results from Kleinberg and Mullainathan [2024] and Li et al. [2024] to account for noisy example streams. In the noiseless setting of Kleinberg and Mullainathan [2024] and Li et al. [2024], an adversary picks a hypothesis from a binary hypothesis class and provides a generator with a sequence of its positive examples. The goal of the generator is to eventually output new, unseen positive examples. In the noisy setting, an adversary still picks a hypothesis and a sequence of its positive examples. But, before presenting the stream to the generator, the adversary inserts a finite number of negative examples. Unaware of which examples are noisy, the goal of the generator is to still eventually output new, unseen positive examples. In this paper, we provide necessary and sufficient conditions for when a binary hypothesis class can be noisily generatable. We provide such conditions with respect to various constraints on the number of distinct examples that need to be seen before perfect generation of positive examples. Interestingly, for finite and countable classes we show that generatability is largely unaffected by the presence of a finite number of noisy examples.
Paperid:692
Authors:Weizhong Huang · Yuxin Zhang · Xiawu Zheng · Fei Chao · Rongrong Ji
Abstract:
In this paper, we address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) through a theoretical perspective. Specifically, we identify a critical issue of **"reconstruction error explosion"** in existing LLMs sparsification methods. This refers to the cumulative effect of reconstruction errors throughout the sparsification process, where errors from earlier layers propagate and amplify in subsequent layers. As a result, the overall reconstruction error increases significantly, leading to a substantial degradation in model performance. Through theoretical analysis, we derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue. Our method uses a monotonically increasing arithmetic progression, reducing the process of determining sparsity rates for multiple layers to the determination of a single common difference hyperparameter. Remarkably, this allows for the optimal layer-wise sparsity rates to be identified with just a few trials. Both our theoretical analysis and experimental results demonstrate that this sparsity allocation scheme is near optimal. Extensive experiments show that our method significantly improves the performance of sparse LLMs across various architectures, outperforming existing layer-wise sparsity methods. Furthermore, it enhances the performance of various compression techniques and is applicable to vision and multimodal models. Notably, our method achieves a reduction of 52.10 in perplexity for the 70% sparse LLaMA2-7B model obtained via Wanda, improves average zero-shot accuracy by 10.50%, and delivers speedups of 2.63$\times$ and 2.23$\times$ on CPU and GPU, respectively. Code is available in the supplementary material.
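The arithmetic-progression allocation this abstract describes reduces the whole layer-wise schedule to one common-difference hyperparameter. A minimal sketch of that idea (a hypothetical parameterization for illustration, not the paper's actual code):

```python
def layerwise_sparsity(num_layers, target_sparsity, common_diff):
    """Assign each layer a sparsity rate so the rates form a monotonically
    increasing arithmetic progression whose mean equals target_sparsity
    (earlier layers kept denser, mitigating error propagation)."""
    center = (num_layers - 1) / 2.0
    rates = [target_sparsity + common_diff * (i - center) for i in range(num_layers)]
    if not all(0.0 <= r <= 1.0 for r in rates):
        raise ValueError("common_diff too large for this target sparsity")
    return rates

rates = layerwise_sparsity(num_layers=8, target_sparsity=0.7, common_diff=0.02)
# the mean rate stays at the overall 70% target; later layers are sparser
```

Tuning then amounts to a few trials over `common_diff` alone, rather than a joint search over all per-layer rates.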
Paperid:693
Authors:Farzad Farhadzadeh · Debasmit Das · Shubhankar Borse · Fatih Porikli
Abstract:
We introduce ProLoRA, enabling zero-shot adaptation of parameter-efficient fine-tuning in text-to-image diffusion models. ProLoRA transfers pre-trained low-rank adjustments (e.g., LoRA) from a source to a target model without additional training data. This overcomes the limitations of traditional methods that require retraining when switching base models, often challenging due to data constraints. ProLoRA achieves this via projection of source adjustments into the target model's weight space, leveraging subspace and null space similarities and selectively targeting aligned layers. Evaluations on established text-to-image models demonstrate successful knowledge transfer and comparable performance without retraining.
Paperid:694
Authors:Zhongyang Li · Ziyue Li · Tianyi Zhou
Abstract:
In large multimodal models (LMMs), the perception of non-language modalities (e.g., visual representations) is usually not on par with the large language models (LLMs)' powerful reasoning capabilities, deterring LMMs' performance on challenging downstream tasks. This weakness has been recently mitigated by replacing the vision encoder with a mixture-of-experts (MoE), which provides rich, multi-granularity, and diverse representations required by different downstream tasks. The performance of multimodal MoE largely depends on its router, which reweights and mixes the representations of different experts for each input. However, we find that the end-to-end trained router does not always produce the optimal routing weights for every test sample. To bridge the gap, we propose a novel and efficient method ''Re-Routing in Test-Time (R2-T2)'' that locally optimizes the vector of routing weights in test-time by moving it toward those vectors of the correctly predicted samples in a neighborhood of the test sample. We propose three R2-T2 strategies with different optimization objectives and neighbor-search spaces. R2-T2 consistently and significantly improves state-of-the-art LMMs' performance on challenging multimodal benchmarks of diverse tasks, without training any parameters in the base model. Our code can be accessed here.
Paperid:695
Authors:Zhihai Wang · Jie Wang · Jilai Pan · Xilin Xia · Huiling Zhen · Mingxuan Yuan · Jianye Hao · Feng Wu
Abstract:
Tree-search-based reasoning methods have significantly enhanced the reasoning capability of large language models (LLMs) by facilitating the exploration of multiple intermediate reasoning steps, i.e., thoughts. However, these methods suffer from substantial inference latency, as they have to generate numerous reasoning thoughts, severely limiting LLM applicability. To address this challenge, we propose a novel Speculative Search (SpecSearch) framework that significantly accelerates LLM reasoning by optimizing thought generation. Specifically, SpecSearch utilizes a small model to strategically collaborate with a large model at both thought and token levels, efficiently generating high-quality reasoning thoughts. The major pillar of SpecSearch is a novel quality-preserving rejection mechanism, which effectively filters out thoughts whose quality falls below that of the large model's outputs. Moreover, we show that SpecSearch preserves comparable reasoning quality to the large model. Experiments on both the Qwen and Llama models demonstrate that SpecSearch significantly outperforms state-of-the-art approaches, achieving up to 2.12$\times$ speedup with comparable reasoning quality.
Paperid:696
Authors:Philipp Misof · Pan Kessel · Jan Gerken
Abstract:
Little is known about the training dynamics of equivariant neural networks, in particular how it compares to data augmented training of their non-equivariant counterparts. Recently, neural tangent kernels (NTKs) have emerged as a powerful tool to analytically study the training dynamics of wide neural networks. In this work, we take an important step towards a theoretical understanding of training dynamics of equivariant models by deriving neural tangent kernels for a broad class of equivariant architectures based on group convolutions. As a demonstration of the capabilities of our framework, we show an interesting relationship between data augmentation and group convolutional networks. Specifically, we prove that they share the same expected prediction at all training times and even off-manifold. In this sense, they have the same training dynamics. We demonstrate in numerical experiments that this still holds approximately for finite-width ensembles. By implementing equivariant NTKs for roto-translations in the plane ($G=C_{n}\ltimes\mathbb{R}^{2}$) and 3d rotations ($G=\mathrm{SO}(3)$), we show that equivariant NTKs outperform their non-equivariant counterparts as kernel predictors for histological image classification and quantum mechanical property prediction.
Paperid:697
Authors:Jiawei Tang · Yuheng Jia
Abstract:
Label distribution learning (LDL) is an effective method to predict the relative label description degree (a.k.a. label distribution) of a sample. However, the label distribution is not a complete representation of an instance because it overlooks the absolute intensity of each label. Specifically, it's impossible to obtain the total description degree of hidden labels that are not in the label space, which leads to the loss of information and confusion in instances. To solve the above problem, we come up with a new concept named background concentration to serve as the absolute description degree term of the label distribution and introduce it into the LDL process, forming the improved paradigm of concentration distribution learning. Moreover, we propose a novel model by probabilistic methods and neural networks to learn label distributions and background concentrations from existing LDL datasets. Extensive experiments prove that the proposed approach is able to extract background concentrations from label distributions while producing more accurate prediction results than the state-of-the-art LDL methods. The code is available in the supplementary material.
Paperid:698
Authors:Tingyu Zhu · Haoyu Liu · Ziyu Wang · Zhimin Jiang · Zeyu Zheng
Abstract:
Developing generative models to create or conditionally create symbolic music presents unique challenges due to the combination of limited data availability and the need for high precision in note pitch. To address these challenges, we introduce an efficient Fine-Grained Guidance (FGG) approach within diffusion models. FGG guides the diffusion models to generate music that aligns more closely with the control and intent of expert composers, which is critical to improve the accuracy, listenability, and quality of generated music. This approach empowers diffusion models to excel in advanced applications such as improvisation and interactive music creation. We derive theoretical characterizations for both the challenges in symbolic music generation and the effects of the FGG approach. We provide numerical experiments and subjective evaluation to demonstrate the effectiveness of our approach. We have published a demo page to showcase performances, as one of the first in the symbolic music literature's demo pages that enables real-time interactive generation.
Paperid:699
Authors:Xiaowei Chi · Hengyuan Zhang · Chun-Kai Fan · Xingqun Qi · Rongyu Zhang · Anthony Chen · Chi-Min Chan · Wei Xue · Qifeng Liu · Shanghang Zhang · Yike Guo
Abstract:
Video generation models have made significant progress in simulating future states, showcasing their potential as world simulators in embodied scenarios. However, existing models often lack robust understanding, limiting their ability to perform multi-step predictions or handle Out-of-Distribution (OOD) scenarios. To address this challenge, we propose the Reflection of Generation (RoG), a set of intermediate reasoning strategies designed to enhance video prediction. It leverages the complementary strengths of pre-trained vision-language and video generation models, enabling them to function as a world model in embodied scenarios. To support RoG, we introduce Embodied Video Anticipation Benchmark (EVA-Bench), a comprehensive benchmark that evaluates embodied world models across diverse tasks and scenarios, utilizing both in-domain and OOD datasets. Building on this foundation, we devise a world model, Embodied Video Anticipator (EVA), that follows a multi-stage training paradigm to generate high-fidelity video frames and apply an autoregressive strategy to enable adaptive generalization for longer video sequences. Extensive experiments demonstrate the efficacy of EVA in various downstream tasks like video generation and robotics, thereby paving the way for large-scale pre-trained models in real-world video prediction applications. The video demos are available at \hyperlink{https://sites.google.com/view/icml-eva}{https://sites.google.com/view/icml-eva}.
Paperid:700
Authors:Xinyue Zeng · Haohui Wang · Junhong Lin · Jun Wu · Tyler Cody · Dawei Zhou
Abstract:
The proliferation of open-sourced Large Language Models (LLMs) and diverse downstream tasks necessitates efficient model selection, given the impracticality of fine-tuning all candidates due to computational constraints. Despite the recent advances in LLM selection, a fundamental research question largely remains nascent: how can we model the dynamic behaviors of LLMs during fine-tuning, thereby enhancing our understanding of their generalization performance across diverse downstream tasks? In this work, we propose a novel theoretical framework that provides a proper lens to assess the generalization capabilities of LLMs, thereby enabling accurate and efficient LLM selection for downstream applications. In particular, we first derive a Hessian-based PAC-Bayes generalization bound that unveils fine-tuning dynamics of LLMs and then introduce LensLLM, a Neural Tangent Kernel (NTK)-based Rectified Scaling Model that enables accurate performance predictions across diverse tasks while maintaining computational efficiency. Extensive empirical results on 3 large-scale benchmarks demonstrate that our model achieves up to 91.1% accuracy and reduces up to 88.5% computational cost in LLM selection, outperforming 5 state-of-the-art methods. We open-source our proposed LensLLM model and corresponding results at this anonymous link: https://anonymous.4open.science/r/LENSLLM-3B1E/.
Paperid:701
Authors:Xuelin Shen · Jiayin Xu · Kangsheng Yin · Wenhan Yang
Abstract:
The rapid development of vision-language pretrained (VLP) models has raised significant concerns about user privacy and copyright, as these models can easily identify the semantic content of cloud-stored images or retrieve corresponding images through textual descriptions. Under this circumstance, this paper makes the first attempt to protect users' privacy and copyright from the perspective of image compression. Specifically, we propose a flexible coding paradigm capable of generating bitstreams with polymorphic representations. Under the default setting, the bitstream would be decoded into an encrypted version that maintains satisfactory perceptual quality but cannot be accessed by common VLP models (e.g., CLIP and BLIP). Meanwhile, with provided personalized global keywords as an additional condition, the proposed scheme can reconstruct a full version of the image identical to those decoded by common image codecs. This paper further incorporates an adaptive multi-objective optimization strategy to obtain improved encrypting performance and perceptual quality simultaneously within a single training process. The proposed scheme is plug-and-play and can be seamlessly integrated into most existing learned image compression (LIC) models. Extensive experiments demonstrate that the proposed scheme effectively misleads VLP-based applications, including text–image retrieval and image description, as well as common machine vision tasks such as image classification and object detection, while maintaining the same rate–distortion performance as baseline LIC models.
Paperid:702
Authors:Yaoyang Liu · Junlin Li · Yinjun Wu · Zhen Chen
Abstract:
Although Multi-Vector Retrieval (MVR) has achieved the state of the art on many information retrieval (IR) tasks, its performance highly depends on how to decompose queries into smaller pieces, say phrases or tokens. However, optimizing query decomposition for MVR performance is not end-to-end differentiable. Even worse, jointly solving this problem and training the downstream retrieval-based systems, say RAG systems, could be highly inefficient. To overcome these challenges, we propose Performance-Oriented Query Decomposer (POQD), a novel query decomposition framework for MVR. POQD leverages one LLM for query decomposition and searches the optimal prompt with an LLM-based optimizer. We further propose an end-to-end training algorithm to alternatively optimize the prompt for query decomposition and the downstream models. This algorithm can achieve superior MVR performance at a reasonable training cost as our theoretical analysis suggests. POQD can be integrated seamlessly into arbitrary retrieval-based systems such as Retrieval-Augmented Generation (RAG) systems. Extensive empirical studies on representative RAG-based QA tasks show that POQD outperforms existing query decomposition strategies in both retrieval performance and end-to-end QA accuracy.
Paperid:703
Authors:Yinghao Li · Rithesh Kumar · Zeyu Jin
Abstract:
Diffusion models have demonstrated significant potential in speech synthesis tasks, including text-to-speech (TTS) and voice cloning. However, their iterative denoising processes are computationally intensive, and previous distillation attempts have shown consistent quality degradation. Moreover, existing TTS approaches are limited by non-differentiable components or iterative sampling that prevent true end-to-end optimization with perceptual metrics. We introduce DMOSpeech, a distilled diffusion-based TTS model that uniquely achieves both faster inference and superior performance compared to its teacher model. By enabling direct gradient pathways to all model components, we demonstrate the first successful end-to-end optimization of differentiable metrics in TTS, incorporating Connectionist Temporal Classification (CTC) loss and Speaker Verification (SV) loss. Our comprehensive experiments, validated through extensive human evaluation, show significant improvements in naturalness, intelligibility, and speaker similarity while reducing inference time by orders of magnitude. This work establishes a new framework for aligning speech synthesis with human auditory preferences through direct metric optimization. The audio samples are available at https://dmospeech.github.io/demo
Paperid:704
Authors:Lahav Dabah · Tom Tirer
Abstract:
In many classification applications, the prediction of a deep neural network (DNN) based classifier needs to be accompanied by some confidence indication. Two popular approaches for that aim are: 1) Calibration: modifies the classifier's softmax values such that the maximal value better estimates the correctness probability; and 2) Conformal Prediction (CP): produces a prediction set of candidate labels that contains the true label with a user-specified probability, guaranteeing marginal coverage but not, e.g., per class coverage. In practice, both types of indications are desirable, yet, so far the interplay between them has not been investigated. Focusing on the ubiquitous Temperature Scaling (TS) calibration, we start this paper with an extensive empirical study of its effect on prominent CP methods. We show that while TS calibration improves the class-conditional coverage of adaptive CP methods, surprisingly, it negatively affects their prediction set sizes. Motivated by this behavior, we explore the effect of TS on CP beyond its calibration application and reveal an intriguing trend under which it allows to trade prediction set size and conditional coverage of adaptive CP methods. Then, we establish a mathematical theory that explains the entire non-monotonic trend. Finally, based on our experiments and theory, we offer simple guidelines for practitioners to effectively combine adaptive CP with calibration.
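The interaction this abstract studies can be seen in a toy example: raising the softmax temperature flattens the probabilities, which inflates an adaptive prediction set. The sketch below (an illustrative simplification, not the paper's code; the threshold `tau` and the greedy APS-style set construction are assumptions) shows this for a single example.

```python
import numpy as np

def softmax(logits, T):
    """Temperature-scaled softmax: higher T flattens the distribution."""
    z = logits / T
    z = z - z.max()  # for numerical stability
    p = np.exp(z)
    return p / p.sum()

def aps_set(probs, tau):
    """APS-style adaptive set: add labels in decreasing probability
    until the cumulative mass reaches the threshold tau."""
    order = np.argsort(-probs)
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, tau)) + 1
    return set(order[:k].tolist())

logits = np.array([3.0, 1.5, 0.5, -1.0])
for T in (0.5, 1.0, 2.0):  # flatter probabilities -> larger sets
    print(T, sorted(aps_set(softmax(logits, T), tau=0.9)))
```

This only illustrates the size side of the trade-off; the paper's contribution is the empirical study and the theory explaining the full non-monotonic trend across calibrated models.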
Paperid:705
Authors:Marina Mancoridis · Keyon Vafa · Bec Weeks · Sendhil Mullainathan
Abstract:
This paper documents an overlooked yet pervasive category of failure in large language models (LLMs): potemkin understanding. Potemkins are cases where a model’s misunderstanding of a concept does not align with the way humans would have misconceived it. To quantify this phenomenon, we develop a framework that measures potemkin understanding through the disconnect between a model’s ability to explain and use concepts. Through a series of experiments spanning diverse domains—literary techniques, game theory, and psychological biases—we demonstrate the ubiquity of potemkins. Despite achieving 97.7% accuracy in explaining concepts, LLMs display only 67.9% accuracy in using them, with potemkins consistently appearing across models, tasks, and domains. We develop self-coherence tests to show these failures represent not merely incorrect learning but a deeper absence of concept coherence, while also finding that existing benchmarks do not adequately test for potemkins.
Paperid:706
Authors:Youran Dong · Junfeng Yang · Wei Yao · Jin Zhang
Abstract:
Bilevel optimization is a powerful tool for many machine learning problems, such as hyperparameter optimization and meta-learning. Estimating hypergradients (also known as implicit gradients) is crucial for developing gradient-based methods for bilevel optimization. In this work, we propose a computationally efficient technique for incorporating curvature information into the approximation of hypergradients and present a novel algorithmic framework based on the resulting enhanced hypergradient computation. We provide convergence rate guarantees for the proposed framework in both deterministic and stochastic scenarios, particularly showing improved computational complexity over popular gradient-based methods in the deterministic setting. This improvement in complexity arises from a careful exploitation of the hypergradient structure and the inexact Newton method. In addition to the theoretical speedup, numerical experiments demonstrate the significant practical performance benefits of incorporating curvature information.
Paperid:707
Authors:Jiazhong Cen · Xudong Zhou · Jiemin Fang · Changsong Wen · Lingxi Xie · xiaopeng zhang · Wei Shen · Qi Tian
Abstract:
Recent advancements in 3D Gaussian Splatting (3DGS) enable high-quality 3D scene reconstruction from RGB images. Many studies extend this paradigm for language-driven open-vocabulary scene understanding. However, most of them simply project 2D semantic features onto 3D Gaussians and overlook a fundamental gap between 2D and 3D understanding: a 3D object may exhibit various semantics from different viewpoints—a phenomenon we term view-dependent semantics. To address this challenge, we propose LaGa (Language Gaussians), which establishes cross-view semantic connections by decomposing the 3D scene into objects. Then, it constructs view-aggregated semantic representations by clustering semantic descriptors and reweighting them based on multi-view semantics. Extensive experiments demonstrate that LaGa effectively captures key information from view-dependent semantics, enabling a more comprehensive understanding of 3D scenes. Notably, under the same settings, LaGa achieves a significant improvement of +18.7% mIoU over the previous SOTA on the LERF-OVS dataset. Our code will be released.
Paperid:708
Authors:Hanshi Sun · Li-Wen Chang · Wenlei Bao · Size Zheng · Ningxin Zheng · Xin Liu · Harry Dong · Yuejie Chi · Beidi Chen
Abstract:
With the widespread deployment of long-context large language models (LLMs), there has been a growing demand for efficient support of high-throughput inference. However, as the key-value (KV) cache expands with the sequence length, the increasing memory footprint and the need to access it for decoding both result in low throughput when serving long-context LLMs. While various dynamic sparse attention methods have been proposed to accelerate inference while maintaining generation quality, they either fail to sufficiently reduce GPU memory usage or introduce significant decoding latency by offloading the KV cache to the CPU. We present ShadowKV, a high-throughput long-context LLM inference system that stores the low-rank key cache and offloads the value cache to reduce the memory footprint for larger batch sizes and longer sequences. To minimize decoding latency, ShadowKV employs an accurate KV selection strategy that reconstructs minimal sparse KV pairs on-the-fly. By evaluating ShadowKV on benchmarks like RULER, LongBench, and models such as Llama-3.1-8B and GLM-4-9B-1M, we demonstrate that it achieves up to 6$\times$ larger batch sizes and 3.04$\times$ higher throughput on an A100 GPU without sacrificing accuracy, even surpassing the performance achievable with infinite batch size under the assumption of infinite GPU memory.
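The low-rank key-cache idea at the core of this abstract can be sketched in a few lines: store the key matrix as a truncated-SVD factorization and reconstruct it on demand. This toy example (illustrative only, not ShadowKV's implementation; the shapes and rank are made-up) shows the memory saving when the keys are well approximated at low rank.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, head_dim, rank = 1024, 128, 16

# Synthetic key matrix with exact low rank, standing in for one head's keys.
K = rng.standard_normal((seq_len, rank)) @ rng.standard_normal((rank, head_dim))

# Keep only the rank-r factors instead of the full (seq_len, head_dim) cache.
U, S, Vt = np.linalg.svd(K, full_matrices=False)
A = U[:, :rank] * S[:rank]   # (seq_len, rank) factor
B = Vt[:rank]                # (rank, head_dim) factor

K_hat = A @ B                # reconstructed keys, computed on demand
compression = K.size / (A.size + B.size)
print(f"max reconstruction error: {np.abs(K - K_hat).max():.2e}")
print(f"compression ratio: {compression:.2f}x")
```

Real key caches are only approximately low-rank, so the actual system pairs this with the sparse KV selection the abstract describes rather than reconstructing everything.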
Paperid:709
Authors:Yicheng Xiao · Lin Song · Rui Yang · Cheng Cheng · Yixiao Ge · Xiu Li · Ying Shan
Abstract:
Recent advances have highlighted the benefits of scaling language models to enhance performance across a wide range of NLP tasks. However, these approaches still face limitations in effectiveness and efficiency when applied to domain-specific tasks, particularly for small edge-side models. We propose the LoRA-Gen framework, which utilizes a large cloud-side model to generate LoRA parameters for edge-side models based on task descriptions. By employing the reparameterization technique, we merge the LoRA parameters into the edge-side model to achieve flexible specialization. Our method facilitates knowledge transfer between models while significantly improving the inference efficiency of the specialized model by reducing the input context length. Without specialized training, LoRA-Gen outperforms conventional LoRA fine-tuning, which achieves competitive accuracy and a 2.1x speedup with TinyLLaMA-1.1B in reasoning tasks. Besides, our method delivers a compression ratio of 10.1x with Gemma-2B on intelligent agent tasks.
Paperid:710
Authors:Jihoon Chung · Tyler Zhu · Max Gonzalez Saez-Diez · Juan Carlos Niebles · Honglu Zhou · Olga Russakovsky
Abstract:
Recent advances in vision backbones have yielded powerful and diverse visual and video encoders. Yet, current Video Large Language Models encode visual inputs using an encoder from a single backbone family, limiting the amount and type of visual information they can process. We propose MERV, a Multi-Encoder Video Representation, which utilizes multiple encoders for a comprehensive video representation. To optimize heterogeneous features from a broad spectrum of encoders and ensure efficient and coherent feature integration, MERV first aligns encoder features spatio-temporally, then projects them into a unified structure, and finally fuses them through cross-attention. Under fair comparison, MERV achieves up to 4.62% higher accuracy than its base model, while introducing minimal extra parameters and training faster than equivalent single-encoder methods after parallelizing visual processing. Qualitative analysis shows MERV successfully captures and integrates domain knowledge from each encoder, opening new possibilities for scaling enhanced video understanding.
Paperid:711
Authors:Dong Xiao · Guangyao Chen · Peixi Peng · Yangru Huang · Yifan Zhao · Yongxing Dai · Yonghong Tian
Abstract:
Anomaly detection is essential for the safety and reliability of autonomous driving systems. Current methods often focus on detection accuracy but neglect response time, which is critical in time-sensitive driving scenarios. In this paper, we introduce real-time anomaly detection for autonomous driving, prioritizing both minimal response time and high accuracy. We propose a novel multimodal asynchronous hybrid network that combines event streams from event cameras with image data from RGB cameras. Our network utilizes the high temporal resolution of event cameras through an asynchronous Graph Neural Network and integrates it with spatial features extracted by a CNN from RGB images. This combination effectively captures both the temporal dynamics and spatial details of the driving environment, enabling swift and precise anomaly detection. Extensive experiments on benchmark datasets show that our approach outperforms existing methods in both accuracy and response time, achieving millisecond-level real-time performance.
Paperid:712
Authors:Clément Lalanne · Franck Iutzeler · Loubes Jean-Michel · Julien Chhor
Abstract:
Estimating optimal transport maps between two distributions from respective samples is an important element for many machine learning methods. To do so, rather than extending discrete transport maps, it has been shown that estimating the Brenier potential of the transport problem and obtaining a transport map through its gradient is near minimax optimal for smooth problems. In this paper, we investigate the private estimation of such potentials and transport maps with respect to the distribution samples. We propose a differentially private transport map estimator with $L^2$ error at most $n^{-1} \vee n^{-\frac{2 \alpha}{2 \alpha - 2 + d}} \vee (n\epsilon)^{-\frac{2 \alpha}{2 \alpha + d}}$ up to polylog terms, where $n$ is the sample size, $\epsilon$ is the desired level of privacy, $\alpha$ is the smoothness of the true transport map, and $d$ is the dimension of the feature space. We also provide a lower bound for the problem.
Paperid:713
Authors:Tianren Zhang · Guanyu Chen · Feng Chen
Abstract:
Humans develop world models that capture the underlying generation process of data. Whether neural networks can learn similar world models remains an open problem. In this work, we provide the first theoretical results for this problem, showing that in a multitask setting, models with a low-degree bias provably recover latent data-generating variables under mild assumptions---even if proxy tasks involve complex, non-linear functions of the latents. However, such recovery is also sensitive to model architecture. Our analysis leverages Boolean models of task solutions via the Fourier-Walsh transform and introduces new techniques for analyzing invertible Boolean transforms, which may be of independent interest. We illustrate the algorithmic implications of our results and connect them to related research areas, including self-supervised learning, out-of-distribution generalization, and the linear representation hypothesis in large language models.
Paperid:714
Authors:shuai wang · Zexian Li · Qipeng zhang · Tianhui Song · Xubin Li · Tiezheng Ge · Bo Zheng · Limin Wang
Abstract:
Diffusion models have demonstrated remarkable generation quality but at the cost of numerous function evaluations. Recently, advanced ODE-based solvers have been developed to mitigate the substantial computational demands of reverse-diffusion solving under limited sampling steps. However, these solvers, heavily inspired by Adams-like multistep methods, rely solely on t-related Lagrange interpolation. We show that t-related Lagrange interpolation is suboptimal for diffusion models and reveal a compact search space comprised of time steps and solver coefficients. Building on our analysis, we propose a novel differentiable solver search algorithm to identify a more optimal solver. Equipped with the searched solver, rectified-flow models, e.g., SiT-XL/2 and FlowDCN-XL/2, achieve FID scores of 2.40 and 2.35, respectively, on ImageNet-$256\times256$ with only 10 steps. Meanwhile, DDPM model, DiT-XL/2, reaches a FID score of 2.33 with only 10 steps. Notably, our searched solver outperforms traditional solvers by a significant margin. Moreover, our searched solver demonstrates generality across various model architectures, resolutions, and model sizes.
Paperid:715
Authors:Jacopo Talpini · Marco Savi · Giovanni Neglia
Abstract:
One-Shot Federated Learning (FL) is a recent paradigm that enables multiple clients to cooperatively learn a global model in a single round of communication with a central server. In this paper, we analyze the One-Shot FL problem through the lens of Bayesian inference and propose FedBEns, an algorithm that leverages the inherent multimodality of local loss functions to find better global models. Our algorithm leverages a mixture of Laplace approximations for the clients' local posteriors, which the server then aggregates to infer the global model. We conduct extensive experiments on various datasets, demonstrating that the proposed method outperforms competing baselines that typically rely on unimodal approximations of the local losses.
Paperid:716
Authors:Mirco Mutti · Jeongyeol Kwon · Aviv Tamar · Shie Mannor
Abstract:
Contextual multi-armed bandits are a popular choice to model sequential decision-making. *E.g.*, in a healthcare application we may perform various tests to assess a patient's condition (exploration) and then decide on the best treatment to give (exploitation). When humans design strategies, they aim for the exploration to be *fast*, since the patient's health is at stake, and easy to *interpret* for a physician overseeing the process. However, common bandit algorithms are nothing like that: The regret caused by exploration scales with $\sqrt{H}$ over $H$ rounds and decision strategies are based on opaque statistical considerations. In this paper, we use an original *classification view* to meta learn interpretable and fast exploration plans for a fixed collection of bandits $\mathbb{M}$. The plan is prescribed by an interpretable *decision tree* probing decisions' payoff to classify the test bandit. The test regret of the plan in the *stochastic* and *contextual* setting scales with $O (\lambda^{-2} C_{\lambda} (\mathbb{M}) \log^2 (MH))$, where $M$ is the size of $\mathbb{M}$, $\lambda$ is a separation parameter over the bandits, and $C_\lambda (\mathbb{M})$ is a novel *classification-coefficient* that fundamentally links meta learning bandits with classification. Through a nearly matching lower bound, we show that $C_\lambda (\mathbb{M})$ inherently captures the complexity of the setting.
Paperid:717
Authors:Tianyuan Zou · Yang Liu · Peng Li · Yufei Xiong · Jianqing Zhang · Jingjing Liu · Xiaozhou Ye · Ye Ouyang · Ya-Qin Zhang
Abstract:
Substantial quantity and high quality are the golden rules of making a good training dataset, with sample privacy protection equally important. Generating synthetic samples that resemble high-quality private data while ensuring Differential Privacy (DP), a formal privacy guarantee, promises scalability and practicality. However, existing methods relying on pre-trained models for data synthesis often struggle in data-deficient scenarios, suffering from limited sample size, inevitable generation noise and existing pre-trained model bias. To address these challenges, we propose a novel contrAstive private data Synthesis via Weighted multiple Pre-trained generative models framework, named as WASP. WASP utilizes limited private samples for more accurate private data distribution estimation via a Top-Q voting mechanism, and leverages low-quality synthetic samples for contrastive generation via collaboration among dynamically weighted multiple pre-trained models. Extensive experiments on 6 well-developed datasets with 6 open-source and 3 closed-source PLMs demonstrate the superiority of WASP in improving model performance over diverse downstream tasks. Code is available at https://anonymous.4open.science/r/WASP.
Paperid:718
Authors:Santiago Cortes-Gomez · Naveen Raman · Aarti Singh · Bryan Wilder
Abstract:
Randomized controlled trials (RCTs) generate guarantees for treatment effects. However, RCTs often spend unnecessary resources exploring suboptimal treatments, which can reduce the power of treatment guarantees. To address this, we propose a two-stage RCT design. In the first stage, a data-driven screening procedure prunes low-impact treatments, while the second stage focuses on developing high-probability lower bounds for the best-performing treatment effect. Unlike existing adaptive RCT frameworks, our method is simple enough to be implemented in scenarios with limited adaptivity. We derive optimal designs for two-stage RCTs and demonstrate how such designs can be implemented through sample splitting. Empirically, we demonstrate that two-stage designs improve upon single-stage approaches, especially for scenarios where domain knowledge is available through a prior. Our work is thus a simple yet effective design for RCTs, optimizing for the ability to certify with high probability the largest possible treatment effect for at least one of the arms studied.
Paperid:719
Authors:Alan Amin · Andres Potapczynski · Andrew Wilson
Abstract:
To predict and understand the causes of disease, geneticists build models that predict how a genetic variant impacts phenotype from genomic features. There is a vast amount of data available from the large projects that have sequenced hundreds of thousands of genomes; yet, state-of-the-art models, like LD score regression, cannot leverage this data, as they rely on simplifying assumptions that limit their flexibility. These assumptions let them avoid solving the large linear algebra problems introduced by the genomic correlation matrices. In this paper, we leverage modern fast linear algebra techniques to develop WASP (genome Wide Association Studies with Preconditioned iteration), a method to train large and flexible neural network models. On semi-synthetic and real data we show that WASP better predicts phenotype and better recovers its functional causes compared to LD score regression. Finally, we show that training larger WASP models on larger data leads to better explanations of phenotypes.
Paperid:720
Authors:Zihan Zhang · Yuxin Chen · Jason Lee · Simon Du · Ruosong Wang
Abstract:
In this work, we study reinforcement learning (RL) with trajectory feedback. Compared to the standard RL setting, in RL with trajectory feedback, the agent only observes the accumulative reward along the trajectory, and therefore, this model is particularly suitable for scenarios where querying the reward in each single step incurs prohibitive cost. For a finite-horizon Markov Decision Process (MDP) with $S$ states, $A$ actions and a horizon length of $H$, we develop an algorithm that enjoys an asymptotically nearly optimal regret of $\tilde{O}\left(\sqrt{SAH^3K}\right)$ in $K$ episodes. To achieve this result, our new technical ingredients include (i) constructing a tighter confidence region for the reward function by incorporating the RL with trajectory feedback setting with techniques in linear bandits and (ii) constructing a reference transition model to better guide the exploration process.
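The key observation behind the trajectory-feedback setting is that an episode's return is linear in the per-state rewards, so reward estimation reduces to a linear regression over visit counts. The following is a minimal sketch of that linear view only (the function name and setup are illustrative, not the paper's algorithm, which additionally builds confidence regions and a reference transition model):

```python
import numpy as np

def estimate_rewards(trajectories, returns, n_states):
    # Each episode reveals only its total return, which is linear in the
    # per-state rewards: return_i = sum_s visits_i(s) * r(s).
    # Least squares over visit-count features therefore recovers r
    # once the visit matrix has full column rank.
    X = np.zeros((len(trajectories), n_states))
    for i, traj in enumerate(trajectories):
        for s in traj:
            X[i, s] += 1.0
    r_hat, *_ = np.linalg.lstsq(X, np.asarray(returns, dtype=float), rcond=None)
    return r_hat
```

This is exactly the structure that lets linear-bandit techniques apply: the visit-count vector plays the role of the action feature, and the unknown reward vector is the linear parameter.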
Paperid:721
Authors:Chao Gao · Liren Shan · Vaidehi Srinivas · Aravindan Vijayaraghavan
Abstract:
Conformal Prediction is a widely studied technique to construct prediction sets of future observations. Most conformal prediction methods focus on achieving the necessary coverage guarantees, but do not provide formal guarantees on the size (volume) of the prediction sets. We first prove the impossibility of volume optimality where any distribution-free method can only find a trivial solution. We then introduce a new notion of volume optimality by restricting the prediction sets to belong to a set family (of finite VC-dimension), specifically a union of $k$-intervals. Our main contribution is an efficient distribution-free algorithm based on dynamic programming (DP) to find a union of $k$-intervals that is guaranteed for any distribution to have near-optimal volume among all unions of $k$-intervals satisfying the desired coverage property. By adopting the framework of distributional conformal prediction (Chernozhukov et al., 2021), the new DP-based conformity score can also be applied to achieve approximate conditional coverage and conditional restricted volume optimality, as long as a reasonable estimator of the conditional CDF is available. While the theoretical results already establish volume-optimality guarantees, they are complemented by experiments that demonstrate that our method can significantly outperform existing methods in many settings.
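The combinatorial core of the approach can be illustrated with a simplified dynamic program (a sketch under assumptions, not the paper's algorithm): given sorted calibration points, find $k$ intervals with endpoints at data points whose union covers at least $m$ points with minimal total length.

```python
def min_length_k_intervals(points, k, m):
    # Minimal total length of a union of at most k intervals (endpoints at
    # data points) covering at least m of the n points.
    # dp[i][j][c]: considering the first i sorted points, using j intervals
    # and covering c points (capped at m), the minimal total length.
    xs = sorted(points)
    n = len(xs)
    INF = float("inf")
    dp = [[[INF] * (m + 1) for _ in range(k + 1)] for _ in range(n + 1)]
    dp[0][0][0] = 0.0
    for i in range(n):
        for j in range(k + 1):
            for c in range(m + 1):
                cur = dp[i][j][c]
                if cur == INF:
                    continue
                # Option 1: leave point i uncovered.
                dp[i + 1][j][c] = min(dp[i + 1][j][c], cur)
                # Option 2: open an interval at point i, closing at point e.
                if j < k:
                    for e in range(i, n):
                        cc = min(m, c + e - i + 1)
                        cand = cur + (xs[e] - xs[i])
                        if cand < dp[e + 1][j + 1][cc]:
                            dp[e + 1][j + 1][cc] = cand
    return min(dp[n][j][m] for j in range(k + 1))
```

In a conformal setting, $m$ would be chosen as roughly $\lceil (1-\alpha)(n+1) \rceil$ so that the returned union of intervals achieves the desired coverage; the paper's contribution is turning this idea into a distribution-free, near-volume-optimal conformity score.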
Paperid:722
Authors:Negin Golrezaei · Sourav Sahoo
Abstract:
Repeated uniform price multi-unit auctions are widely prevalent in emissions permit auctions, Treasury auctions, and energy markets, where identical goods are sold at the same per-unit price. We study the bidding problem from the perspective of a single *value-maximizing* buyer who aims to maximize their cumulative value over $T$ rounds while adhering to a return-on-investment (RoI) constraint in each round. Buyers adopt the $m$-*uniform bidding* format, where they submit $m$ bid-quantity pairs $(b_i, q_i)$ to demand $q_i$ units at bid $b_i$. We introduce the notion of *safe* bidding strategies that ensure RoI constraints are satisfied in each auction regardless of the competing bids. Then, we characterize a finite class of safe strategies, determined solely by the bidder’s valuation curve. While the number of strategies in this class is exponential in $m$, we develop a polynomial-time algorithm to learn the optimal safe strategy that achieves sublinear regret in the online setting, where regret is measured against a clairvoyant benchmark that knows the competing bids *a priori* and selects the fixed hindsight optimal safe strategy. We then evaluate the performance of safe strategies against a clairvoyant that selects the hindsight optimal strategy from a richer class of strategies in the online setting. In such scenarios, we compute the optimal *richness ratio*, $\alpha\in(0, 1]$, for the class of strategies chosen by the clairvoyant and show that our algorithm, designed to learn safe strategies, achieves $\alpha$-approximate sublinear regret against these stronger benchmarks. Experiments on semi-synthetic data from real-world auctions show that safe strategies substantially outperform the derived theoretical bounds. The strong empirical performance of safe bidding strategies, combined with their intuitive characterization, makes them quite appealing in practice.
Paperid:723
Authors:Long Zhao · Sanghyun Woo · Ziyu Wan · Yandong li · Han Zhang · Boqing Gong · Hartwig Adam · Xuhui Jia · Ting Liu
Abstract:
In generative modeling, tokenization simplifies complex data into compact, structured representations, creating a more efficient, learnable space. For high-dimensional visual data, it reduces redundancy and emphasizes key features for high-quality generation. Current visual tokenization methods rely on a traditional autoencoder framework, where the encoder compresses data into latent representations, and the decoder reconstructs the original input. In this work, we offer a new perspective by proposing *denoising as decoding*, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image, guided by the latents provided by the encoder. We evaluate our approach by assessing both reconstruction (rFID) and generation quality (FID), comparing it to state-of-the-art autoencoding approaches. By adopting iterative reconstruction through diffusion, our autoencoder, namely $\epsilon$-VAE, achieves high reconstruction quality, which in turn enhances downstream generation quality by 22% and provides 2.3$\times$ inference speedup. We hope this work offers new insights into integrating iterative generation and autoencoding for improved compression and generation.
Paperid:724
Authors:Qiang Chen · Zhongze Wu · Xiu Su · Xi Lin · Zhe Qu · Shan You · Shuo Yang · Chang Xu
Abstract:
Group fairness based on adversarial training has gained significant attention on graph data, typically implemented by masking sensitive attributes to generate fair feature views. However, existing models suffer from training instability due to uncertainty of the generated masks and the trade-off between fairness and utility. In this work, we propose a stable fair Graph Neural Network (SFG) to maintain training stability while preserving accuracy and fairness performance. Specifically, we first theoretically derive a tight upper Lipschitz bound to control the stability of existing adversarial-based models and employ a stochastic projected subgradient algorithm to constrain the bound, which operates in a block-coordinate manner. Additionally, we construct an uncertainty set to train the model, which can prevent unstable training by dropping some overfitting nodes caused by chasing fairness. Extensive experiments conducted on three real-world datasets demonstrate that SFG is stable and outperforms other state-of-the-art methods in terms of both fairness and utility performance. Codes are available at \href{https://anonymous.4open.science/r/SFG-559B}{https://anonymous.4open.science/r/SFG-559B}.
Paperid:725
Authors:Jan Blechschmidt · Tom-Christian Riemer · Martin STOLL · Max Winkler · Jan-Frederik Pietschmann
Abstract:
We develop a novel physics-informed deep learning approach for solving nonlinear drift-diffusion equations on metric graphs. These models represent an important model class with a large number of applications in areas ranging from transport in biological cells to the motion of human crowds. While traditional numerical schemes require a large amount of tailoring, especially in the case of model design or parameter identification problems, physics-informed deep operator networks (DeepONets) have emerged as a versatile tool for the solution of partial differential equations, with the particular advantage that they easily incorporate parameter identification questions. We here present an approach where we first learn three DeepONet models for representative inflow, inner and outflow edges, respectively, and then couple these models for the solution of the drift-diffusion metric graph problem by relying on an edge-based domain decomposition approach. We illustrate that our framework is applicable for the accurate evaluation of graph-coupled physics models and is well suited for solving optimization or inverse problems on these coupled networks.
Paperid:726
Authors:Yuzheng Hu · Fan Wu · Ruicheng Xian · Yuhang Liu · Lydia Zakynthinou · Pritish Kamath · Chiyuan Zhang · David Forsyth
Abstract:
We propose the notion of empirical privacy variance and study it in the context of differentially private fine-tuning of language models. Specifically, we show that models calibrated to the same $(\varepsilon, \delta)$-DP guarantee using DP-SGD with different hyperparameter configurations can exhibit significant variations in empirical privacy, which we quantify through the lens of memorization. We investigate the generality of this phenomenon across multiple dimensions and discuss why it is surprising and relevant. Through regression analysis, we examine how individual and composite hyperparameters influence empirical privacy. The results reveal a no-free-lunch trade-off: existing practices of hyperparameter tuning in DP-SGD, which focus on optimizing utility under a fixed privacy budget, often come at the expense of empirical privacy. To address this, we propose refined heuristics for hyperparameter selection that explicitly account for empirical privacy, showing that they are both precise and practically useful. Finally, we take preliminary steps to understand empirical privacy variance. We propose two hypotheses, identify limitations in existing techniques like privacy auditing, and outline open questions for future research.
Paperid:727
Authors:Jian Bi · Qianliang Wu · Xiang Li · Shuo Chen · Jianjun Qian · lei luo · Jian Yang
Abstract:
Data augmentation has been widely used in machine learning. Its main goal is to transform and expand the original data using various techniques, creating a more diverse and enriched training dataset. However, due to the disorder and irregularity of point clouds, existing methods struggle to enrich geometric diversity and maintain topological consistency, leading to imprecise point cloud understanding. In this paper, we propose SinPoint, a novel method designed to preserve the topological structure of the original point cloud through a homeomorphism. It utilizes the Sine function to generate smooth displacements. This simulates object deformations, thereby producing a rich diversity of samples. In addition, we propose a Markov chain Augmentation Process to further expand the data distribution by combining different basic transformations through a random process. Our extensive experiments demonstrate that our method consistently outperforms existing Mixup and Deformation methods on various benchmark point cloud datasets, improving performance for shape classification and part segmentation tasks. Specifically, when used with PointNet++ and DGCNN, our method achieves a state-of-the-art accuracy of 90.2 in shape classification with the real-world ScanObjectNN dataset.
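The idea of sine-based smooth displacement can be sketched as follows. This is a hypothetical minimal version (function name, frequency sampling, and parameter ranges are our illustrative assumptions, not the paper's exact formulation): each point is displaced by a smooth sinusoidal field of its own coordinates, so that for small amplitudes nearby points move together and the deformation behaves like a homeomorphism in practice.

```python
import numpy as np

def sin_displace(points, amplitude=0.05, freq_range=(0.5, 2.0), seed=None):
    # points: (n, d) array. Displace each point along every axis by a
    # sine of a random linear combination of its coordinates. The field
    # is smooth, so small amplitudes deform the shape without tearing it.
    rng = np.random.default_rng(seed)
    n, d = points.shape
    freqs = rng.uniform(*freq_range, size=(d, d))   # one frequency row per output axis
    phases = rng.uniform(0.0, 2.0 * np.pi, size=d)  # per-axis phase offsets
    disp = amplitude * np.sin(points @ freqs.T + phases)
    return points + disp
```

Each call with a fresh seed yields a different smooth deformation of the same shape, which is how such a transform enriches geometric diversity while bounding every point's displacement by the amplitude.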
Paperid:728
Authors:Yang Zhang · Er Jin · Yanfei Dong · Yixuan Wu · Phil Torr · Ashkan Khakzar · Johannes Stegmaier · Kenji Kawaguchi
Abstract:
Recent advances in generative models have demonstrated remarkable capabilities in producing high-quality images, but their reliance on large-scale unlabeled data has raised significant safety and copyright concerns. Efforts to address these issues by erasing unwanted concepts have shown promise. However, many existing erasure methods involve excessive modifications that compromise the overall utility of the model. In this work, we address these issues by formulating a novel minimalist concept erasure objective based only on the distributional distance of final generation outputs. Building on our formulation, we derive a tractable loss for differentiable optimization that leverages backpropagation through all generation steps in an end-to-end manner. We also conduct extensive analysis to show theoretical connections with other models and methods. To improve the robustness of the erasure, we incorporate neuron masking as an alternative to model fine-tuning. Empirical evaluations on state-of-the-art flow-matching models demonstrate that our method robustly erases concepts without degrading overall model performance, paving the way for safer and more responsible generative models.
Paperid:729
Authors:Xihang Yue · Yi Yang · Linchao Zhu
Abstract:
Recent advances in operator learning have produced two distinct approaches for solving partial differential equations (PDEs): attention-based methods offering point-level adaptability but lacking spectral constraints, and spectral-based methods providing domain-level continuity priors but limited in local flexibility. This dichotomy has hindered the development of PDE solvers with both strong flexibility and generalization capability. This work introduces Holistic Physics Mixer (HPM), a novel framework that bridges this gap by integrating spectral and physical information in a unified space. HPM unifies both approaches as special cases while enabling more powerful spectral-physical interactions beyond either method alone. This enables HPM to inherit both the strong generalization of spectral methods and the flexibility of attention mechanisms while avoiding their respective limitations. Through extensive experiments across diverse PDE problems, we demonstrate that HPM consistently outperforms state-of-the-art methods in both accuracy and computational efficiency, while maintaining strong generalization capabilities with limited training data and excellent zero-shot performance on unseen resolutions.
Paperid:730
Authors:David Pfau · Ian Davies · Diana Borsa · João Madeira Araujo · Brendan Tracey · Hado van Hasselt
Abstract:
We introduce Wasserstein Policy Optimization (WPO), an actor-critic algorithm for reinforcement learning in continuous action spaces. WPO can be derived as an approximation to Wasserstein gradient flow over the space of all policies projected into a finite-dimensional parameter space (e.g., the weights of a neural network), leading to a simple and completely general closed-form update. The resulting algorithm combines many properties of deterministic and classic policy gradient methods. Like deterministic policy gradients, it exploits knowledge of the gradient of the action-value function with respect to the action. Like classic policy gradients, it can be applied to stochastic policies with arbitrary distributions over actions -- without using the reparameterization trick. We show results on the DeepMind Control Suite and a magnetic confinement fusion task which compare favorably with state-of-the-art continuous control methods.
Paperid:731
Authors:Junsu Kim · Jaeyeon Kim · Ernest Ryu
Abstract:
Low-rank adaptation (LoRA) has become a standard approach for fine-tuning large foundation models. However, our theoretical understanding of LoRA remains limited as prior analyses of LoRA's training dynamics either rely on linearization arguments or consider highly simplified setups. In this work, we analyze the LoRA loss landscape without such restrictive assumptions. We define two regimes: a "special regime", which includes idealized setups where linearization arguments hold, and a "generic regime" representing more realistic setups where linearization arguments do not hold. In the generic regime, we show that LoRA training converges to a global minimizer with low rank and small magnitude, or a qualitatively distinct solution with high rank and large magnitude. Finally, we argue that the zero-initialization and weight decay in LoRA training induce an implicit bias toward the low-rank, small-magnitude region of the parameter space—where global minima lie—thus shedding light on why LoRA training usually succeeds in finding global minima.
Paperid:732
Authors:Yongxin Yang · Wenqi Chen · Chao Shu · Timothy Hospedales
Abstract:
We propose HyperIV, a novel approach for real-time implied volatility smoothing that eliminates the need for traditional calibration procedures. Our method employs a hypernetwork to generate parameters for a compact neural network that constructs complete volatility surfaces within 2 milliseconds, using only 9 market observations. Moreover, the generated surfaces guarantee no static arbitrage. Extensive experiments across 8 index options demonstrate that HyperIV achieves superior accuracy compared to existing methods while maintaining computational efficiency. The model also exhibits strong cross-asset generalization capabilities, indicating broader applicability across different market instruments. These key features -- rapid adaptation to market conditions, guaranteed absence of arbitrage, and minimal data requirements -- make HyperIV particularly valuable for real-time trading applications. We make code available at https://anonymous.4open.science/r/3c2f82ff7e26/.
Paperid:733
Authors:Yuhan Helena Liu · Guangyu Robert Yang · Christopher Cueva
Abstract:
In the quest to uncover how the brain's learning capabilities arise from its biological ingredients, investigating biologically plausible learning rules offers a promising avenue. These rules, often approximate gradient descent learning to respect biological constraints such as locality, must meet two critical criteria to be considered an appropriate brain model: (1) good neuroscience task performance and (2) alignment with neural recordings. While extensive research has assessed the first criterion, the second remains underexamined. Employing methods such as Procrustes analysis on well-known neuroscience datasets, this study demonstrates the existence of a biologically plausible learning rule — namely e-prop, which is based on gradient truncation and has demonstrated versatility across a wide range of tasks — that can achieve neural data similarity comparable to Backpropagation Through Time (BPTT) when matched for task accuracy. Our findings also reveal that model architecture and initial conditions can play a more significant role in determining neural similarity than the specific learning rule. Furthermore, we observe that BPTT-trained models and their biologically plausible counterparts exhibit similar dynamical properties at comparable accuracies; to support this further, we provide theoretical results demonstrating that e-prop can converge to the same solution as BPTT, depending on the initialization. These results underscore the substantial progress made in developing biologically plausible learning rules, highlighting their potential to achieve both competitive task performance and neural data similarity.
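Procrustes analysis of the kind mentioned above compares two population-activity matrices up to an orthogonal change of basis. The sketch below shows one standard formulation under our own assumptions (centered, norm-scaled matrices; residual after the best orthogonal alignment), not necessarily the study's exact pipeline:

```python
import numpy as np

def procrustes_distance(X, Y):
    # X, Y: condition-by-unit activity matrices of the same shape.
    # Center and scale each to unit Frobenius norm, then measure the
    # residual of min_R ||X - Y R||_F over orthogonal R, which equals
    # sqrt(2 - 2 * sum of singular values of Y^T X).
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    X = X - X.mean(axis=0)
    X = X / np.linalg.norm(X)
    Y = Y - Y.mean(axis=0)
    Y = Y / np.linalg.norm(Y)
    s = np.linalg.svd(Y.T @ X, compute_uv=False)
    return float(np.sqrt(max(0.0, 2.0 - 2.0 * s.sum())))
```

A distance near zero means the two representations are identical up to rotation and reflection, which is why such metrics are stricter than regression-based similarity scores when comparing models to neural recordings.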
Paperid:734
Authors:Vandermeulen · Wai Ming Tai · Bryon Aragam
Abstract:
We show that deep neural networks can achieve dimension-independent rates of convergence for learning structured densities typical of image, audio, video, and text data. Concretely, for image density estimation, where each pixel becomes independent of the rest of the image when conditioned on pixels at most $t$ steps away, a simple $L^2$-minimizing neural network can attain a rate of $n^{-1/((t+1)^2+4)}$, where $d$ is the total number of dimensions, i.e., the full image dimensionality given by width times height. We provide empirical evidence that, in real-world applications, $t$ is often a small constant, thus effectively circumventing the curse of dimensionality. Moreover, for sequential data (e.g., audio or text) exhibiting a similar local dependence structure, our analysis shows a rate of $n^{-1/(t+5)}$, offering further evidence of dimension independence in practical scenarios.
Paperid:735
Authors:Yufei Gu · Qian-Yuan Tang · Yunfeng Cai · Mingming Sun · Ping Li · zhou Xun · Zeke Xie
Abstract:
It is well-known that the Hessian of the deep loss landscape matters to optimization and generalization of deep learning. Previous studies reported a rough Hessian structure in deep learning, which consists of two components, a small number of large eigenvalues and a large number of nearly-zero eigenvalues. To the best of our knowledge, we are the first to report that a simple but overlooked power-law Hessian structure exists in well-trained deep neural networks, including Convolutional Neural Networks (CNNs) and Large Language Models (LLMs). Moreover, we provide a maximum-entropy theoretical interpretation for the power-law Hessian structure and theoretically demonstrate the existence of robust and low-dimensional subspace of deep neural networks. Our extensive experiments using the proposed power-law spectral method demonstrate that the power-law Hessian spectra critically relate to multiple important behaviors of deep learning, including optimization, generalization, and overparameterization. Notably, we discover that the power-law Hessian structure of a given LLM can effectively predict generalization during training, while conventional sharpness-based generalization measures that often work well on CNNs become nearly useless as generalization predictors for LLMs.
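A power-law spectrum means the sorted eigenvalues decay roughly as $\lambda_k \approx C\, k^{-s}$. One simple, illustrative way to estimate the exponent $s$ (our own sketch; the paper's spectral method may differ) is a least-squares fit in log-log space over the top eigenvalues:

```python
import numpy as np

def powerlaw_exponent(eigvals, top=50):
    # Fit lambda_k ≈ C * k^(-s) to the top eigenvalues: in log-log space
    # this is a line log(lambda_k) = log(C) - s * log(k), so the negated
    # slope of a degree-1 least-squares fit estimates s.
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1][:top]
    k = np.arange(1, lam.size + 1)
    slope, _ = np.polyfit(np.log(k), np.log(lam), 1)
    return -slope
```

Tracking such an exponent during training is the kind of scalar summary that could serve as the generalization predictor the abstract describes.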
Paperid:736
Authors:gen li · Yao Wan · Hongyu Zhang · Zhou Zhao · Wenbin Jiang · Xuanhua Shi · Hai Jin · Zheng Wang
Abstract:
Language Models (LMs) are increasingly used for type inference, aiding in error detection and software development. Some real-life deployments of LMs require the model to run on local machines to safeguard software's intellectual property. This setting often limits the size of the LMs that can be used. We present Nester, a neuro-symbolic approach to enhance LMs for type inference by integrating symbolic learning without increasing model size. Nester breaks type inference into sub-tasks based on the data and control flow of the input code, encoding them as a modular high-level program. This program executes multi-step actions, such as evaluating expressions and analyzing conditional branches of the target program, combining static typing with LMs to infer potential types. Evaluated on the ManyTypes4Py dataset in Python, Nester outperforms two state-of-the-art type inference methods (HiTyper and TypeGen), achieving 70.7\% Top-1 Exact Match, 18.3\% and 3.6\% higher than HiTyper and TypeGen, respectively. For complex type annotations using constructs like ''typing.Optional'' and ''typing.Union'', Nester achieves 53.2\%, surpassing TypeGen by 17.9\%.
Paperid:737
Authors:jiwei zhu · Zhao Yang · Bing Su
Abstract:
Inspired by the success of unsupervised pre-training paradigms, researchers have applied these approaches to DNA pre-training. However, we argue that these approaches alone yield suboptimal results because pure DNA sequences lack sufficient information, since their functions are regulated by genomic profiles like chromatin accessibility. Here, we demonstrate that supervised training for genomic profile prediction serves as a more effective alternative to pure sequence pre-training. Furthermore, considering the multi-species and multi-profile nature of genomic profile prediction, we introduce our Species-Profile Adaptive Collaborative Experts (SPACE) that leverages Mixture of Experts (MoE) to better capture the relationships between DNA sequences across different species and genomic profiles, thereby learning more effective DNA representations. Through extensive experiments across various tasks, our model achieves state-of-the-art performance, establishing that DNA models trained with supervised genomic profiles serve as powerful DNA representation learners.
Paperid:738
Authors:Arturs Berzins · Andreas Radler · Eric Volkmann · Sebastian Sanokowski · Sepp Hochreiter · Johannes Brandstetter
Abstract:
Geometry is a ubiquitous tool in computer graphics, design, and engineering. However, the lack of large shape datasets limits the application of state-of-the-art supervised learning methods and motivates the exploration of alternative learning strategies. To this end, we introduce geometry-informed neural networks (GINNs) -- a framework for training shape-generative neural fields without data by leveraging user-specified design requirements in the form of objectives and constraints. By adding diversity as an explicit constraint, GINNs avoid mode-collapse and can generate multiple diverse solutions, often required in geometry tasks. Experimentally, we apply GINNs to several problems spanning physics, geometry, and engineering design, showing control over geometrical and topological properties, such as surface smoothness or the number of holes. These results demonstrate the potential of training shape-generative models without data, paving the way for new generative design approaches without large datasets.
Paperid:739
Authors:Xinyu Yuan · Zichen Wang · Marcus Collins · Huzefa Rangwala
Abstract:
Recent years have witnessed a surge in the development of protein structural tokenization methods, which chunk protein 3D structures into discrete or continuous representations. Structure tokenization enables the direct application of powerful techniques like language modeling for protein structures, and large multimodal models to integrate structures with protein sequences and functional texts. Despite the progress, the capabilities and limitations of these methods remain poorly understood due to the lack of a unified evaluation framework. We first introduce StructTokenBench, a framework that comprehensively evaluates the quality and efficiency of structure tokenizers, focusing on fine-grained local substructures rather than global structures, as is typical in existing benchmarks. Our evaluations reveal that no single model dominates all benchmarking perspectives. Observations of codebook under-utilization led us to develop AminoAseed, a simple yet effective strategy that enhances codebook gradient updates and optimally balances codebook size and dimension for improved tokenizer utilization and quality. Compared to the leading model ESM3, our method achieves an average of 6.31\% performance improvement across 24 supervised tasks, with sensitivity and utilization rates increased by 12.83\% and 124.03\%, respectively.
Paperid:740
Authors:Renzhe Xu · Kang Wang · Bo Li
Abstract:
Data heterogeneity across multiple sources is common in real-world machine learning (ML) settings. Although many methods focus on enabling a single model to handle diverse data, real-world markets often comprise multiple competing ML providers. In this paper, we propose a game-theoretic framework—the Heterogeneous Data Game—to analyze how such providers compete across heterogeneous data sources. We investigate the resulting pure Nash equilibria (PNE), showing that they can be non-existent, homogeneous (all providers converge on the same model), or heterogeneous (providers specialize in distinct data sources). Our analysis spans monopolistic, duopolistic, and more general markets, illustrating how factors such as the ``temperature'' of data-source choice models and the dominance of certain data sources shape equilibrium outcomes. We offer theoretical insights into both homogeneous and heterogeneous PNEs, guiding regulatory policies and practical strategies for competitive ML marketplaces.
Paperid:741
Authors:Wonkwang Lee · Jongwon Jeong · Taehong Moon · Hyeon-Jong Kim · Jaehyeon Kim · Gunhee Kim · Byeong-Uk Lee
Abstract:
Motion synthesis for diverse object categories holds great potential for 3D content creation but remains underexplored due to two key challenges: (1) the lack of comprehensive motion datasets that include a wide range of high-quality motions and annotations, and (2) the absence of methods capable of handling heterogeneous skeletal templates from diverse objects. To address these challenges, we contribute the following: First, we augment the Truebones Zoo dataset—a high-quality animal motion dataset covering over 70 species—by annotating it with detailed text descriptions, making it suitable for text-based motion synthesis. Second, we introduce rig augmentation techniques that generate diverse motion data while preserving consistent dynamics, enabling models to adapt to various skeletal configurations. Finally, we redesign existing motion diffusion models to dynamically adapt to arbitrary skeletal templates, enabling motion synthesis for a diverse range of objects with varying structures. Experiments show our method generates realistic, coherent motions from textual descriptions for diverse and even unseen objects, setting a strong foundation for motion generation across diverse object categories with heterogeneous skeletal templates. Qualitative results are available at this link.
Paperid:742
Authors:Zitian Li · Wang Chi Cheung
Abstract:
Motivated by an open direction in the existing literature, we study the 1-identification problem, a fundamental pure-exploration formulation of the multi-armed bandit. The goal is to determine whether there exists an arm whose mean reward is at least a known threshold $\mu_0$. The agent needs to output the exact index of an arm whose mean reward is above $\mu_0$, or output $\textsf{None}$ if it believes such an arm does not exist. The agent needs to guarantee its output is correct with probability at least $1-\delta$. Degenne & Koolen (2019) studied the asymptotic nature of the 1-identification problem in depth, but they also note that its non-asymptotic analysis remains unclear. To tackle this, we put forward a new algorithm, Sequential-Exploration-Exploitation (SEE), and conduct a theoretical analysis from the non-asymptotic perspective. To the best of our knowledge, we are the first to achieve near-optimality, in the sense of matching upper and lower bounds on the pulling complexity up to a polylogarithmic factor. Numerical results also indicate the effectiveness of our algorithm compared to existing benchmarks.
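To make the 1-identification setting concrete, the following is a minimal pure-Python sketch of a naive anytime confidence-bound baseline (an illustration of the problem interface, not the SEE algorithm from the paper): it returns an arm index once a lower confidence bound clears the threshold, eliminates arms whose upper confidence bound falls below it, and outputs None when every arm is eliminated. The Hoeffding-style bonus and the toy Gaussian arms are illustrative assumptions.

```python
import math
import random

def one_identification(arms, mu0, delta, max_pulls=200000):
    # Naive anytime confidence-bound baseline for 1-identification
    # (an illustration of the problem, NOT the paper's SEE algorithm).
    # `arms` is a list of zero-argument reward samplers.
    K = len(arms)
    counts = [0] * K
    sums = [0.0] * K
    active = set(range(K))
    pulls = 0
    while active and pulls < max_pulls:
        for i in list(active):
            sums[i] += arms[i]()
            counts[i] += 1
            pulls += 1
            mean = sums[i] / counts[i]
            # Hoeffding-style anytime bonus with a union bound over arms/rounds
            bonus = math.sqrt(math.log(4 * K * counts[i] ** 2 / delta) / (2 * counts[i]))
            if mean - bonus > mu0:   # confidently above the threshold
                return i
            if mean + bonus < mu0:   # confidently below: eliminate this arm
                active.discard(i)
    return None                      # no arm appears to exceed mu0

random.seed(0)
arms = [lambda: random.gauss(0.2, 0.1), lambda: random.gauss(0.9, 0.1)]
print(one_identification(arms, mu0=0.5, delta=0.05))  # identifies arm 1
```

The sample complexity of this naive baseline is far from the matching bounds discussed in the abstract; it only illustrates the correctness requirement that the output be right with probability at least 1-delta.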
Paperid:743
Authors:Michael S Yao · James Gee · Osbert Bastani
Abstract:
The goal of offline model-based optimization (MBO) is to propose new designs that maximize a reward function given only an offline dataset. However, an important desideratum is to also propose a diverse set of final candidates that capture many optimal and near-optimal design configurations. We propose Diversity In Adversarial Model-based Optimization (DynAMO) as a novel method to introduce design diversity as an explicit objective into any MBO problem. Our key insight is to formulate diversity as a distribution matching problem where the distribution of generated designs captures the inherent diversity contained within the offline dataset. Extensive experiments spanning multiple scientific domains show that DynAMO can be used with common optimization methods to significantly improve the diversity of proposed designs while still discovering high-quality candidates.
Paperid:744
Authors:Dong Huang · Pengkun Yang
Abstract:
Correlation analysis is a fundamental step in uncovering meaningful insights from complex datasets. In this paper, we study the problem of detecting correlations between two random graphs following the Gaussian Wigner model with unlabeled vertices. Specifically, the task is formulated as a hypothesis testing problem: under the null hypothesis, the two graphs are independent, while under the alternative hypothesis, they are edge-correlated through a latent vertex permutation, yet maintain the same marginal distributions as under the null. We focus on the scenario where two induced subgraphs, each with a fixed number of vertices, are sampled. We determine the optimal rate for the sample size required for correlation detection, derived through an analysis of the conditional second moment. Additionally, we propose an efficient approximate algorithm that significantly reduces running time.
Paperid:745
Authors:Yangxu Liao · Wenke Huang · Guancheng Wan · Jian Liang · Bin Yang · Mang Ye
Abstract:
Federated learning provides an efficient privacy-preserving distributed training framework for large language models, addressing the growing scarcity of publicly available training data while enabling the utilization of private datasets. While integrating large language model fine-tuning with federated learning emerges as a promising research direction, researchers pay limited attention to non-IID instruction-following scenarios. Our key insight is decomposing client updates into consensus and divergence components, enabling the model to maintain core capabilities while adapting to domain-specific knowledge. We propose a novel federated learning framework called FedICU (Splitting with ImportanCe-aware Updating for Heterogeneous Federated Learning with Large Language Models), which introduces an aggregation mechanism that dynamically balances these components based on their contribution to global model performance, while implementing an importance-aware parameter updating strategy to prevent catastrophic forgetting and domain overfitting. Extensive experiments across diverse domains demonstrate that FedICU significantly outperforms existing federated learning approaches in terms of both generalization performance and domain adaptation. Our code is available at https://anonymous.4open.science/r/FedICU-7EC7.
Paperid:746
Authors:Yunzhen Yao · Lie He · Michael Gastpar
Abstract:
This paper considers the sample efficiency of preference learning, which models and predicts human choices based on comparative judgments. The minimax optimal estimation rate $\Theta(d/n)$ in traditional estimation theory requires that the number of samples $n$ scales linearly with the dimensionality of the feature space $d$. However, the high dimensionality of the feature space and the high cost of collecting human-annotated data challenge the efficiency of traditional estimation methods. To remedy this, we leverage sparsity in the preference model and establish sharp estimation rates. We show that under the sparse random utility model, where the parameter of the reward function is $k$-sparse, the minimax optimal rate can be reduced to $\Theta(k/n \log(d/k))$. Furthermore, we analyze the $\ell_{1}$-regularized estimator and show that it achieves a near-optimal rate under mild assumptions on the Gram matrix. Experiments on synthetic data and LLM alignment data validate our theoretical findings, showing that sparsity-aware methods significantly reduce sample complexity and improve prediction accuracy.
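To convey the flavor of an $\ell_1$-regularized preference estimator, here is a schematic pure-Python sketch (not the authors' code): proximal gradient descent (ISTA) on a logistic preference loss over feature-difference vectors, with soft-thresholding as the $\ell_1$ proximal step. The synthetic sparse model, step size, and regularization weight are illustrative assumptions.

```python
import math
import random

def l1_preference_fit(X, y, lam=0.05, step=0.5, iters=300):
    # ISTA (proximal gradient) for l1-regularized logistic preference
    # learning: a generic sketch, not the paper's exact estimator.
    # X[i] is a feature-difference vector for one comparison,
    # y[i] = 1 if the first item was preferred.
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        # gradient of the average logistic loss
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
            for j in range(d):
                grad[j] += (p - yi) * xi[j] / n
        # gradient step followed by soft-thresholding (the l1 prox)
        for j in range(d):
            z = w[j] - step * grad[j]
            w[j] = math.copysign(max(abs(z) - step * lam, 0.0), z)
    return w

random.seed(1)
w_true = [2.0, 0.0, 0.0, -1.5, 0.0]          # k = 2 sparse reward parameter
X = [[random.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [1 if sum(a * b for a, b in zip(w_true, xi)) > 0 else 0 for xi in X]
w_hat = l1_preference_fit(X, y)
print([round(v, 2) for v in w_hat])  # large entries on coordinates 0 and 3
```

The soft-thresholding step drives the coordinates outside the true support toward zero, which is the mechanism behind the reduced sample complexity described above.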
Paperid:747
Authors:Runzhong Wang · Rui-Xi Wang · Mrunali Manjrekar · Connor Coley
Abstract:
Molecular machine learning has gained popularity with the advancements of geometric deep learning. In parallel, retrieval-augmented generation has become a principled approach commonly used with language models. However, the optimal integration of retrieval augmentation into molecular machine learning remains unclear. Graph neural networks stand to benefit from clever matching to understand the structural alignment of retrieved molecules to a query molecule. Neural graph matching offers a compelling solution by explicitly modeling node and edge affinities between two structural graphs while employing a noise-robust, end-to-end neural network to learn affinity metrics. We apply this approach to mass spectrum simulation and introduce MARASON, a novel model that incorporates neural graph matching to enhance a fragmentation-based neural network. Experimental results highlight the effectiveness of our design, with MARASON achieving 27% top-1 accuracy, a substantial improvement over the non-retrieval state-of-the-art accuracy of 19%. Moreover, MARASON outperforms both naive retrieval-augmented generation methods and traditional graph matching approaches.
Paperid:748
Authors:Shiqi Chen · Jinghan Zhang · Tongyao Zhu · Wei Liu · Siyang Gao · Miao Xiong · Manling Li · Junxian He
Abstract:
Vision-Language Models (VLMs) combine visual perception with the general capabilities, such as reasoning, of Large Language Models (LLMs). However, the mechanisms by which these two abilities can be combined and contribute remain poorly understood. In this work, we explore composing perception and reasoning through model merging that connects parameters of different models. Unlike previous works that often focus on merging models of the same kind, we propose merging models across modalities, enabling the incorporation of the reasoning capabilities of LLMs into VLMs. Through extensive experiments, we demonstrate that model merging offers a successful pathway to transfer reasoning abilities from LLMs to VLMs in a training-free manner. Moreover, we utilize the merged models to understand the internal mechanism of perception and reasoning and how merging affects it. We find that perception capabilities are predominantly encoded in the early layers of the model, whereas reasoning is largely facilitated by the middle-to-late layers. After merging, we observe that all layers begin to contribute to reasoning, whereas the distribution of perception abilities across layers remains largely unchanged. These observations shed light on the potential of model merging as a tool for multimodal integration and interpretation.
Paperid:749
Authors:Ziang Zhou · Zhihao DING · Jieming Shi · Qing Li · Shiqi Shen
Abstract:
Graph Neural Networks (GNNs) are pivotal in graph-based learning, particularly excelling in node classification. However, their scalability is hindered by the need for multi-hop data during inference, limiting their application in latency-sensitive scenarios. Recent efforts to distill GNNs into multi-layer perceptrons (MLPs) for faster inference often underutilize the layer-level insights of GNNs. In this paper, we present TINED, a novel approach that distills GNNs to MLPs on a layer-by-layer basis using Teacher Injection and Dirichlet Energy Distillation techniques. We focus on two key operations in GNN layers: feature transformation (FT) and graph propagation (GP). We recognize that FT is computationally equivalent to a fully-connected (FC) layer in MLPs. Thus, we propose directly transferring teacher parameters from an FT in a GNN to an FC layer in the student MLP, enhanced by fine-tuning. In TINED, the FC layers in an MLP replicate the sequence of FTs and GPs in the GNN. We also establish a theoretical bound for GP approximation. Furthermore, we note that FT and GP operations in GNN layers often exhibit opposing smoothing effects: GP is aggressive, while FT is conservative. Using Dirichlet energy, we develop a DE ratio to measure these effects and propose Dirichlet Energy Distillation to convey these characteristics from GNN layers to MLP layers. Extensive experiments show that TINED outperforms GNNs and leading distillation methods across various settings and seven datasets. Source code is available at \url{https://anonymous.4open.science/r/TINED-7527/}.
Paperid:750
Authors:Karthik Sridharan · Seung Won Wilson Yoo
Abstract:
We consider the problem of online learning where the sequence of actions played by the learner must adhere to an unknown safety constraint at every round. The goal is to minimize regret with respect to the best safe action in hindsight while simultaneously satisfying the safety constraint with high probability on each round. We provide a general meta-algorithm that leverages an online regression oracle to estimate the unknown safety constraint, and converts the predictions of an online learning oracle to predictions that adhere to the unknown safety constraint. On the theoretical side, our algorithm's regret can be bounded by the regret of the online regression and online learning oracles, the eluder dimension of the model class containing the unknown safety constraint, and a novel complexity measure that characterizes the difficulty of safe learning. We complement our result with an asymptotic lower bound that shows that the aforementioned complexity measure is necessary. When the constraints are linear, we instantiate our result to provide a concrete algorithm with $\sqrt{T}$ regret using a scaling transformation that balances optimistic exploration with pessimistic constraint satisfaction.
Paperid:751
Authors:Shiqing Gao · Jiaxin Ding · Luoyi Fu · Xinbing Wang
Abstract:
Constrained Reinforcement Learning (CRL) aims to maximize cumulative rewards while satisfying constraints. However, existing CRL algorithms often encounter significant constraint violations during training, limiting their applicability in safety-critical scenarios. In this paper, we identify the underestimation of the cost value function as a key factor contributing to these violations. To address this issue, we propose the Memory-driven Intrinsic Cost Estimation (MICE) method, which introduces intrinsic costs to mitigate underestimation and leverages bias to promote safer exploration. Inspired by flashbulb memory, where humans vividly recall dangerous experiences to avoid risks, MICE constructs a memory module that stores previously explored unsafe states to identify high-cost regions. The intrinsic cost is formulated as the visit count of the current state to these risk regions. Furthermore, we propose an extrinsic-intrinsic cost value function that incorporates intrinsic costs and adopts a bias correction strategy. Using this function, we formulate an optimization objective within the trust region, along with corresponding optimization methods. Theoretically, we provide convergence guarantees for the proposed cost value function and establish the worst-case constraint violation for the MICE update, ensuring fewer constraint violations compared to baselines. Extensive experiments demonstrate that MICE significantly reduces constraint violations while preserving policy performance comparable to baselines.
Paperid:752
Authors:Yunze Tong · Fengda Zhang · Zihao Tang · Kaifeng Gao · Kai Huang · Pengfei Lyu · Jun Xiao · Kun Kuang
Abstract:
Machine learning models often perform well on tabular data by optimizing average prediction accuracy. However, they may underperform on specific subsets due to inherent biases and spurious correlations in the training data, such as associations with non-causal features like demographic information. These biases lead to critical robustness issues as models may inherit or amplify them, resulting in poor performance where such misleading correlations do not hold. Existing mitigation methods have significant limitations: some require prior group labels, which are often unavailable, while others focus solely on the conditional distribution (P(Y|X)), upweighting misclassified samples without effectively balancing the overall data distribution (P(X)). To address these shortcomings, we propose a latent score-based reweighting framework. It leverages score-based models to capture the joint data distribution (P(X, Y)) without relying on additional prior information. By estimating sample density through the similarity of score vectors with neighboring data points, our method identifies underrepresented regions and upweights samples accordingly. This approach directly tackles inherent data imbalances, enhancing robustness by ensuring a more uniform dataset representation. Experiments on various tabular datasets under distribution shifts demonstrate that our method effectively improves performance on imbalanced data.
Paperid:753
Authors:Siavash Ameli · Chris van der Heide · Liam Hodgkinson · Fred Roosta · Michael Mahoney
Abstract:
Calculating or accurately estimating log-determinants of large positive semi-definite matrices is of fundamental importance in many machine learning tasks. While its cubic computational complexity can already be prohibitive, in modern applications even storing the matrices themselves can pose a memory bottleneck. To address this, we derive a novel hierarchical algorithm based on block-wise computation of the LDL decomposition for large-scale log-determinant calculation in memory-constrained settings. In extreme cases where matrices are highly ill-conditioned, accurately computing the full matrix itself may be infeasible. This is particularly relevant when considering kernel matrices at scale, including the empirical Neural Tangent Kernel (NTK) of neural networks trained on large datasets. Under the assumption of neural scaling laws in the test error, we show that the ratio of pseudo-determinants satisfies a power-law relationship, enabling the derivation of corresponding scaling laws. This allows for accurate estimation of NTK log-determinants from a tiny fraction of the full dataset; in our experiments, this results in a $\sim$100,000$\times$ speedup with improved accuracy over other state-of-the-art approaches. Using these techniques, we successfully estimate log-determinants for dense matrices of extreme sizes, which were previously deemed intractable and inaccessible due to their enormous scale and computational requirements.
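The block-wise idea can be illustrated with the classical Schur-complement identity logdet(A) = logdet(A11) + logdet(A22 - A21 A11^{-1} A12). Below is a schematic pure-Python sketch built on that identity; the paper's algorithm is a hierarchical block LDL method engineered for memory-constrained, large-scale settings, while this toy version only shows the recursion on a small symmetric positive-definite matrix.

```python
import math

def block_logdet(A):
    # Log-determinant of a symmetric positive-definite matrix via recursive
    # 2x2 block partitioning and the Schur complement:
    #   logdet(A) = logdet(A11) + logdet(A22 - A21 A11^{-1} A12).
    # A toy illustration of block-wise logdet, not the paper's LDL algorithm.
    n = len(A)
    if n == 1:
        return math.log(A[0][0])
    m = n // 2
    A11 = [row[:m] for row in A[:m]]
    A12 = [row[m:] for row in A[:m]]
    A21 = [row[:m] for row in A[m:]]
    A22 = [row[m:] for row in A[m:]]
    # solve A11 X = A12, then form the Schur complement S = A22 - A21 X
    X = solve(A11, A12)
    S = [[A22[i][j] - sum(A21[i][k] * X[k][j] for k in range(m))
          for j in range(n - m)] for i in range(n - m)]
    return block_logdet(A11) + block_logdet(S)

def solve(M, B):
    # Gauss-Jordan elimination with partial pivoting: returns X with M X = B.
    n = len(M)
    aug = [M[i][:] + B[i][:] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0.0:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

# SPD example: A = I + v v^T has det = 1 + v.v, so logdet = log(1 + 30)
v = [1.0, 2.0, 3.0, 4.0]
A = [[(1.0 if i == j else 0.0) + v[i] * v[j] for j in range(4)] for i in range(4)]
print(block_logdet(A))  # close to log(31)
```

Because each recursive call only touches one block and its Schur complement, this decomposition is the structural reason block-wise schemes can trade memory for extra passes over the matrix.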
Paperid:754
Authors:Gabriel Kasmi · Amandine Brunetto · Thomas Fel · Jayneel Parekh
Abstract:
A major limitation of deep neural networks in safety-critical contexts is their lack of transparency. Feature attribution methods address this by identifying which features influence a model's decisions. For high-dimensional inputs like images, attributions often appear as pixel-based heatmaps. However, extending pixel-based methods to non-image domains leads to suboptimal explanations. This paper introduces the Wavelet Attribution Method (WAM), which replaces pixels with wavelet coefficients as features for attribution. Wavelet coefficients, defined for square-integrable signals across dimensions, provide a mathematically grounded and semantically richer domain. We demonstrate WAM's effectiveness on audio, images, and volumes, where it quantitatively matches or outperforms existing gradient-based methods. Explanations generated by WAM leverage the spatial and scale-localized properties of wavelet coefficients, capturing both the where and what of a model's focus. We illustrate this by showing how WAM relates to model robustness, enables deeper volume interpretation, and identifies key audio components in noisy samples.
Paperid:755
Authors:Canhong Wen · Yihong Zuo · Wenliang Pan
Abstract:
Large Language Models (LLMs) have showcased remarkable performance across a range of tasks but are hindered by their massive parameter sizes, which impose significant computational and storage demands. Pruning has emerged as an effective solution to reduce model size, but traditional methods often involve inefficient retraining or rely on heuristic-based one-shot approaches that lack theoretical guarantees. In this paper, we reformulate the pruning problem as an $\ell_0$-penalized optimization problem and propose a monotone accelerated Iterative Hard Thresholding (mAIHT) method. Our approach combines solid theoretical foundations with practical effectiveness, offering a detailed theoretical analysis that covers convergence, convergence rates, and risk upper bounds. Through extensive experiments, we demonstrate that mAIHT outperforms state-of-the-art pruning techniques by effectively pruning the LLaMA-7B model across various evaluation metrics.
Paperid:756
Authors:Kexin Huang · Junkang Wu · Ziqian Chen · xue wang · Jinyang Gao · Bolin Ding · Jiancan Wu · Xiangnan He · Xiang Wang
Abstract:
Preference learning is critical for aligning large language models (LLMs) with human values, with the quality of preference datasets playing a crucial role in this process. While existing metrics primarily assess data quality based on either explicit or implicit reward margins, they often provide contradictory evaluations for the same data. To address this issue, we introduce the alignment potential metric, which quantifies the gap from the model's current implicit reward margin to the target explicit reward margin, thereby estimating the model's potential to align with the preference data. Empirical results demonstrate that training on data selected by this metric consistently enhances alignment performance, surpassing existing metrics across different base models and optimization objectives. Furthermore, our method extends to self-play data generation frameworks, where the metric is used to identify high-quality data within the self-generated content by LLMs. Under this data generation scenario, our method surpasses current state-of-the-art (SOTA) results across various training settings and demonstrates continuous improvements in alignment performance as dataset size and training iterations increase.
Paperid:757
Authors:Marta Aparicio Rodriguez · Xenia Miscouridou · Anastasia Borovykh
Abstract:
Despite significant advances in the quality and complexity of the generations in text-to-image models, prompting does not always lead to the desired outputs. Controlling model behaviour by directly steering intermediate model activations has emerged as a viable alternative, allowing one to reach concepts in latent space that may otherwise remain inaccessible by prompt. In this work, we introduce a set of experiments to deepen our understanding of concept reachability. We design a training data setup with three key obstacles: scarcity of concepts, underspecification of concepts in the captions, and data biases with tied concepts. Our results show: (i) concept reachability in latent space exhibits a distinct phase transition, with only a small number of samples being sufficient to enable reachability, (ii) where in the latent space the intervention is performed critically impacts reachability, showing that certain concepts are reachable only at certain stages of transformation, and (iii) while prompting ability rapidly diminishes with a decrease in the quality of the dataset, concepts often remain reliably reachable through steering. Model providers can leverage this to bypass costly retraining and dataset curation and instead innovate with user-facing control mechanisms.
Paperid:758
Authors:Chengpiao Huang · Yuhang Wu · Kaizheng Wang
Abstract:
We investigate the reliable use of simulated survey responses from large language models (LLMs) through the lens of uncertainty quantification. Our approach converts synthetic data into confidence sets for population parameters of human responses, addressing the distribution shift between the simulated and real populations. A key innovation lies in determining the optimal number of simulated responses: too many produce overly narrow confidence sets with poor coverage, while too few yield excessively loose estimates. To resolve this, our method adaptively selects the simulation sample size, ensuring valid average-case coverage guarantees. It is broadly applicable to any LLM, irrespective of its fidelity, and any procedure for constructing confidence sets. Additionally, the selected sample size quantifies the degree of misalignment between the LLM and the target human population. We illustrate our method on real datasets and LLMs.
Paperid:759
Authors:Guancheng Wan · Zijie Huang · Wanjia Zhao · Xiao Luo · Yizhou Sun · Wei Wang
Abstract:
Coupled dynamical systems govern essential phenomena across physics, biology, and engineering, where components interact through complex dependencies. While Graph Ordinary Differential Equations (GraphODE) offer a powerful framework to model these systems, their generalization capabilities degrade severely under limited observational training data due to two fundamental flaws: (i) the entanglement of static attributes and dynamic states in the initialization process, and (ii) the reliance on context-specific coupling patterns during training, which hinders performance in unseen scenarios. In this paper, we propose a Generalizable GraphODE with disentanglement and regularization (GREAT) to address these challenges. Through systematic analysis via the Structural Causal Model, we identify backdoor paths that undermine generalization and design two key modules to mitigate their effects. The Dynamic-Static Equilibrium Decoupler (DyStaED) disentangles static and dynamic states via orthogonal subspace projections, ensuring robust initialization. Furthermore, the Causal Mediation for Coupled Dynamics (CMCD) employs variational inference to estimate latent causal factors, reducing spurious correlations and enhancing universal coupling dynamics. Extensive experiments across diverse dynamical systems demonstrate that GREAT outperforms state-of-the-art methods in both in-distribution and out-of-distribution settings.
Paperid:760
Authors:Jiawei Cao · Chaochen Gu · Changsheng Lu · Hao Cheng · Xiaofeng Zhang · Kaijie Wu
Abstract:
Polygon-based object representations efficiently model object boundaries but are limited by high optimization complexity, which hinders their adoption compared to more flexible pixel-based methods. In this paper, we introduce a novel vertex regression loss grounded in Fourier elliptic descriptors, which removes the need for rasterization or heuristic approximations and resolves ambiguities in boundary point assignment through frequency-domain matching. To advance polygon-based instance segmentation, we further propose EFDTR (\textbf{E}lliptical \textbf{F}ourier \textbf{D}escriptor \textbf{Tr}ansformer), an end-to-end learnable framework that leverages the expressiveness of Fourier-based representations. The model achieves precise contour predictions through a two-stage approach: the first stage predicts elliptical Fourier descriptors for global contour modeling, while the second stage refines contours for fine-grained accuracy. Experimental results on the COCO dataset show that EFDTR outperforms existing polygon-based methods, offering a promising alternative to pixel-based approaches.
Paperid:761
Authors:Shahaf Bassan · Yizhak Elboher · Tobias Ladner · Matthias Althoff · Guy Katz
Abstract:
Despite significant advancements in post-hoc explainability techniques for neural networks, many current methods rely on heuristics and do not provide formally provable guarantees over the explanations provided. Recent work has shown that it is possible to obtain explanations with formal guarantees by identifying subsets of input features that are sufficient to determine that predictions remain unchanged, using neural network verification techniques. Despite the appeal of these explanations, their computation faces significant scalability challenges. In this work, we address this gap by proposing a novel abstraction-refinement technique for efficiently computing provably sufficient explanations of neural network predictions. Our method abstracts the original large neural network by constructing a substantially reduced network, where a sufficient explanation of the reduced network is also provably sufficient for the original network, hence significantly speeding up the verification process. If the explanation is insufficient on the reduced network, we iteratively refine the network size by gradually increasing it until convergence. Our experiments demonstrate that our approach enhances the efficiency of obtaining provably sufficient explanations for neural network predictions while additionally providing a fine-grained interpretation of the network's predictions across different abstraction levels.
Paperid:762
Authors:Weihang Ran · Wei Yuan · Yinqiang Zheng
Abstract:
Damage to imaging systems and complex external environments often introduce corruption, which can impair the performance of deep learning models pretrained on high-quality image data. Previous methods have focused on restoring degraded images or fine-tuning models to adapt to out-of-distribution data. However, these approaches struggle with complex, unknown corruptions and often reduce model accuracy on high-quality data. Inspired by the use of warning colors and camouflage in the real world, we propose designing a robust appearance that can enhance model recognition of low-quality image data. Furthermore, we demonstrate that certain universal features in radiance fields can be applied across objects of the same class with different geometries. We also examine the impact of different proxy models on the transferability of robust appearances. Extensive experiments demonstrate the effectiveness of our proposed method, which outperforms existing image restoration and model fine-tuning approaches by 34.97\% to 73.97\% in accuracy across different experimental settings, and retains effectiveness when transferred to models with different architectures.
Paperid:763
Authors:Jiawei Zhang · Xuan Yang · Taiqi Wang · Yu Yao · Aleksandr Petiushko · Bo Li
Abstract:
Traditional autonomous driving systems often struggle to integrate high-level reasoning with low-level control, resulting in suboptimal and sometimes unsafe driving behaviors. The emergence of multimodal large language models (MLLMs), which can process both visual and textual data, presents an opportunity to unify perception and reasoning tasks within a single framework. However, effectively embedding precise safety knowledge into MLLMs for autonomous driving remains a significant challenge. To address this, we propose SafeAuto, a novel framework that enhances MLLM-based autonomous driving systems by incorporating both unstructured and structured knowledge. Specifically, we first introduce the Position-Dependent Cross-Entropy (PDCE) loss function, designed to improve the accuracy of low-level control signal predictions when numerical values are represented as text. Second, to ensure safe autonomous driving by explicitly integrating precise safety knowledge into the MLLM, we develop a reasoning component for SafeAuto. This component translates driving safety regulations into first-order logic rules (e.g., "red light $\implies$ stop") and incorporates these rules into a probabilistic graphical model, such as a Markov Logic Network (MLN). The MLN is trained to verify the predicted next actions using environmental attributes identified by attribute recognition models (e.g., detecting a red light) to form the predicates. Additionally, we construct a Multimodal Retrieval-Augmented Generation (Multimodal RAG) model that leverages video data, control signals, and environmental attributes to learn more effectively from past similar driving experiences. By integrating PDCE, MLN, and Multimodal RAG, SafeAuto significantly outperforms existing baselines across multiple datasets.
This advancement paves the way for more accurate, reliable, and safer autonomous driving systems that effectively learn from experience, adhere to traffic regulations, and execute precise control actions.
Paperid:764
Authors:Puli Wang · Yu Qi · Yueming Wang · Gang Pan
Abstract:
The primary goal of brain-computer interfaces (BCIs) is to establish a direct linkage between neural activities and behavioral actions via neural decoders. Due to the nonstationary property of neural signals, BCIs trained on one day usually obtain degraded performance on other days, hindering the user experience. Existing studies attempted to address this problem by aligning neural signals across different days. However, these neural adaptation methods may exhibit instability and poor performance when only a few trials are available for alignment, limiting their practicality in real-world BCI deployment. To achieve efficient and stable neural adaptation with few trials, we propose Flow-Based Distribution Alignment (FDA), a novel framework that utilizes flow matching to learn flexible neural representations with stable latent dynamics, thereby facilitating source-free domain alignment through likelihood maximization. The latent dynamics of the FDA framework are theoretically proven to be stable using Lyapunov exponents, allowing for robust adaptation. Further experiments across multiple motor cortex datasets demonstrate the superior performance of FDA, achieving reliable results with fewer than five trials. Our FDA approach offers a novel and efficient solution for few-trial neural data adaptation, with significant potential for improving the long-term viability of real-world BCI applications.
Paperid:765
Authors:Tianyi Liang · Jiangqi Liu · Yifei Huang · Shiqi Jiang · Jianshen Shi · Changbo Wang · Chenhui Li
Abstract:
Text-to-image (T2I) generation has made remarkable progress in producing high-quality images, but a fundamental challenge remains: creating backgrounds that naturally accommodate text placement without compromising image quality. This capability is non-trivial for real-world applications like graphic design, where a clear visual hierarchy between content and text is essential. Prior work has primarily focused on arranging layouts within existing static images, leaving unexplored the potential of T2I models for generating text-friendly backgrounds. We present TextCenGen, a training-free approach that actively relocates objects before optimizing text regions, rather than directly reducing cross-attention, which degrades image quality. Our method introduces: (1) a force-directed graph approach that detects conflicting objects and guides their relocation using cross-attention maps, and (2) a spatial attention constraint that ensures smooth background generation in text regions. Our method is plug-and-play, requiring no additional training while balancing both semantic fidelity and visual quality. Evaluated on our proposed text-friendly T2I benchmark of 27,000 images across three seed datasets, TextCenGen outperforms existing methods by achieving 23\% lower saliency overlap in text regions while maintaining 98\% of the original semantic fidelity measured by CLIP score and our proposed Visual-Textual Concordance Metric (VTCM).
Paperid:766
Authors:Akashah Shabbir · Ilmuz Zaman Mohammed Zumri · Mohammed Bennamoun · Fahad Khan · Salman Khan
Abstract:
Recent advances in large multimodal models (LMMs) have recognized fine-grained grounding as an essential factor in visual understanding and dialogue. However, the benefits of such representation in LMMs are limited to the natural image domain, and these models perform poorly for remote sensing (RS). The distinct overhead viewpoint, scale variation, and presence of small objects in high-resolution RS imagery present a unique challenge for region-level comprehension. Moreover, the development of the grounded conversation capability of LMMs within RS is hindered by the lack of granular, RS domain-specific grounded data. Addressing these limitations, we propose GeoPixel, the first end-to-end high-resolution RS-LMM that supports pixel-level grounding. This capability allows fine-grained visual perception by generating interleaved masks in conversation. GeoPixel supports up to 4K HD resolution in any aspect ratio, making it ideal for high-precision RS image analysis. To support grounded conversation generation (GCG) in RS imagery, we curate a visually grounded dataset, GeoPixelD, through a semi-automated pipeline that utilizes set-of-marks prompting and spatial priors tailored for RS data to methodically control the data generation process. GeoPixel demonstrates superior performance in pixel-level comprehension, surpassing existing LMMs in both single-target and multi-target segmentation tasks. Our methodological ablation studies validate the effectiveness of each component in the overall architecture. Our code and data will be publicly released.
Paperid:767
Authors:Alina Ene · Alessandro Epasto · Vahab Mirrokni · Hoai-An Nguyen · Huy Nguyen · David Woodruff · Peilin Zhong
Abstract:
In the maximum coverage problem we are given $d$ subsets from a universe $[n]$, and the goal is to output $k$ subsets such that their union covers the largest possible number of distinct items. We present the first algorithm for maximum coverage in the turnstile streaming model, where updates which insert or delete an item from a subset arrive one by one. Notably, our algorithm uses only $\mathrm{poly}\log n$ update time. We also present turnstile streaming algorithms for targeted and general fingerprinting for risk management, where the goal is to determine which features pose the greatest re-identification risk in a dataset. As part of our work, we give a result of independent interest: an algorithm to compute the complement of the $p^{\text{th}}$ frequency moment of a vector for $p \geq 2$. Empirical evaluation confirms the practicality of our fingerprinting algorithms, demonstrating a speedup of up to $210\times$ over prior work.
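For readers unfamiliar with the objective, the classic offline greedy baseline for maximum coverage is a few lines. This sketch only illustrates the problem being solved; it is not the paper's turnstile streaming algorithm, which must handle insertions and deletions with polylogarithmic update time.

```python
def greedy_max_coverage(subsets, k):
    """Offline greedy for maximum coverage: repeatedly pick the subset
    covering the most still-uncovered items. Achieves the classical
    (1 - 1/e) approximation guarantee."""
    covered, chosen = set(), []
    for _ in range(k):
        gains = [len(s - covered) for s in subsets]
        best = max(range(len(subsets)), key=gains.__getitem__)
        if gains[best] == 0:  # nothing new can be covered
            break
        chosen.append(best)
        covered |= subsets[best]
    return chosen, covered
```

For example, with subsets `[{1,2,3}, {3,4}, {5,6}]` and `k=2`, greedy picks subsets 0 and 2, covering five distinct items.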
Paperid:768
Authors:Minghao Wu · Thuy-Trang Vu · Lizhen Qu · Reza Haffari
Abstract:
The performance of large language models (LLMs) is strongly influenced by the quality and diversity of data used during supervised fine-tuning (SFT). However, current data selection methods often prioritize one aspect over the other, resulting in suboptimal training outcomes. To address this, we formulate data selection as a set cover problem and present GraphFilter, a novel approach that balances both quality and diversity in data selection. GraphFilter models the dataset as a bipartite graph connecting sentences to their constituent n-grams, then employs a priority function that combines quality and diversity metrics multiplicatively. GraphFilter iteratively selects the sentence with the highest priority, removes covered n-grams from the bipartite graph, and recomputes priorities to reflect the changing data landscape. We validate GraphFilter using three model backbones across six widely-used benchmarks, demonstrating that it outperforms nine existing baselines in both model performance and computational efficiency. Further analysis shows that our design choices lead to more effective subset selection, underscoring the value of instruction diversity and providing insights into how quality and diversity interact across different subset sizes.
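The iterative selection loop described above can be sketched in a few lines. This is a toy re-implementation of the abstract's description, not the authors' code: priority is the product of a given quality score and the number of not-yet-covered n-grams, the argmax sentence is selected, and its n-grams are removed from the pool.

```python
def ngrams(sentence, n=2):
    """Set of word n-grams in a sentence (whitespace tokenization)."""
    toks = sentence.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def graph_filter(sentences, quality, budget, n=2):
    """Toy GraphFilter-style loop: priority = quality x coverage gain,
    pick the argmax, mark its n-grams covered, recompute, repeat."""
    grams = {i: ngrams(s, n) for i, s in enumerate(sentences)}
    covered, selected = set(), []
    for _ in range(budget):
        prio = {i: quality[i] * len(g - covered)
                for i, g in grams.items() if i not in selected}
        if not prio:
            break
        best = max(prio, key=prio.get)
        if prio[best] == 0:  # no remaining diversity gain
            break
        selected.append(best)
        covered |= grams[best]
    return selected
```

Note how the multiplicative priority makes a duplicate of an already-selected sentence worthless (its coverage gain is zero) regardless of its quality score.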
Paperid:769
Authors:Jintao Zhang · Haofeng Huang · Pengle Zhang · Jia wei · Jun Zhu · Jianfei Chen
Abstract:
Although quantization for linear layers has been widely used, its application to accelerate the attention process remains limited. To further enhance the efficiency of attention computation compared to SageAttention while maintaining precision, we propose SageAttention2, which utilizes significantly faster 4-bit matrix multiplication (Matmul) alongside additional precision-enhancing techniques. First, we propose to quantize the matrices $(Q, K)$ to INT4 at a hardware-friendly thread-level granularity and quantize the matrices $(\widetilde P, V)$ to FP8. Second, we propose a method to smooth $Q$, enhancing the accuracy of INT4 $QK^\top$. Third, we propose a two-level accumulation strategy for $\widetilde PV$ to enhance the accuracy of FP8 $\widetilde PV$. The operations per second (OPS) of SageAttention2 surpass FlashAttention2 and xformers by about **3x** and **4.5x** on RTX4090, respectively. Moreover, SageAttention2 matches the speed of FlashAttention3 (FP8) on Hopper GPUs, while delivering much higher accuracy. Comprehensive experiments confirm that our approach incurs negligible end-to-end metrics loss across diverse models, including those for language, image, and video generation.
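The basic mechanics of INT4 quantization can be shown with a minimal sketch. This simplifies the paper's hardware-friendly thread-level granularity down to a single scale per row, and says nothing about the smoothing of $Q$ or the two-level FP8 accumulation; it only shows the symmetric scale-round-clip scheme that 4-bit Matmul relies on.

```python
def quantize_int4(row):
    """Symmetric per-row INT4 quantization: map floats to integers in
    [-7, 7] with a single float scale (a simplified stand-in for the
    per-thread-block granularity described in the abstract)."""
    scale = max((abs(v) for v in row), default=0.0) / 7.0 or 1.0
    q = [max(-7, min(7, round(v / scale))) for v in row]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats; error per entry is at most scale/2."""
    return [v * scale for v in q]
```

Because the integer range is only [-7, 7], the reconstruction error grows with the row's dynamic range, which is why the paper needs the additional precision-enhancing techniques (smoothing, two-level accumulation) to keep end-to-end metrics intact.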
Paperid:770
Authors:Yicheng Luo · Bowen Zhang · Zhen Liu · Qianli Ma
Abstract:
Multi-scale information is crucial for multivariate time series modeling. However, most existing time series multi-scale analysis methods treat all variables in the same manner, making them unsuitable for Irregular Multivariate Time Series (IMTS), where variables have distinct native scales/sampling rates. To fill this gap, we propose Hi-Patch, a hierarchical patch graph network. Hi-Patch encodes each observation as a node, captures local temporal and inter-variable dependencies of densely sampled variables through an intra-patch graph layer, and obtains patch-level nodes through aggregation. These nodes are then updated and re-aggregated through a stack of inter-patch graph layers, where several scale-specific graph networks progressively extract more global temporal and inter-variable features of both sparsely and densely sampled variables at specific scales. The output of the last layer is fed into task-specific decoders to adapt to different downstream tasks. Experiments on 8 datasets demonstrate that Hi-Patch outperforms state-of-the-art models in IMTS forecasting and classification tasks. Code is available at: https://anonymous.4open.science/r/Hi-Patch.
Paperid:771
Authors:Haibo Zhao · Dian Wang · Yizhe Zhu · Xupeng Zhu · Owen Howell · Linfeng Zhao · Yaoyao Qian · Robin Walters · Robert Platt
Abstract:
Recent advances in hierarchical policy learning highlight the advantages of decomposing systems into high-level and low-level agents, enabling efficient long-horizon reasoning and precise fine-grained control. However, the interface between these hierarchy levels remains underexplored, and existing hierarchical methods often ignore domain symmetry, resulting in the need for extensive demonstrations to achieve robust performance. To address these issues, we propose the Hierarchical Equivariant Policy (HEP), a novel hierarchical policy framework. We propose a frame transfer interface for hierarchical policy learning, which uses the high-level agent's output as a coordinate frame for the low-level agent, providing a strong inductive bias while retaining flexibility. Additionally, we integrate domain symmetries into both levels and theoretically demonstrate the system's overall equivariance. HEP achieves state-of-the-art performance in complex robotic manipulation tasks, demonstrating significant improvements in both simulation and real-world settings.
Paperid:772
Authors:Jiaru Qian · Guancheng Wan · Wenke Huang · Guibin Zhang · Yuxin Wu · Bo Du · Mang Ye
Abstract:
Federated Graph Learning (FGL) offers an effective approach to collaboratively training Graph Neural Networks (GNNs) while maintaining privacy. Nevertheless, communication efficiency becomes a critical bottleneck in environments with limited resources. In this context, one-shot FGL emerges as a promising solution by restricting communication to a single round. However, prevailing FGL methods face two key challenges in the one-shot setting: 1) They heavily rely on gradual personalized optimization over multiple rounds, undermining the capability of the global model to efficiently generalize across diverse graph structures. 2) They are prone to overfitting to local data distributions due to extreme structural bias, leading to catastrophic forgetting. To address these issues, we introduce GHOST, an innovative one-shot FGL framework. In GHOST, we establish a proxy model for each client to leverage diverse local knowledge and integrate it to train the global model. During training, we compute the Topology-Consistency Criterion to identify and consolidate parameters essential for capturing topological knowledge, thereby mitigating catastrophic forgetting. Extensive experiments on real-world tasks demonstrate the superiority and generalization capability of GHOST. The code is available for anonymous access at https://anonymous.4open.science/r/GHOST-ICML4443.
Paperid:773
Authors:Chengbo Zang · Mehmet Turkcan · Gil Zussman · Zoran Kostic · Javad Ghaderi
Abstract:
We propose a framework for adaptive data collection aimed at robust learning in multi-distribution scenarios under a fixed data collection budget. In each round, the algorithm selects a distribution source to sample from for data collection, incorporates the collected samples into the dataset, updates the model parameters, and repeats the process. The objective is to find the model parameters that minimize the expected loss across all the data sources. Our approach integrates upper-confidence-bound (UCB) sampling with online gradient descent (OGD) to dynamically collect and annotate data from multiple sources. By bridging online optimization and multi-armed bandits, we provide theoretical guarantees for our UCB-OGD approach, demonstrating that it achieves a minimax regret of $O(T^{\frac{1}{2}}(K\ln T)^{\frac{1}{2}})$ over $K$ data sources after $T$ rounds. We further provide a lower bound showing that the result is optimal up to a $\ln T$ factor. Extensive evaluations on standard datasets and a real-world testbed for object detection in smart-city intersections validate the consistent performance improvements of our method compared to baselines such as random sampling and various active learning methods.
Paperid:774
Authors:Amin Nejatbakhsh · Yixin Wang
Abstract:
Neural circuits produce signals that are complex and nonlinear. To facilitate the understanding of neural dynamics, a popular approach is to fit state space models (SSM) to data and analyze the dynamics of the low-dimensional latent variables. Despite the power of SSM in explaining neural circuit dynamics, it has been shown that these models merely capture statistical associations in the data and cannot be causally interpreted. Therefore, an important research problem is to build models that can predict neural dynamics under causal manipulations. Here, we propose interventional state space models (iSSM), a class of causal models that can predict neural responses to novel perturbations. We draw on recent advances in causal dynamical systems and present theoretical results for the identifiability of iSSM. In simulations of the motor cortex, we show that iSSM can recover the true latents and the underlying dynamics. In addition, we illustrate two applications of iSSM in biological datasets. First, we apply iSSM to a dataset of calcium recordings from ALM neurons in mice during photostimulation. Second, we apply iSSM to a dataset of electrophysiological recordings from macaque dlPFC during micro-stimulation. We show that in both cases iSSM outperforms SSM and yields identifiable parameters.
Paperid:775
Authors:Gursimran Singh · Xinglu Wang · Yifan Hu · Timothy Yu · Linzi Xing · Wei Jiang · Zhefeng Wang · Bai Xiaolong · Yi Li · Ying Xiong · Yong Zhang · Zhenan Fan
Abstract:
Large Multimodal Models (LMMs) extend Large Language Models (LLMs) by handling diverse inputs such as images, audio, and video, but at the cost of adding a multimodal encoding stage that increases both computational and memory overhead. This stage negatively impacts key Service Level Objectives (SLOs) like time to first token (TTFT) and end-to-end throughput (E2ETP). We introduce Encode-Prefill-Decode (EPD) Disaggregation, a novel framework that separates the encoding, prefill, and decode stages onto dedicated resources. Unlike current systems, which bundle encoding and prefill together, our approach decouples these stages, unlocking new opportunities and optimizations. These include a new mechanism to cache multimedia tokens for efficient transfer, a novel way to parallelize encoding load within a request, a module to find the optimal resource allocation for disaggregated serving, and a novel role-switching method to handle changing workload characteristics. Experimental evaluations with popular LMMs show substantial gains in memory efficiency (up to 15$\times$ less utilization), batch sizes (up to 22$\times$ larger), 10$\times$ more images per request, and 2.2$\times$ larger KV caches. Further, it leads to significant improvements in latency metrics (TTFT up to 71\% reduction) and end-to-end throughput (up to 57\% reduction), compared to systems that do not disaggregate.
Paperid:776
Authors:Yang Zhou · Hongyi Liu · Zhuoming Chen · Yuandong Tian · Beidi Chen
Abstract:
Recently, long-context large language models (LLMs) have shown strong performance in information retrieval and long-document QA. However, to tackle the most challenging intellectual problems, LLMs must reason effectively in long and complex contexts (e.g., frontier mathematical research). Studying how LLMs handle increasing reasoning complexity and context length is essential, yet existing benchmarks lack a solid basis for quantitative evaluation. Inspired by the abstraction of GSM-8K problems as computational graphs, and the ability to introduce noise by adding unnecessary nodes and edges, we develop a grade-school math problem generator capable of producing arithmetic problems with infinite difficulty and context length under fine-grained control. Using our newly synthesized GSM-$\infty$ benchmark, we comprehensively evaluate existing LLMs. We find a consistent sigmoid decline in reasoning performance as complexity increases, along with a systematic inference scaling trend: exponentially increasing inference computation yields only linear performance gains. These findings underscore the fundamental limitations of current long-context LLMs and the key challenges in scaling reasoning capabilities. Our GSM-$\infty$ benchmark provides a scalable and controllable testbed for systematically studying and advancing LLM reasoning in long and complex contexts.
Paperid:777
Authors:Ziyue Zhang · Mingbao Lin · Shuicheng YAN · Rongrong Ji
Abstract:
This paper introduces EasyInv, an easy yet novel approach that significantly advances the field of DDIM Inversion by addressing the inherent inefficiencies and performance limitations of traditional iterative optimization methods. At the core of EasyInv is a refined strategy for approximating inversion noise, which is pivotal for enhancing the accuracy and reliability of the inversion process. By prioritizing the initial latent state, which encapsulates rich information about the original images, EasyInv steers clear of the iterative refinement of noise items. Instead, we introduce a methodical aggregation of the latent state from the preceding time step with the current state, effectively increasing the influence of the initial latent state and mitigating the impact of noise. We illustrate that EasyInv is capable of delivering results that are on par with or exceed those of the conventional DDIM Inversion approach, especially under conditions where the model's precision is limited or computational resources are scarce. Concurrently, EasyInv offers an approximately threefold improvement in inference efficiency over off-the-shelf iterative optimization techniques. It can be added to almost any existing inversion method with only four lines of code.
Paperid:778
Authors:Hao Li · Qi Lv · Rui Shao · Xiang Deng · Yinchuan Li · Jianye Hao · Liqiang Nie
Abstract:
Transforming complex actions into discrete skill abstractions has demonstrated strong potential for robotic manipulation. Existing approaches mainly leverage latent variable models, e.g., VQ-VAE, to learn skill abstractions through learned vectors (codebooks), but they suffer from codebook collapse and struggle to model the causal relationships between learned skills. To address these limitations, we present Skill Training with Augmented Rotation (STAR), a framework that advances both skill learning and composition to complete complex behaviors. Specifically, to prevent codebook collapse, we devise rotation-augmented residual skill quantization (RaRSQ). It encodes relative angles between encoder outputs into the gradient flow by a rotation-based gradient mechanism. Points within the same skill code are forced to be either pushed apart or pulled closer together depending on gradient directions. Further, to capture the causal relationships between skills, we present the causal skill transformer (CST), which explicitly models dependencies between skill representations through an autoregressive mechanism for coherent action generation. Extensive experiments demonstrate the superiority of STAR on both the LIBERO benchmark and real-world tasks, with around 12% improvement over the baselines.
Paperid:779
Authors:Yiran Wang · Chenshu Liu · Yunfan Li · Sanae Amani · Bolei Zhou · Lin Yang
Abstract:
The exploration \& exploitation dilemma poses significant challenges in reinforcement learning (RL). Recently, curiosity-based exploration methods have achieved great success in tackling hard-exploration problems. However, they necessitate extensive hyperparameter tuning on different environments, which heavily limits the applicability and accessibility of this line of methods. In this paper, we characterize this problem via analysis of the agent behavior, concluding that choosing a proper hyperparameter is fundamentally difficult. We then identify the difficulty and instability of the optimization when the agent learns with curiosity. We propose our method, hyperparameter-robust exploration (\textbf{Hyper}), which extensively mitigates the problem by effectively regularizing the visitation of the exploration and decoupling the exploitation to ensure stable training. We theoretically justify that \textbf{Hyper} is provably efficient in the function approximation setting and empirically demonstrate its appealing performance and robustness in various environments.
Paperid:780
Authors:Hengyuan Hu · Aniket Das · Dorsa Sadigh · Nima Anari
Abstract:
Denoising Diffusion Probabilistic Models (DDPMs) have emerged as powerful tools for generative modeling. However, their sequential computation requirements lead to significant inference-time bottlenecks. In this work, we utilize the connection between DDPMs and Stochastic Localization to prove that, under an appropriate reparametrization, the increments of DDPM satisfy an exchangeability property. This general insight enables near-black-box adaptation of various performance optimization techniques from autoregressive models to the diffusion setting. To demonstrate this, we introduce _Autospeculative Decoding_ (ASD), an extension of the widely used speculative decoding algorithm to DDPMs that does not require any auxiliary draft models. Our theoretical analysis shows that ASD achieves a $\tilde{O}(K^{\frac{1}{3}})$ parallel runtime speedup over the $K$ step sequential DDPM. We also demonstrate that a practical implementation of autospeculative decoding accelerates DDPM inference significantly in various domains.
Paperid:781
Authors:Alexander Tyurin
Abstract:
We study the classical optimization problem $\min_{x \in \mathbb{R}^d} f(x)$ and analyze the gradient descent (GD) method in both nonconvex and convex settings. It is well-known that, under the $L$-smoothness assumption ($\|\nabla^2 f(x)\| \leq L$), the optimal point minimizing the quadratic upper bound $f(x_k) + \langle \nabla f(x_k), x_{k+1} - x_k \rangle + \frac{L}{2} \|x_{k+1} - x_k\|^2$ is $x_{k+1} = x_k - \gamma_k \nabla f(x_k)$ with step size $\gamma_k = \frac{1}{L}$. Surprisingly, a similar result can be derived under the $\ell$-generalized smoothness assumption ($\|\nabla^2 f(x)\| \leq \ell(\|\nabla f(x)\|)$). In this case, we derive the step size $$\gamma_k = \int_{0}^{1} \frac{dv}{\ell(\|\nabla f(x_k)\| + \|\nabla f(x_k)\| v)}.$$ Using this step size rule, we improve upon existing theoretical convergence rates and obtain new results in several previously unexplored setups.
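The step-size integral is easy to evaluate numerically. As an illustrative instance (our choice, not necessarily the paper's), take the $(L_0, L_1)$-smoothness function $\ell(s) = L_0 + L_1 s$, for which the integral has the closed form $\ln\big((L_0 + 2L_1 g)/(L_0 + L_1 g)\big)/(L_1 g)$ with $g = \|\nabla f(x_k)\|$; a midpoint-rule quadrature matches it, and a constant $\ell(s) = L$ recovers the classical $1/L$ step.

```python
import math

def step_size(ell, grad_norm, m=10_000):
    """Midpoint-rule evaluation of the generalized-smoothness step size
    gamma_k = integral over v in [0, 1] of dv / ell(g + g*v),
    where g is the gradient norm at x_k."""
    g = grad_norm
    return sum(1.0 / ell(g + g * ((i + 0.5) / m)) for i in range(m)) / m

# Example instance: ell(s) = L0 + L1*s with illustrative constants.
L0, L1, g = 2.0, 0.5, 3.0
gamma_num = step_size(lambda s: L0 + L1 * s, g)
gamma_closed = math.log((L0 + 2 * L1 * g) / (L0 + L1 * g)) / (L1 * g)
```

The numerical and closed-form values agree to high precision, and `step_size(lambda s: L, g)` returns `1/L` for any gradient norm, consistent with the classical rule.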
Paperid:782
Authors:Sally Zhu · Ahmed Ahmed · Rohith Kuditipudi · Percy Liang
Abstract:
Motivated by liability and intellectual property concerns over open-weight models, we consider the following problem: given the weights of two models, can we test whether they were trained independently---i.e., from independent random initializations? We consider two settings: constrained and unconstrained. In the constrained setting, we make assumptions about model architecture and training and propose a family of statistical tests that yield exact p-values with respect to the null hypothesis that the models are trained from independent random initializations. These p-values are valid regardless of the composition of either model's training data; we compute them by simulating exchangeable copies of each model under our assumptions and comparing various similarity measures of weights and activations between the original two models versus these copies. We report the p-values from these tests on pairs of 21 open-weight models (210 total pairs) and find we correctly identify all pairs of non-independent models. Notably, our tests remain effective even if one of the models was fine-tuned for many tokens; we accurately detect dependence between Llama 2-7B and Llemma-7B, even though the latter was fine-tuned on an additional 750B tokens (37.5% of the original Llama 2-7B training budget). In the unconstrained setting, where we make no assumptions about training procedures, allow changes to model architecture, and permit adversarial evasion attacks, the previous tests no longer work; notably, an adversary can evade detection with simple transformations of model weights (e.g., permuting hidden units) that do not change model output. Instead, we propose a new test which matches hidden activations between two models, and use it to construct a test that is robust to these transformations and to changes in model architecture; the test can also perform localized testing: identifying specific non-independent components of models.
Though we no longer obtain exact p-values from this test, empirically we find that it reliably distinguishes non-independent models, much as a p-value would. Notably, we can use the test to identify specific parts of one model that are derived from another (e.g., how Llama 3.1-8B was pruned to initialize Llama 3.2-3B, or shared layers between Mistral-7B and StripedHyena-7B), and it is even robust to retraining individual layers of either model from scratch.
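The evasion transformation mentioned above, permuting hidden units, is easy to demonstrate. Below is a sketch with a toy two-layer ReLU network and hypothetical weights (not any of the models discussed): permuting the rows of the first layer together with the columns of the second layer changes the weight matrices but leaves the computed function unchanged, which is why naive weight-similarity tests can be fooled.

```python
def mlp(x, W1, b1, W2):
    """Tiny 2-layer MLP: y = W2 @ relu(W1 @ x + b1), pure-Python lists."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    return [sum(w * hi for w, hi in zip(row, h)) for row in W2]

def permute_hidden(W1, b1, W2, perm):
    """Permute hidden units: reorder rows of (W1, b1) and, consistently,
    columns of W2. The resulting network computes the same function."""
    W1p = [W1[p] for p in perm]
    b1p = [b1[p] for p in perm]
    W2p = [[row[p] for p in perm] for row in W2]
    return W1p, b1p, W2p

# Hypothetical weights: 2 inputs, 3 hidden units, 1 output.
W1 = [[1.0, 2.0], [3.0, 4.0], [5.0, -6.0]]
b1 = [0.1, -0.2, 0.3]
W2 = [[1.0, -1.0, 2.0]]
x = [0.5, -1.0]
W1p, b1p, W2p = permute_hidden(W1, b1, W2, perm=[2, 0, 1])
```

The permuted weights no longer match the originals entry-wise, yet the outputs agree exactly; the activation-matching test described in the abstract is designed to see through precisely this kind of transformation.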
Paperid:783
Authors:Yao Shu · Wenyang Hu · See-Kiong Ng · Bryan Kian Hsiang Low · Fei Yu
Abstract:
Large Language Models (LLMs) have become indispensable in numerous real-world applications. However, fine-tuning these models at scale, especially in federated settings where data privacy and communication efficiency are critical, presents significant challenges. Existing approaches often resort to parameter-efficient fine-tuning (PEFT) to mitigate communication overhead, but this typically comes at the cost of model accuracy. To this end, we propose federated full-parameter tuning at scale for LLMs (Ferret), the first first-order method with shared randomness to enable scalable full-parameter tuning of LLMs across decentralized data sources while maintaining competitive model accuracy. Ferret accomplishes this through three aspects: (1) it employs widely used first-order methods for efficient local updates; (2) it projects these updates into a low-dimensional space to considerably reduce communication overhead; and (3) it reconstructs local updates from this low-dimensional space with shared randomness to facilitate effective full-parameter global aggregation, ensuring fast convergence and competitive final performance. Our rigorous theoretical analyses and insights, along with extensive experiments, show that Ferret significantly enhances the scalability of existing federated full-parameter tuning approaches by achieving high computational efficiency, reduced communication overhead, and fast convergence, all while maintaining competitive model accuracy.
Paperid:784
Authors:Matteo Gasparin · Aaditya Ramdas
Abstract:
Vovk (2015) introduced cross-conformal prediction, a modification of split conformal prediction designed to improve the width of prediction sets. The method, when trained with a miscoverage rate equal to $\alpha$ and $n \gg K$, ensures a marginal coverage of at least $1 - 2\alpha - 2(1-\alpha)(K-1)/(n+K)$, where $n$ is the number of observations and $K$ denotes the number of folds. A simple modification of the method achieves coverage of at least $1-2\alpha$. In this work, we propose new variants of both methods that yield smaller prediction sets without compromising the latter theoretical guarantee. The proposed methods are based on recent results deriving more statistically efficient combinations of p-values that leverage exchangeability and randomization. Simulations confirm the theoretical findings and bring out some important tradeoffs.
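The quoted coverage bound is easy to tabulate; a minimal helper (the formula is taken verbatim from the abstract):

```python
def cross_conformal_coverage(alpha, n, K):
    """Marginal coverage lower bound for cross-conformal prediction:
    1 - 2*alpha - 2*(1 - alpha)*(K - 1)/(n + K),
    where n is the number of observations and K the number of folds."""
    return 1 - 2 * alpha - 2 * (1 - alpha) * (K - 1) / (n + K)
```

For example, with `alpha = 0.05`, `n = 1000`, and `K = 10` the bound is about 0.883; as `n` grows (with `K` fixed) the penalty term vanishes and the bound approaches `1 - 2*alpha`, and with `K = 1` it equals `1 - 2*alpha` exactly.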
Paperid:785
Authors:Ahmad Rashid · Ruotian Wu · Rongqi Fan · Hongliang Li · Agustinus Kristiadi · Pascal Poupart
Abstract:
Reward-guided text generation (RGTG) has emerged as a viable alternative to offline reinforcement learning from human feedback (RLHF). RGTG methods can align baseline language models to human preferences without the further training required by RLHF methods such as PPO and DPO. However, they rely on a reward model to score each candidate token generated by the language model at inference, incurring significant overhead. Additionally, the reward model is trained to score full sequences only, which can lead to suboptimal choices for partial sequences. In this work, we present a novel reward model architecture which is trained, using a Bradley-Terry loss, to prefer the optimal expansion of a sequence with just a single call to the reward model. That is, a score for all possible candidate tokens is generated simultaneously, leading to efficient inference. We theoretically analyze RGTG reward models and demonstrate that baseline reward models prefer sub-optimal sequences compared to our method during inference. Empirically, our reward model leads to significantly faster inference than other RGTG methods, with fewer calls to the reward model and competitive performance compared to both RGTG and RLHF.
Paperid:786
Authors:Dongzhe Zheng · Wenjie Mei
Abstract:
Learning unknown dynamics under environmental (or external) constraints is fundamental to many fields (e.g., modern robotics), and is particularly challenging when constraint information is only locally available and uncertain. Existing approaches requiring global constraints or using probabilistic filtering fail to fully exploit the geometric structure inherent in local measurements (obtained, e.g., from sensors) and constraints. This paper presents a geometric framework unifying measurements, constraints, and dynamics learning through a fiber bundle structure over the state space. This naturally induced geometric structure enables measurement-aware Control Barrier Functions that adapt to local sensing (or measurement) conditions. By integrating Neural ODEs, our framework learns continuous-time dynamics while preserving geometric constraints, with theoretical guarantees of learning convergence and constraint satisfaction dependent on sensing quality. The geometric framework not only enables efficient dynamics learning but also suggests promising directions for integration with reinforcement learning approaches. Extensive simulations demonstrate significant improvements in both learning efficiency and constraint satisfaction over traditional methods, especially under limited and uncertain sensing conditions.
Paperid:787
Authors:Kei Sen Fong · Mehul Motani
Abstract:
Symbolic Regression (SR) algorithms select expressions based on prediction performance (i.e., R-squared) while also keeping the expression lengths short to produce more explainable white-box models. In this context, SR algorithms can be evaluated by measuring the extent to which the expressions discovered are Pareto-optimal, in the sense of having the best R-squared score for the expression length. This evaluation is most commonly done based on relative performance, in the sense that an SR algorithm is judged on whether it Pareto-dominates other SR algorithms selected in the analysis, without any indication of efficiency or achievable limits. In this paper, we explore absolute Pareto-optimal solutions instead, which have the optimal tradeoff between the multiple SR objectives (i.e., R-squared and expression size), by conducting an empirical investigation that exhaustively searches expressions of selected sizes. Specifically, we find an absolute Pareto-optimal (APO) front of expressions for 34 real-world datasets from the widely-used SR benchmark, SRBench, by performing exhaustive search. Additionally, we utilize eight different numerical optimization methods and analyze the performance differences to evaluate the choice of numerical optimization methods in state-of-the-art SR algorithms. We extract, for every real-world dataset, an APO front of expressions that can serve as a universal baseline for SR algorithms, informing researchers of the best achievable performance for selected sizes. The APO front we provide serves as a performance limit for SR algorithms and is made publicly available to reduce the computational burden of SR benchmarking.
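Extracting a Pareto front over the two objectives (expression size to minimize, R-squared to maximize) is a standard sweep; a minimal sketch with made-up (size, R-squared) pairs, not data from the paper:

```python
def pareto_front(candidates):
    """Given (expression_size, r_squared) pairs, return the Pareto front:
    candidates for which no other candidate has size <= and R^2 >=
    (with at least one strict). Sort by size ascending, R^2 descending,
    then keep each point whose R^2 beats the best seen so far."""
    front, best_r2 = [], float("-inf")
    for size, r2 in sorted(candidates, key=lambda c: (c[0], -c[1])):
        if r2 > best_r2:
            front.append((size, r2))
            best_r2 = r2
    return front
```

Sorting by descending R-squared within each size ensures that, of several expressions with the same length, only the best-fitting one survives; points like `(3, 0.55)` below are dropped because a shorter expression already fits better.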
Paperid:788
Authors:Qiyu Zhong · Yi Shan · Haobo Wang · Zhen Yang · Gengyu Lyu
Abstract:
In multi-view multi-label classification (MVML), each object has multiple heterogeneous views and is annotated with multiple labels. The key to dealing with such problems lies in capturing cross-view consistent correlations while excavating multi-label semantic relationships. Existing MVML methods usually employ two independent components to address these separately, ignoring their potential interactions. To address this issue, we propose a novel Tensorized MVML method named TMvML, which formulates an MVML tensor classifier to excavate comprehensive cross-view feature correlations while characterizing complete multi-label semantic relationships. Specifically, we first reconstruct the MVML mapping matrices as an MVML tensor classifier. Then, we rotate the tensor classifier and introduce a low-rank tensor constraint to ensure view-level feature consistency and label-level semantic co-occurrence simultaneously. To better characterize the low-rank tensor structure, we design a new Laplace Tensor Rank (LTR), which serves as a tighter surrogate of tensor rank to capture high-order fiber correlations within the tensor space. By conducting the above operations, our method can easily address the two key challenges in MVML via a concise LTR tensor classifier and achieve the extraction of both cross-view consistent correlations and multi-label semantic relationships simultaneously. Extensive experiments demonstrate that TMvML significantly outperforms state-of-the-art methods.
Paperid:789
Authors:Dongchi Huang · Kaige Zhang · Jiaqi WANG · Yang Li · Tianle Zhang · Chunhe Xia
Abstract:
Partial observability presents a significant challenge for safe reinforcement learning, as it impedes the identification of potential risks and rewards. Leveraging specific types of privileged information during training to mitigate the effects of partial observability has yielded notable empirical successes. In this paper, we propose Asymmetric Constrained Partially Observable Markov Decision Processes (ACPOMDPs) to theoretically examine the advantages of incorporating privileged information. Building upon ACPOMDPs, we propose the Privileged Information Guided Dreamer, a model-based safe reinforcement learning approach that leverages privileged information to enhance the agent's safety and performance through privileged representation alignment and an asymmetric actor-critic structure. Our empirical results on the Safety-Gymnasium benchmark demonstrate that our approach significantly outperforms existing methods in terms of safety and task-centric performance. Meanwhile, compared to alternative privileged model-based reinforcement learning methods, our approach exhibits superior performance and ease of training.
Paperid:790
Authors:Yuchen Guan · Runxi Cheng · Kang Liu · Chun Yuan
Abstract:
Knowledge distillation typically employs the Kullback-Leibler (KL) divergence to constrain the output of the student model to precisely match the soft labels provided by the teacher model. However, the optimization process of KL divergence is challenging for the student and prone to suboptimal points. Also, we demonstrate that the gradients provided by KL divergence depend on channel scale and thus tend to overlook low-probability channels. The mismatch in low-probability channels also results in the neglect of inter-class relationship information, making it difficult for the student to further enhance performance. To address this issue, we propose an auxiliary ranking loss based on Kendall’s $\tau$ coefficient, which can be used plug-and-play in any logit-based distillation method, providing inter-class relationship information and balancing the attention to low-probability channels. We show that the proposed ranking loss is less affected by channel scale, and its optimization objective is consistent with that of KL divergence. Extensive experiments on CIFAR-100, ImageNet, and COCO datasets, as well as various CNN and ViT teacher-student architecture combinations, demonstrate that the proposed ranking loss can be applied plug-and-play on various baselines and enhance their performance.
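The ranking loss above is built on Kendall's $\tau$, which scores how consistently two logit vectors order the channels. A minimal sketch of the plain (non-differentiable) coefficient, with assumed function names; a training loss would use a smooth surrogate:

```python
from itertools import combinations

def kendall_tau(teacher_logits, student_logits):
    """Kendall's tau rank correlation between two logit vectors.

    Counts concordant vs. discordant channel pairs: a pair (i, j) is
    concordant when teacher and student order the two channels the
    same way. tau = (concordant - discordant) / total_pairs.
    """
    n = len(teacher_logits)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        t = teacher_logits[i] - teacher_logits[j]
        s = student_logits[i] - student_logits[j]
        if t * s > 0:
            concordant += 1
        elif t * s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A student matching the teacher's channel ordering exactly scores $\tau = 1$; a fully reversed ordering scores $\tau = -1$, regardless of channel scale.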
Paperid:791
Authors:Zexu Sun · Qiyu Han · Hao Yang · Anpeng Wu · Minqin Zhu · Dugang Liu · Chen Ma · Yunpeng Weng · Xing Tang · xiuqiang He
Abstract:
In online platforms, incentives (\textit{e.g.}, discounts, coupons) are used to boost user engagement and revenue. Uplift modeling methods are developed to estimate user responses from observational data, often incorporating distribution balancing to address selection bias. However, these methods are limited to in-distribution testing data, which mirrors the training data distribution. In reality, user features change continuously due to time, geography, and other factors, especially on complex online marketing platforms. Thus, an effective uplift modeling method for out-of-distribution data is crucial. To address this, we propose a novel uplift modeling method, \textbf{I}nvariant \textbf{D}eep \textbf{U}plift \textbf{M}odeling (\textbf{IDUM}), which uses invariant learning to enhance out-of-distribution generalization by identifying causal factors that remain consistent across domains. IDUM further refines these features into necessary and sufficient factors and employs a masking component to reduce computational costs by selecting the most informative invariant features. A balancing discrepancy component is also introduced to mitigate selection bias in observational data. We conduct extensive experiments on public and real-world datasets to demonstrate IDUM's effectiveness in both in-distribution and out-of-distribution scenarios in online marketing. Furthermore, we also provide theoretical analysis and related proofs to support IDUM's generalizability.
Paperid:792
Authors:Lifu Liu · Shiyuan He · Jianhua Guo
Abstract:
Efficiently counting Markov equivalent directed acyclic graphs (DAGs) is crucial in graphical causal analysis. Wienöbst et al. (2023) introduced a polynomial-time algorithm, known as the clique-picking algorithm, to count Markov equivalent DAGs for a given completed partially directed acyclic graph (CPDAG). This algorithm iteratively selects a root clique, determines fixed orientations with outgoing edges from the clique, and generates the unresolved undirected connected components (UCCGs). In this work, we propose a more efficient approach to UCCG generation by utilizing previously computed results for different root cliques. Our method introduces the concept of super cliques within rooted clique trees, enabling their efficient transfer between trees with different root cliques. The proposed algorithm effectively reduces the computational complexity of the clique-picking method, particularly when the number of cliques is substantially smaller than the number of vertices and edges.
Paperid:793
Authors:Xuchuang Wang · Qirun Zeng · Jinhang Zuo · Xutong Liu · Mohammad Hajiesmaili · John C. S. Lui · Adam Wierman
Abstract:
This paper investigates the fusion of absolute (reward) and relative (dueling) feedback in stochastic bandits, where both feedback types are gathered in each decision round. We derive a regret lower bound, demonstrating that an efficient algorithm may incur only the smaller of the reward-based and dueling-based regret for each individual arm. We propose two fusion approaches: (1) a simple elimination fusion algorithm that leverages both feedback types to explore all arms and unifies collected information by sharing a common candidate arm set, and (2) a decomposition fusion algorithm that selects the more effective feedback to explore the corresponding arms and randomly assigns one feedback type for exploration and the other for exploitation in each round. The elimination fusion suffers a suboptimal multiplicative term in the number of arms in its regret due to the intrinsic suboptimality of dueling elimination. In contrast, the decomposition fusion achieves regret matching the lower bound up to a constant under a common assumption. Extensive experiments confirm the efficacy of our algorithms and theoretical results.
Paperid:794
Authors:Jiayu Liu · Zhenya Huang · Wei Dai · Cheng Cheng · Jinze Wu · Jing Sha · Song Li · Qi Liu · Shijin Wang · Enhong Chen
Abstract:
Although large language models (LLMs) show promise in solving complex mathematical tasks, existing evaluation paradigms rely solely on a coarse measure of overall answer accuracy, which is insufficient for assessing their authentic capabilities. In this paper, we propose \textbf{CogMath}, which comprehensively assesses LLMs' mathematical abilities through the lens of human cognition. Specifically, inspired by psychological theories, CogMath formalizes the human reasoning process into 3 stages: \emph{problem comprehension}, \emph{problem solving}, and \emph{solution summarization}. To carry out a scientific evaluation within each stage, we investigate perspectives such as numerical calculation, knowledge, and counterfactuals to design a total of 9 fine-grained evaluation dimensions. In each dimension, we develop an ``\emph{Inquiry}-\emph{Judge}-\emph{Reference}'' multi-agent system to generate inquiries that assess LLMs' mastery from this dimension. An LLM is considered to truly master a problem only when excelling in all inquiries from the 9 dimensions. By applying CogMath to three benchmarks that cover the full K-12 math curriculum, we reveal that the mathematical capabilities of 7 mainstream LLMs are overestimated by 30\%-40\%. Moreover, we locate their strengths and weaknesses across specific stages/dimensions, offering in-depth insights to further enhance their reasoning abilities.
Paperid:795
Authors:Xiaoyan Hu · Ho-fung Leung · Farzan Farnia
Abstract:
Selecting a sample generation scheme from multiple text-based generative models is typically addressed by choosing the model that maximizes an averaged evaluation score. However, this score-based selection overlooks the possibility that different models achieve the best generation performance for different types of text prompts. An online identification of the best generation model for various input prompts can reduce the costs associated with querying sub-optimal models. In this work, we explore the possibility of varying rankings of text-based generative models for different text prompts and propose an online learning framework to predict the best data generation model for a given input prompt. The proposed {\UCBName} algorithm addresses a contextual bandit (CB) setting with shared context variables across the arms, utilizing the generated data to update kernel-based functions that predict the score of each model available for unseen text prompts. Additionally, we leverage random Fourier features (RFF) to accelerate the online learning process of {\UCBName} and establish a $\widetilde{\mathcal{O}}(\sqrt{T})$ regret bound for the proposed RFF-based CB algorithm over $T$ iterations. Our numerical experiments on real and simulated text-to-image and image-to-text generative models show that RFF-UCB performs successfully in identifying the best generation model across different sample types.
Paperid:796
Authors:Steve Azzolin · SAGAR MALHOTRA · Andrea Passerini · Stefano Teso
Abstract:
Self-Explainable Graph Neural Networks (SE-GNNs) are popular explainable-by-design GNNs, but the properties and the limitations of their explanations are not well understood. Our first contribution fills this gap by formalizing the explanations extracted by SE-GNNs, referred to as Trivial Explanations (TEs), and comparing them to established notions of explanations, namely Prime Implicant (PI) and faithful explanations. Our analysis reveals that TEs match PI explanations for a restricted but significant family of tasks. In general, however, they can be less informative than PI explanations and are surprisingly misaligned with widely accepted notions of faithfulness. Although faithful and PI explanations are informative, they are intractable to find, and we show that they can be prohibitively large. Motivated by this, we propose Dual-Channel GNNs that integrate a white-box rule extractor and a standard SE-GNN, adaptively combining both channels when the task benefits. Our experiments show that even a simple instantiation of Dual-Channel GNNs can recover succinct rules and perform on par with or better than widely used SE-GNNs. Our code can be found in the supplementary material.
Paperid:797
Authors:Shan Zhang · Aotian Chen · Yanpeng Sun · Jindong Gu · Yi-Yu Zheng · Piotr Koniusz · Kai Zou · Anton Hengel · Yuan Xue
Abstract:
Mathematical diagrams have a distinctive structure, which standard feature transforms designed for natural images (e.g., CLIP) fail to process effectively, limiting their utility in multimodal large language models (MLLMs). Current efforts to improve MLLMs have primarily focused on scaling mathematical visual instruction datasets and strengthening LLM backbones, yet fine-grained visual recognition errors remain unaddressed. Our systematic evaluation of the visual grounding capabilities of state-of-the-art MLLMs highlights that fine-grained visual understanding remains a crucial bottleneck in visual mathematical reasoning (GPT-4o exhibits a 70\% grounding error rate, and correcting these errors improves reasoning accuracy by 12\%). We thus propose a novel approach featuring a geometrically-grounded vision encoder and a feature router that dynamically selects between hierarchical visual feature maps. Our model accurately recognizes visual primitives and generates precise visual prompts aligned with the language model's reasoning needs. In experiments, PRIMITIVE-Qwen2.5-7B outperforms other 7B models by 12\% on MathVerse and is on par with GPT-4V on MathVista. Our findings highlight the need for better fine-grained visual integration in MLLMs. We are preparing to release the model code and weights.
Paperid:798
Authors:Chiqiang Liu · Dazi Li
Abstract:
Cooperative multi-agent reinforcement learning faces significant challenges in effectively organizing agent relationships and facilitating information exchange, particularly when agents need to adapt their coordination patterns dynamically. This paper presents a novel framework that integrates dynamic spectral clustering with hypergraph neural networks to enable adaptive group formation and efficient information processing in multi-agent systems. The proposed framework dynamically constructs and updates hypergraph structures through spectral clustering on agents' state histories, enabling higher-order relationships to emerge naturally from agent interactions. The hypergraph structure is enhanced with attention mechanisms for selective information processing, providing an expressive and efficient way to model complex agent relationships. This architecture can be implemented in both value-based and policy-based paradigms through a unified objective combining task performance with structural regularization. Extensive experiments on challenging cooperative tasks demonstrate that our method significantly outperforms state-of-the-art approaches in both sample efficiency and final performance.
Paperid:799
Authors:Yucheng Li · Huiqiang Jiang · Chengruidong Zhang · Qianhui Wu · Xufang Luo · Surin Ahn · Amir Abdi · Dongsheng Li · Jianfeng Gao · Yuqing Yang · Lili Qiu
Abstract:
The integration of long-context capabilities with visual understanding opens up new possibilities for Vision Language Models (VLMs). However, the quadratic attention complexity during the pre-filling stage remains a major bottleneck, restricting wide deployment in real-world applications. To address this, we propose MAPSparse (Modality-Aware Permutation Sparse Attention), a dynamic sparse attention method that accelerates the pre-filling stage for long-context multi-modal inputs. First, our analysis reveals that the temporal and spatial locality of video input leads to a unique sparse pattern, the Grid pattern. Simultaneously, VLMs exhibit markedly different sparse distributions across different modalities. We introduce a permutation-based method to leverage the unique Grid pattern and handle the modality-boundary issue. By searching offline for the optimal sparse pattern for each head, MAPSparse constructs the sparse distribution dynamically based on the input. We also provide optimized GPU kernels for efficient sparse computations. Notably, MAPSparse integrates seamlessly into existing VLM pipelines without any model modifications or fine-tuning. Experiments on multi-modal benchmarks—including Video QA, Captioning, Vision-NIAH, and Mix-Modality-NIAH—with state-of-the-art long-context VLMs (LongVila and Llava-Video) show that MAPSparse accelerates the pre-filling stage by up to 8.3x at 1M tokens while maintaining competitive performance.
Paperid:800
Authors:Edward Pearce-Crump
Abstract:
Incorporating permutation equivariance into neural networks has proven to be useful in ensuring that models respect symmetries that exist in data. Symmetric tensors, which naturally appear in statistics, machine learning, and graph theory, are essential for many applications in physics, chemistry, and materials science, amongst others. However, existing research on permutation-equivariant models has not explored symmetric tensors as inputs, and most prior work on learning from these tensors has focused on equivariance to Euclidean groups. In this paper, we present two different characterisations of all linear permutation equivariant functions between symmetric power spaces of $\mathbb{R}^{n}$. We show on two tasks that these functions are highly data efficient compared to standard MLPs and have the potential to generalise well to symmetric tensors of different sizes.
Paperid:801
Authors:Yifang Chen · Xiaoyu Li · Yingyiu Liang · Zhenmei Shi · Zhao Song
Abstract:
We investigate the fundamental limits of transformer-based foundation models, extending our analysis to include Visual Autoregressive (VAR) transformers. Our primary contribution establishes that single-head VAR transformers with a single self-attention layer and a single interpolation layer are universal. From the statistical perspective, we prove that such simple VAR transformers are universal approximators for any image-to-image Lipschitz function. VAR represents a big step toward generating images using a novel, scalable, coarse-to-fine ``next-scale prediction'' framework. These models set a new quality bar, outperforming all previous methods, including Diffusion Transformers, while achieving state-of-the-art performance for image synthesis tasks. Our results provide important design principles for effective and computationally efficient VAR Transformer strategies that can be used to extend their utility to more sophisticated VAR models in image generation and other related areas.
Paperid:802
Authors:Xingyue Huang · Pablo Barcelo · Michael Bronstein · Ismail Ceylan · Mikhail Galkin · Juan Reutter · Miguel Romero Orth
Abstract:
Knowledge Graph Foundation Models (KGFMs) are at the frontier for deep learning on knowledge graphs (KGs), as they can generalize to completely novel knowledge graphs with different relational vocabularies. Despite their empirical success, our theoretical understanding of KGFMs remains very limited. In this paper, we conduct a rigorous study of the expressive power of KGFMs. Specifically, we show that the expressive power of KGFMs directly depends on the motifs that are used to learn the relation representations. We then observe that the most typical motifs used in the existing literature are binary, as the representations are learned based on how pairs of relations interact, which limits the model's expressiveness. As part of our study, we design more expressive KGFMs using richer motifs, which necessitate learning relation representations based on, e.g., how triples of relations interact with each other. Finally, we empirically validate our theoretical findings, showing that the use of richer motifs results in better performance on a wide range of datasets drawn from different domains.
Paperid:803
Authors:Idan Achituve · Idit Diamant · Hai Victor Habi · Amir Rosenfeld · Arnon Netzer · Ethan Fetaya
Abstract:
In image processing, solving inverse problems is the task of finding plausible reconstructions of an image that was corrupted by some (usually known) degradation model. Commonly, this process is done using a generative image model that can guide the reconstruction towards solutions that appear natural. The success of diffusion models over the last few years has made them a leading candidate for this task. However, the sequential nature of diffusion models makes this conditional sampling process challenging. Furthermore, since diffusion models are often defined in the latent space of an autoencoder, the encoder-decoder transformations introduce additional difficulties. Here, we suggest a novel sampling method based on sequential Monte Carlo (SMC) in the latent space of diffusion models. We use the forward process of the diffusion model to add additional auxiliary observations and then perform an SMC sampling as part of the backward process. Empirical evaluations on ImageNet and FFHQ show the benefits of our approach over competing methods on various inverse problem tasks.
Paperid:804
Authors:Siddhant Agarwal · Harshit Sikchi · Peter Stone · Amy Zhang
Abstract:
Having explored an environment, intelligent agents should be able to transfer their knowledge to most downstream tasks within that environment without additional interactions. Referred to as "zero-shot learning," this ability remains elusive for general-purpose reinforcement learning algorithms. While recent works have attempted to produce zero-shot RL agents, they make assumptions about the nature of the tasks or the structure of the MDP. We present Proto Successor Measure: the basis set for all possible behaviors of a Reinforcement Learning Agent in a dynamical system. We prove that any possible behavior (represented using visitation distributions) can be represented using an affine combination of these policy-independent basis functions. Given a reward function at test time, we simply need to find the right set of linear weights to combine these bases corresponding to the optimal policy. We derive a practical algorithm to learn these basis functions using reward-free interaction data from the environment and show that our approach can produce the optimal policy at test time for any given reward function without additional environmental interactions.
Paperid:805
Authors:Jingxiang Qu · Wenhan Gao · Jiaxing Zhang · Xufeng Liu · Hua Wei · Haibin Ling · Yi Liu
Abstract:
3D Geometric Graph Neural Networks (GNNs) have emerged as transformative tools for modeling molecular data. Despite their predictive power, these models often suffer from limited interpretability, raising concerns for scientific applications that require reliable and transparent insights. While existing methods have primarily focused on explaining molecular substructures in 2D GNNs, the transition to 3D GNNs introduces unique challenges, such as handling the implicit dense edge structures created by a cutoff radius. To tackle this, we introduce a novel explanation method specifically designed for 3D GNNs, which localizes the explanation to the immediate neighborhood of each node within the 3D space. Each node is assigned a radius of influence, defining the localized region within which message passing captures spatial and structural interactions crucial for the model's predictions. This method leverages the spatial and geometric characteristics inherent in 3D graphs. By constraining the subgraph to a localized radius of influence, the approach not only enhances interpretability but also aligns with the physical and structural dependencies typical of 3D graph applications, such as molecular learning.
Paperid:806
Authors:Zichen Liu · Guoji Fu · Chao Du · Wee Sun Lee · Min Lin
Abstract:
Continual reinforcement learning (CRL) refers to a naturalistic setting where an agent needs to endlessly evolve, by trial and error, to solve multiple tasks that are presented sequentially. One of the largest obstacles to CRL is that the agent may forget how to solve previous tasks when learning a new task, known as catastrophic forgetting. In this paper, we propose to address this challenge by planning with online world models. Specifically, we learn a Follow-The-Leader shallow model online to capture the world dynamics, in which we plan using model predictive control to solve a set of tasks specified by any reward functions. The online world model is immune to forgetting by construction, with a proven regret bound of $\mathcal{O}(\sqrt{K^2D\log(T)})$ under mild assumptions. The planner searches actions solely based on the latest online model, thus forming a FTL Online Agent (OA) that updates incrementally. To assess OA, we further design Continual Bench, a dedicated environment for CRL, and compare with several strong baselines under the same model-planning algorithmic framework. The empirical results show that OA learns continuously to solve new tasks while not forgetting old skills, outperforming agents built on deep world models with various continual learning techniques.
Paperid:807
Authors:Xiangyu Qu · Guojing Liu · Liang Li
Abstract:
Non-autoregressive Transformers (NATs) have garnered significant attention due to their efficient decoding compared to autoregressive methods. However, existing conditional dependency modeling schemes based on masked language modeling introduce a training-inference gap in NATs. For instance, while NATs sample target words during training to enhance input, this condition cannot be met during inference, and simply annealing the sampling rate to zero during training leads to model performance degradation. We demonstrate that this training-inference gap prevents NATs from fully realizing their potential. To address this, we propose an adaptive end-to-end quantization alignment training framework, which introduces a semantic consistency space to adaptively align NAT training, eliminating the need for target information and thereby bridging the training-inference gap. Experimental results demonstrate that our method achieves state-of-the-art performance among fully NAT models on major WMT benchmarks, delivering performance on par with autoregressive Transformer (AT) while being 17.0 times more efficient in inference.
Paperid:808
Authors:Soumya Basu
Abstract:
We study bandit learning in matching markets with two-sided reward uncertainty, extending prior research primarily focused on single-sided uncertainty. Leveraging the concept of `super-stability' from \citet{irving1994stable}, we demonstrate the advantage of the Extended Gale-Shapley (GS) algorithm over the standard GS algorithm in achieving true stable matchings under incomplete information. By employing the Extended GS algorithm, our centralized algorithm attains a logarithmic pessimal stable regret dependent on an instance-dependent admissible gap parameter. This algorithm is further adapted to a decentralized setting with a constant regret increase. Finally, we establish a novel centralized instance-dependent lower bound for binary stable regret, elucidating the roles of the admissible gap and super-stable matching in characterizing the complexity of stable matching with bandit feedback.
Paperid:809
Authors:Bhavna Gopal · Huanrui Yang · Jingyang Zhang · Mark Horton · Yiran Chen
Abstract:
Adversarial training (AT) enhances neural network robustness. Typically, AT updates all trainable parameters, but can lead to overfitting and increased errors on clean data. Research suggests that fine-tuning specific parameters may be more effective; however, methods for identifying these essential parameters and establishing effective optimization objectives remain inadequately addressed. We present CLAT, an innovative adversarial fine-tuning algorithm that mitigates adversarial overfitting by integrating "criticality" into the training process. Instead of tuning the entire model, CLAT identifies and fine-tunes fewer parameters in robustness-critical layers—those predominantly learning non-robust features—while keeping the rest of the model fixed. Additionally, CLAT employs a dynamic layer selection process that adapts to changes in layer criticality during training. Empirical results demonstrate that CLAT can be seamlessly integrated with existing adversarial training methods, enhancing clean accuracy and adversarial robustness by over 2% compared to baseline approaches.
Paperid:810
Authors:Cheryl Li · Tianyuan Xu · Yiwen Guo
Abstract:
Chain-of-Thought (CoT) prompting has shown promise in enhancing the reasoning capabilities of large language models (LLMs) by generating natural language (NL) rationales that lead to the final answer. However, it struggles with numerical computation, which has led to the development of program-aided techniques. Despite their potential, a persistent challenge remains: inconsistencies between LLM-reported reasoning steps and the logic in generated programs, which we term ``reasoning hallucinations.'' This stems from the inherent ambiguities of NL and the statistical nature of LLMs, which often lack rigorous logical coherence. To address this challenge, we propose a novel test-time scaling framework, Reasoning-as-Logic-Units (RaLU), which constructs a more reliable reasoning path by aligning logical units between the generated program and their corresponding NL descriptions. By decomposing the initially generated program into discrete units using static analysis, RaLU engages in an iterative dialogue with the LLM to judge, refine, and explain each unit. A rewind-and-correct mechanism ensures alignment between code statements and task requirements in each unit, ultimately forming a cohesive reasoning path under the program's logic, from which the model reaches a final solution. Our experiments demonstrate that RaLU significantly outperforms existing baselines in mathematical reasoning (GSM8K, MATH) and algorithmic reasoning (HumanEval+, MBPP+), underscoring its potential to advance LLM reasoning and programming by offering enhanced accuracy and interpretability.
Paperid:811
Authors:Mehrsa Pourya · Erich Kobler · Michael Unser · Sebastian Neumayer
Abstract:
State-of-the-art image reconstruction often relies on complex, highly parameterized deep architectures. We propose an alternative: a data-driven reconstruction method inspired by the classic Tikhonov regularization. Our approach iteratively refines intermediate reconstructions by solving a sequence of quadratic problems. These updates have two key components: (i) learned filters to extract salient image features, and (ii) an attention mechanism that locally adjusts the penalty of filter responses. Our method achieves performance on par with leading plug-and-play and learned regularizer approaches while offering interpretability, robustness, and convergent behavior. In effect, we bridge traditional regularization and deep learning with a principled reconstruction approach.
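Each refinement step above solves a quadratic problem that balances data fidelity against a penalized filter response. A scalar toy sketch of one such step, with all operators reduced to scalars (names and the gradient-descent solver are assumptions for illustration; the actual method uses learned filters and attention-weighted penalties):

```python
def tikhonov_step(x, y, a=1.0, w=1.0, penalty=1.0, step=0.1, n_grad=200):
    """Approximately solve the 1-D quadratic subproblem
        min_x (a*x - y)**2 + penalty * (w*x)**2
    by gradient descent. Here a stands in for a forward operator,
    w for a learned filter, and `penalty` for a locally adjusted
    weight on the filter response.
    """
    for _ in range(n_grad):
        grad = 2 * a * (a * x - y) + 2 * penalty * w * (w * x)
        x -= step * grad
    return x
```

With `a = w = penalty = 1`, the subproblem has the classic Tikhonov closed form `x* = y / 2`, which the iteration converges to.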
Paperid:812
Authors:Hung Manh Pham · Aaqib Saeed · Dong Ma
Abstract:
The accurate interpretation of Electrocardiogram (ECG) signals is pivotal for diagnosing cardiovascular diseases. Integrating ECG signals with their accompanying textual reports holds immense potential to enhance clinical diagnostics through the combination of physiological data and qualitative insights. However, this integration faces significant challenges due to inherent modality disparities and the scarcity of labeled data for robust cross-modal learning. To address these obstacles, we propose D-BETA, a novel framework that pre-trains ECG and text data using a contrastive masked auto-encoder architecture. D-BETA uniquely combines generative strengths with enhanced discriminative capabilities to achieve robust cross-modal representations. This is accomplished through masked modality modeling, specialized loss functions, and an improved negative sampling strategy tailored for cross-modal alignment. Extensive experiments on five public datasets across diverse downstream tasks demonstrate that D-BETA significantly outperforms existing methods, achieving an average AUC improvement of 15\% in linear probing with only one percent of training data and 2\% in zero-shot performance without requiring training data, over state-of-the-art models. These results highlight the effectiveness of D-BETA, underscoring its potential to advance automated clinical diagnostics through multi-modal representations.
Paperid:813
Authors:Pedro Cisneros-Velarde · Bhavesh Shrimali · Arindam Banerjee
Abstract:
Neural Operators that directly learn mappings between function spaces, such as Deep Operator Networks (DONs) and Fourier Neural Operators (FNOs), have received considerable attention. Despite the universal approximation guarantees for DONs and FNOs, there is currently no optimization convergence guarantee for learning such networks using gradient descent (GD). In this paper, we address this open problem by presenting a unified framework for optimization based on GD and applying it to establish convergence guarantees for both DONs and FNOs. In particular, we show that the losses associated with both of these neural operators satisfy two conditions—restricted strong convexity (RSC) and smoothness—that guarantee a decrease in their loss values under GD. Remarkably, these two conditions are satisfied by each neural operator for different reasons, stemming from the architectural differences of the respective models. One takeaway that emerges from the theory is that wider networks should lead to better optimization convergence for both DONs and FNOs. We present empirical results on canonical operator learning problems to support our theoretical results.
Paperid:814
Authors:Lusen Zhao · Zihan Huang · Ding Jianhao · Zhaofei Yu
Abstract:
ANN-to-SNN conversion has emerged as a key approach to training Spiking Neural Networks (SNNs), particularly for Transformer architectures, as it maps pre-trained ANN parameters to SNN equivalents without requiring retraining, thereby preserving ANN accuracy while eliminating training costs. Among the various coding methods used in ANN-to-SNN conversion, time-to-first-spike (TTFS) coding, which allows each neuron to fire at most one spike, offers significantly lower energy consumption. However, while previous TTFS-based SNNs have achieved performance comparable to convolutional ANNs, the attention mechanism and nonlinear layers in Transformer architectures remain a challenge for existing SNNs with TTFS coding. This paper proposes a new neuron structure for TTFS coding that expands its representational range and enhances its capability to process nonlinear functions, along with detailed designs of nonlinear neurons for the different layers in Transformers. Experimental results on different models demonstrate that our proposed method can achieve high accuracy with significantly lower energy consumption. To the best of our knowledge, this is the first work to focus on converting Transformers to SNNs with TTFS coding.
Paperid:815
Authors:Guozheng Ma · Lu Li · Zilin Wang · Li Shen · Pierre-Luc Bacon · Dacheng Tao
Abstract:
Effectively scaling up deep reinforcement learning models has proven notoriously difficult due to network pathologies during training, motivating various targeted interventions such as periodic resets and architectural advances such as layer normalization. Instead of pursuing more complex modifications, we show that introducing static network sparsity alone can unlock further scaling potential beyond that of dense counterparts with state-of-the-art architectures. This is achieved through simple one-shot random pruning, where a predetermined percentage of network weights are randomly removed once before training. Our analysis reveals that, in contrast to naively scaling up dense DRL networks, such sparse networks achieve both higher parameter efficiency for network expressivity and stronger resistance to optimization challenges like plasticity loss and gradient interference. We further extend our evaluation to visual and streaming RL scenarios, demonstrating the consistent benefits of network sparsity.
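The one-shot random pruning step described in this abstract is simple enough to sketch. The following is our illustrative reading of it, not the authors' code; function and variable names are ours, and the toy "network" is just two weight matrices:

```python
import numpy as np

def one_shot_random_prune(weights, sparsity, seed=0):
    """Remove a fixed fraction of weights uniformly at random,
    once, before training begins; the masks stay fixed thereafter."""
    rng = np.random.default_rng(seed)
    masks = {}
    for name, w in weights.items():
        # Keep each weight independently with probability (1 - sparsity).
        mask = rng.random(w.shape) >= sparsity
        masks[name] = mask
        w *= mask  # pruned weights are zeroed and would stay zero during training
    return masks

# Toy "network": two dense layers initialized to ones for visibility.
weights = {"layer1": np.ones((64, 64)), "layer2": np.ones((64, 10))}
masks = one_shot_random_prune(weights, sparsity=0.9)
density = weights["layer1"].mean()  # roughly 0.1 of the weights survive
```

In an actual DRL training loop the mask would be re-applied (or gradients masked) after every optimizer step so the pruned connections never revive.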
Paperid:816
Authors:Dang Nguyen · Zeman Li · MohammadHossein Bateni · Vahab Mirrokni · Meisam Razaviyayn · Baharan Mirzasoleiman
Abstract:
Synthetic data has the potential to improve performance and training efficiency while preserving the privacy of real training examples. Nevertheless, existing approaches for synthetic text generation are mostly heuristic and cannot generate human-readable text without compromising the privacy of real data, or provide performance guarantees for training Large Language Models (LLMs). In this work, we propose the first theoretically rigorous approach for generating synthetic human-readable text that guarantees the convergence and performance of LLMs during fine-tuning on a target task. To do so, we leverage the Alternating Direction Method of Multipliers (ADMM), which iteratively optimizes the embeddings of synthetic examples to match the gradient of the target training or validation data, and maps them to a sequence of text tokens with low perplexity. In doing so, the generated synthetic text can guarantee convergence of the model to a close neighborhood of the solution obtained by fine-tuning on real data. At the same time, the generated text is guaranteed to be different from the real data. Experiments on various classification tasks confirm the effectiveness of our proposed approach.
Paperid:817
Authors:Lianbo Ma · Jianlun Ma · Yuee Zhou · Guoyang Xie · Qiang He · Zhichao Lu
Abstract:
Mixed Precision Quantization (MPQ) has become an essential technique for optimizing neural networks by determining the optimal bitwidth per layer. Existing MPQ methods, however, face a major hurdle: they require a computationally expensive search for quantization strategies on large-scale datasets. To resolve this issue, we introduce a novel approach that first searches for quantization strategies on small datasets and then generalizes them to large-scale datasets. This approach simplifies the process, eliminating the need for large-scale quantization fine-tuning and only necessitating model weight adjustment. Our method is characterized by three key techniques: sharpness-aware minimization for enhanced quantized model generalization, implicit gradient direction alignment to handle gradient conflicts among different optimization objectives, and an adaptive perturbation radius to accelerate optimization. It offers advantages such as no intricate computation of feature maps and high search efficiency. Both theoretical analysis and experimental results validate our approach. Using the CIFAR-10 dataset (just 0.5\% the size of the ImageNet training data) for MPQ policy search, we achieved equivalent accuracy on ImageNet with a significantly lower computational cost, while improving efficiency by up to 150\% over the baselines. Codes are available at https://anonymous.4open.science/r/ASGA-main-2221/.
Paperid:818
Authors:Puning Yang · Qizhou Wang · Zhuo Huang · Tongliang Liu · Chengqi Zhang · Bo Han
Abstract:
Loss reweighting has shown significant benefits for machine unlearning with large language models (LLMs). However, its exact functionality is left unclear and the optimal strategy remains an open question, thus impeding the understanding and improvement of existing methodologies. In this paper, we identify two distinct goals of loss reweighting, namely Saturation and Importance: the former indicates that insufficiently optimized data should be emphasized, while the latter stresses critical data that are most influential for loss minimization. To study their usefulness, we design specific reweighting strategies for each goal and evaluate their respective effects on unlearning. We conduct extensive empirical analyses on well-established benchmarks, and summarize some important observations as follows: (i) Saturation enhances efficacy more than importance-based reweighting, and their combination can yield additional improvements. (ii) Saturation typically allocates lower weights to data with lower likelihoods, whereas importance-based reweighting does the opposite. (iii) The efficacy of unlearning is also largely influenced by the smoothness and granularity of the weight distributions. Based on these findings, we propose SatImp, a simple reweighting method that combines the advantages of both saturation and importance. Empirical results on extensive datasets validate the efficacy of our method, potentially bridging existing research gaps and indicating directions for future research.
Paperid:819
Authors:Clara Fannjiang · Ji Won Park
Abstract:
Algorithms for machine learning-guided design, or design algorithms, use machine learning-based predictions to propose novel objects with desired property values. Given a new design task—for example, to design novel proteins with high binding affinity to a therapeutic target—one must choose a design algorithm and specify any hyperparameters and predictive and/or generative models involved. How can these decisions be made such that the resulting designs are successful? This paper proposes a method for design algorithm selection, which aims to select design algorithms that will produce a distribution of design labels satisfying a user-specified success criterion—for example, that at least ten percent of designs' labels exceed a threshold. It does so by combining designs' predicted property values with held-out labeled data to reliably forecast characteristics of the label distributions produced by different design algorithms, building upon prediction-powered inference techniques (Angelopoulos et al., 2023). The method is guaranteed with high probability to return design algorithms that yield successful label distributions (or the null set if none exist), if the density ratios between the design and labeled data distributions are known. We demonstrate the method's effectiveness in simulated protein and RNA design tasks, in settings with both known and estimated density ratios.
Paperid:820
Authors:Junyi Lu · Lili Jiang · Xiaojia Li · Jianbing Fang · Fengjun Zhang · Li Yang · Chun Zuo
Abstract:
The complexity of code reviews has driven efforts to automate review comments, but prior approaches oversimplify this task by treating it as snippet-level code-to-text generation and relying on text similarity metrics like BLEU for evaluation. These methods overlook repository context, real-world merge request evaluation, and defect detection, limiting their practicality. To address these issues, we explore the full automation pipeline within the online recommendation service of a company with nearly 400 million daily active users, analyzing industry-grade codebases comprising hundreds of thousands of lines of code. We identify four key challenges: 1) capturing relevant context, 2) improving key bug inclusion (KBI), 3) reducing false alarm rates (FAR), and 4) integrating human workflows. To tackle these, we propose 1) code slicing algorithms for context extraction, 2) a multi-agent LLM framework for KBI, 3) a filtering mechanism for FAR reduction, and 4) a novel prompt design for better human interaction. Our approach, validated on real-world merge requests from historical fault reports, achieves a 2× improvement over standard LLMs and a 10× gain over previous baselines.
Paperid:821
Authors:Qilin Liao · Shuo Yang · Bo Zhao · Ping Luo · Hengshuang Zhao
Abstract:
Harnessing the power of diffusion models to synthesize auxiliary training data based on latent space features has proven effective in enhancing out-of-distribution (OOD) detection performance. However, extracting effective features outside the in-distribution (ID) boundary in latent space remains challenging due to the difficulty of identifying decision boundaries between classes. This paper proposes a novel framework called Boundary-based Out-Of-Distribution data generation (BOOD), which synthesizes high-quality OOD features and generates human-compatible outlier images using diffusion models. BOOD first learns a text-conditioned latent feature space from the ID dataset, selects ID features closest to the decision boundary, and perturbs them to cross the decision boundary to form OOD features. These synthetic OOD features are then decoded into images in pixel space by a diffusion model. Compared to previous works, BOOD provides a more efficient strategy for synthesizing informative OOD features, facilitating clearer distinctions between ID and OOD data. Extensive experimental results on common benchmarks demonstrate that BOOD surpasses the state-of-the-art method significantly, achieving a 29.64\% decrease in average FPR95 (40.31\% vs. 10.67\%) and a 7.27\% improvement in average AUROC (90.15\% vs. 97.42\%) on the CIFAR-100 dataset.
Paperid:822
Authors:Zhonghao He · Tianyi Qiu · Tejasveer Chugh · Max Kleiman-Weiner
Abstract:
The training and deployment of large language models (LLMs) induce a feedback loop: models continually learn human beliefs from data, reinforce user beliefs with generated content, reabsorb those reinforced beliefs, and again feed them back to users, creating dynamics resembling an echo chamber. We articulate the hypothesis that this feedback loop with LLMs entrenches the existing values and factual beliefs of users, leading to diversity loss and potentially the lock-in of ideas. Prompted by observations of diversity loss in real-world ChatGPT usage data, we study the lock-in hypothesis through data mining, agent-based LLM simulation, and formal modeling.
Paperid:823
Authors:Wanyun Xie · Francesco Tonin · Volkan Cevher
Abstract:
Training data mixtures greatly impact the generalization performance of large language models. Existing domain reweighting methods often rely on costly weight computations and require retraining when new data is introduced. To this end, we introduce a flexible and efficient data mixing framework, Chameleon, that employs leverage scores to quantify domain importance within a learned embedding space. We first construct a domain affinity matrix over domain embeddings. The induced leverage scores determine a mixture that upweights domains sharing common representations in embedding space. This formulation allows direct transfer to new data by computing the new domain embeddings. In experiments, we demonstrate improvements over three key scenarios: (i) our computed weights improve performance on pretraining domains with a fraction of the compute of existing methods; (ii) Chameleon can adapt to data changes without proxy retraining, boosting few-shot reasoning accuracies when transferred to new data; (iii) our method enables efficient domain reweighting in fine-tuning, consistently improving test perplexity on all fine-tuning domains over a uniform mixture.
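Leverage scores over a set of embeddings are a standard linear-algebra object. As a minimal sketch of the kind of quantity involved (our reading only: the abstract's affinity-matrix construction and learned embedding space are not reproduced here, and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.standard_normal((5, 3))  # one embedding row per training domain

# Leverage score of domain i: the i-th diagonal entry of the hat matrix
# E (E^T E)^{-1} E^T; it measures how hard that row is to express
# as a combination of the others.
H = E @ np.linalg.inv(E.T @ E) @ E.T
scores = np.diag(H)

# Normalize into a sampling mixture over domains.
mixture = scores / scores.sum()
```

The scores lie in [0, 1] and sum to the rank of the embedding matrix, so the normalization above always yields a valid mixture.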
Paperid:824
Authors:Josh Givens · Song Liu · Henry Reeve
Abstract:
Score matching is a vital tool for learning the distribution of data with applications across many areas including diffusion processes, energy-based modelling, and graphical model estimation. Despite all these applications, little work explores its use when data is incomplete. We address this by adapting score matching (and its major extensions) to work with missing data in a flexible setting where data can be partially missing over any subset of the coordinates. We provide two separate score matching variations for general use, an importance weighting (IW) approach, and a variational approach. We provide finite sample bounds for our IW approach in finite domain settings and show it to have especially strong performance in small sample lower dimensional cases. Complementing this, we show our variational approach to be strongest in more complex high-dimensional settings which we demonstrate on graphical model estimation tasks with both real and simulated data.
Paperid:825
Authors:Pedram Pad · Hadi Hammoud · Mohamad Dia · nadim maamari · Liza Dunbar
Abstract:
Feature selection is a critical step in data-driven applications, reducing input dimensionality to enhance learning accuracy, computational efficiency, and interpretability. Existing state-of-the-art methods often require post-selection retraining and extensive hyperparameter tuning, complicating their adoption. We introduce a novel, non-intrusive feature selection layer that, given a target feature count $k$, automatically identifies and selects the $k$ most informative features during neural network training. Our method is uniquely simple, requiring no alterations to the loss function, network architecture, or post-selection retraining. The layer is mathematically elegant and can be fully described by $$\tilde{x}_i = a_i x_i + (1-a_i)z_i,$$ where $x_i$ is the input feature, $\tilde{x}_i$ the output, $z_i$ Gaussian noise, and $a_i$ a trainable gain such that $\sum_i{a_i^2}=k$. This formulation induces an automatic clustering effect, driving $k$ of the $a_i$ gains to $1$ (selecting informative features) and the rest to $0$ (discarding redundant ones) via weighted noise distortion and gain normalization. Despite its extreme simplicity, our method delivers state-of-the-art performance on standard benchmark datasets and a novel real-world dataset, outperforming or matching existing approaches without requiring a hyperparameter search for $k$ or retraining. Theoretical analysis in the context of linear regression further validates its efficacy. Our work demonstrates that simplicity and performance are not mutually exclusive, offering a powerful yet straightforward tool for feature selection in machine learning.
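The layer equation quoted in this abstract admits a short numerical sketch. Everything beyond the formula itself (function names, the renormalization placement, the initialization) is our assumption, not the authors' implementation:

```python
import numpy as np

def noisy_gate_layer(x, a, k, rng):
    """Forward pass of the described layer:
    x_tilde_i = a_i * x_i + (1 - a_i) * z_i, with sum_i a_i^2 = k."""
    # Renormalize the trainable gains so the constraint sum(a_i^2) = k holds.
    a = a * np.sqrt(k / np.sum(a ** 2))
    z = rng.standard_normal(x.shape)  # Gaussian noise replaces gated-out inputs
    return a * x + (1.0 - a) * z, a

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
a = np.full(8, 0.5)  # initial gains, before any training
x_tilde, a_norm = noisy_gate_layer(x, a, k=3, rng=rng)
constraint = np.sum(a_norm ** 2)  # equals k = 3 up to float error
```

During training, gradients through the noise term would push gains toward 1 for informative features and toward 0 for the rest; the sketch only shows the forward computation and the gain normalization.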
Paperid:826
Authors:Zhicong Huang · Zhicong Huang · Xinwen Hou · Cheng Hong
Abstract:
As Diffusion Models (DMs) generate increasingly realistic images, related issues such as copyright and misuse have become a growing concern. Watermarking is one of the promising solutions. Existing methods inject the watermark into the single domain of the initial Gaussian noise for generation, which suffers from unsatisfactory robustness. This paper presents the first dual-domain DM watermarking approach, using a pipelined injector to consistently embed watermarks in both the spatial and frequency domains. To further boost robustness against certain image manipulations and advanced attacks, we introduce a model-independent learnable Gaussian Noise Restorer (GNR) to refine Gaussian noise extracted from manipulated images and enhance detection robustness by integrating the detection scores of both watermarks. GaussMarker efficiently achieves state-of-the-art performance under eight image distortions and four advanced attacks across three versions of Stable Diffusion with better recall and lower false positive rates, as preferred in real applications.
Paperid:827
Authors:Zhaoxuan Kan · Jianan Mu · Husheng Han · shangyi shi · Tenghui Hua · Hang Lu · Xiaowei Li · Xing Hu
Abstract:
Graph Convolutional Neural Networks (GCNs) have gained widespread popularity in various fields like personal healthcare and financial systems, due to their remarkable performance. Despite the growing demand for cloud-based GCN services, privacy concerns over sensitive graph data remain significant. Homomorphic Encryption (HE) facilitates Privacy-Preserving Machine Learning (PPML) by allowing computations to be performed on encrypted data. However, HE introduces substantial computational overhead, particularly for GCN operations that require rotations and multiplications in matrix products. The sparsity of GCNs offers significant performance potential, but their irregularity introduces additional operations that reduce practical gains. In this paper, we propose FicGCN, a HE-based framework specifically designed to harness the sparse characteristics of GCNs and strike a globally optimal balance between aggregation and combination operations. FicGCN employs a latency-aware packing scheme, a Sparse Intra-Ciphertext Aggregation (SpIntra-CA) method to minimize rotation overhead, and a region-based data reordering driven by local adjacency structure. We evaluated FicGCN on several popular datasets, and the results show that FicGCN achieved the best performance across all tested datasets, with up to a $4.10\times$ improvement over the latest design.
Paperid:828
Authors:Tong Wu · Junzhe Shen · Zixia Jia · Yuxuan Wang · Zilong Zheng
Abstract:
Generating ultra-long sequences with large language models (LLMs) has become increasingly crucial but remains a highly time-intensive task, particularly for sequences up to 100K tokens. While traditional speculative decoding methods exist, simply extending their generation limits fails to accelerate the process and can be detrimental. Through an in-depth analysis, we identify three major challenges hindering efficient generation: frequent model reloading, dynamic key-value (KV) management and repetitive generation. To address these issues, we introduce TokenSwift, a novel framework designed to substantially accelerate the generation process of ultra-long sequences while maintaining the target model's inherent quality. Experimental results demonstrate that TokenSwift achieves over $\mathbf{3\times}$ speedup across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TokenSwift as a scalable and effective solution at unprecedented lengths.
Paperid:829
Authors:Rui Ai · Boxiang Lyu · Zhaoran Wang · Zhuoran Yang · Haifeng Xu
Abstract:
We develop a framework for capturing the instrumental value of data production processes, which accounts for two key factors: (a) the context of the agent's decision-making; (b) how much data or information the buyer already possesses. We "micro-found" our data valuation function by establishing its connection to classic notions of signals and information design in economics. When instantiated in Bayesian linear regression, our value naturally corresponds to information gain. Applying our proposed data value for monopoly pricing, we show that if the seller can fully customize data production, she can extract the first-best revenue (i.e., full surplus) from any population of buyers, i.e., achieving first-degree price discrimination. If data can only be constructed from an existing data pool, this limits the seller's ability to customize, and achieving first-best revenue becomes generally impossible. However, we design a mechanism that achieves seller revenue at most $\log (\kappa)$ less than the first-best, where $\kappa$ is the condition number associated with the data matrix. As a corollary, the seller extracts the first-best revenue in the multi-armed bandits special case.
Paperid:830
Authors:Yu Li · Felix Dangel · Derek Tam · Colin Raffel
Abstract:
The diagonal of a model's Fisher Information Matrix (the "Fisher") has frequently been used as a way to measure parameter sensitivity. Typically, the Fisher is estimated by computing the squared gradient of the model's outputs with respect to its parameters, averaged over a few hundred or thousand examples — a process which incurs nontrivial computational costs. At the same time, adaptive gradient methods like the ubiquitous Adam optimizer compute a moving average of the squared gradient over the course of training. This paper therefore explores whether an approximation of the Fisher can be obtained "for free" by recycling the squared-gradient accumulator that has already been computed over the course of training. Through a comprehensive set of experiments covering five applications of the Fisher, we demonstrate that the "Squisher" (Squared-gradient accumulator as an approximation of the Fisher) consistently performs similarly to the Fisher while outperforming baseline methods. Additionally, we clarify the exact differences between the Squisher and the Fisher and provide empirical quantification of their respective impact.
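The two quantities being compared here can be sketched in a few lines, assuming only Adam's standard second-moment bookkeeping (the function names and the toy gradient stream are ours; the paper's actual models and tasks are not reproduced):

```python
import numpy as np

def adam_second_moment(grads, beta2=0.999):
    """Exponential moving average of squared gradients, as Adam
    maintains in its second-moment state, with bias correction."""
    v = np.zeros_like(grads[0])
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g ** 2
    return v / (1.0 - beta2 ** len(grads))

def empirical_fisher_diag(grads):
    """Plain average of per-example squared gradients: the usual
    estimate of the Fisher diagonal."""
    return np.mean(np.stack(grads) ** 2, axis=0)

rng = np.random.default_rng(0)
grads = [rng.standard_normal(5) for _ in range(2000)]
squisher = adam_second_moment(grads)   # "free" byproduct of training
fisher = empirical_fisher_diag(grads)  # dedicated extra pass
```

For a stationary gradient stream like this toy one, the two estimates track each other closely; the paper's point is that the accumulator version costs nothing extra because Adam already maintains it.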
Paperid:831
Authors:Shuyuan Wang · Philip D. Loewen · Michael Forbes · Bhushan Gopaluni · Wei Pan
Abstract:
Differentiable control promises end-to-end differentiability and adaptability, effectively combining the advantages of both model-free and model-based control approaches. However, the iterative Linear Quadratic Regulator (iLQR), despite being a powerful nonlinear controller, has seen limited research efforts dedicated to enhancing its differentiability. The scalability of differentiating through extended iterations and horizons poses significant challenges, hindering iLQR from being an effective differentiable controller. This paper introduces a framework that facilitates differentiation through iLQR, allowing it to serve as a trainable and differentiable module, either as or within a neural network. A novel aspect of this framework is the analytical solution that it provides for the gradient of an iLQR controller through implicit differentiation, which ensures a constant backward cost regardless of iteration, while producing an accurate gradient. We evaluate our framework on imitation tasks on standard control benchmarks. Our analytical method demonstrates superior computational performance, achieving up to \textbf{128x speedup} and a minimum of \textbf{21x speedup} compared to automatic differentiation. Our method also demonstrates superior learning performance (\textbf{$\mathbf{10^6}$x}) compared to traditional neural network policies and better model loss with differentiable controllers that lack exact analytical gradients. Furthermore, we integrate our module into a larger network with visual inputs to demonstrate the capacity of our method for high-dimensional, fully end-to-end tasks.
Paperid:832
Authors:Tuan Dam · Kishan Panaganti · Brahim Driss · Adam Wierman
Abstract:
Monte Carlo Tree Search (MCTS) is a powerful framework for solving complex decision-making problems, yet it often relies on the assumption that the simulator and the real-world dynamics are identical. Although this assumption helps achieve the success of MCTS in games like Chess, Go, and Shogi, real-world scenarios incur ambiguity due to their modeling mismatches in low-fidelity simulators. In this work, we present a new robust variant of MCTS that mitigates dynamical model ambiguities. Our algorithm addresses transition dynamics and reward distribution ambiguities to bridge the gap between simulation-based planning and real-world deployment. We incorporate a robust power mean backup operator and carefully designed exploration bonuses to ensure finite-sample convergence at every node in the search tree. We show that our algorithm achieves a convergence rate of $\mathcal{O}(n^{-1/2})$ for the value estimation at the root node, comparable to that of standard MCTS. Finally, we provide empirical evidence that our method achieves robust performance in planning problems even under significant ambiguity in the underlying dynamics.
Paperid:833
Authors:Bart Bussmann · Noa Nabeshima · Adam Karvonen · Neel Nanda
Abstract:
Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting neural networks by extracting the concepts represented in their activations. However, choosing the size of the SAE dictionary (i.e., the number of learned concepts) creates a tension: as dictionary size increases to capture more relevant concepts, sparsity incentivizes features to be split or absorbed into more specific features, leaving high-level features missing or warped. We introduce Matryoshka SAEs, a novel variant that addresses these issues by simultaneously training multiple nested dictionaries of increasing size, forcing the smaller dictionaries to independently reconstruct the inputs without using the larger dictionaries. This organizes features hierarchically: the smaller dictionaries learn general concepts, while the larger dictionaries learn more specific concepts, without an incentive to absorb the high-level features. We train Matryoshka SAEs on Gemma-2-2B and TinyStories and find superior performance on sparse probing and targeted concept erasure tasks, more disentangled concept representations, and reduced feature absorption. While there is a minor tradeoff with reconstruction performance, we believe Matryoshka SAEs are a superior alternative for practical tasks, as they enable training arbitrarily large SAEs while retaining interpretable features at different levels of abstraction.
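The nested-dictionary objective described above can be sketched as a sum of reconstruction losses, one per dictionary prefix. This is a toy reading with our own names and a random linear encoder/decoder pair, not the authors' SAE training code:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_atoms = 16, 8
D = rng.standard_normal((n_atoms, d))  # decoder dictionary (rows = atoms)
W = rng.standard_normal((n_atoms, d))  # encoder weights

def encode(x):
    return np.maximum(W @ x, 0.0)  # ReLU gives nonnegative (sparse-ish) codes

def matryoshka_loss(x, prefix_sizes):
    """Each nested prefix of the dictionary must reconstruct x on its
    own; the per-prefix reconstruction errors are summed."""
    c = encode(x)
    loss = 0.0
    for m in prefix_sizes:  # e.g. [2, 4, 8]: small dicts nested in larger ones
        x_hat = c[:m] @ D[:m]  # decode using only the first m atoms
        loss += np.mean((x - x_hat) ** 2)
    return loss

x = rng.standard_normal(d)
loss = matryoshka_loss(x, prefix_sizes=[2, 4, 8])
```

Because the loss includes terms for the small prefixes, the first atoms are pressured to carry general structure on their own, which is the hierarchy the abstract describes.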
Paperid:834
Authors:Abdel-Rahim Mezidi · Jordan Patracone · Saverio Salzo · Amaury Habrard · Massimiliano Pontil · Rémi Emonet · Marc Sebban
Abstract:
We present several advances on neural operators by viewing the action of operator layers as the minimizers of Bregman regularized optimization problems over Banach function spaces. The proposed framework allows interpreting the activation operators as Bregman proximity operators from dual to primal space. This novel viewpoint is general enough to recover classical neural operators as well as a new variant, coined Bregman neural operators, which includes the inverse activation operator and features the same expressivity as standard neural operators. Numerical experiments support the added benefits of the Bregman variant of Fourier neural operators for training deeper and more accurate models.
Paperid:835
Authors:Roman Abramov · Felix Steinbauer · Gjergji Kasneci
Abstract:
Transformers have achieved great success in numerous NLP tasks but continue to exhibit notable gaps in multi-step factual reasoning, especially when real-world knowledge is sparse. Recent advances in grokking have demonstrated that neural networks can transition from memorizing to perfectly generalizing once they detect underlying logical patterns -- yet these studies have primarily used small, synthetic tasks. In this paper, for the first time, we extend grokking to real-world factual data and address the challenge of dataset sparsity by augmenting existing knowledge graphs with carefully designed synthetic data to raise the ratio $\phi_r$ of inferred facts to atomic facts above the threshold required for grokking. Surprisingly, we find that even factually incorrect synthetic data can strengthen emergent reasoning circuits rather than degrade accuracy, as it forces the model to rely on relational structure rather than memorization. When evaluated on multi-hop reasoning benchmarks, our approach achieves up to 95--100\% accuracy on 2WikiMultiHopQA -- substantially improving over strong baselines and matching or exceeding current state-of-the-art results. We further provide an in-depth analysis of how increasing $\phi_r$ drives the formation of generalizing circuits inside Transformers. Our findings suggest that grokking-based data augmentation can unlock implicit multi-hop reasoning capabilities, opening the door to more robust and interpretable factual reasoning in large-scale language models.
Paperid:836
Authors:Jiao Huang · Qianli Xing · Jinglong Ji · Bo Yang
Abstract:
Graph neural networks have recently demonstrated remarkable performance in predicting material properties. Crystalline material data is manually encoded into graph representations, which can then be processed by these networks. Existing methods incorporate different attributes into constructing representations to satisfy the constraints arising from symmetries of the material structure. However, existing methods for obtaining graph representations are specific to certain constraints, which is ineffective when facing new constraints. In this work, we propose a code generation framework with multiple large language model agents to obtain representations, named Rep-CodeGen, with three iterative stages simulating an evolutionary algorithm. As the codes generated by large language models are often of low quality when the task is difficult, Rep-CodeGen incorporates the evaluation and evolution of codes, which can significantly improve the quality of generated codes. To the best of our knowledge, Rep-CodeGen is the first framework for automatically generating codes to obtain representations that can be used when facing new constraints. Furthermore, a type of representation from codes generated by our framework satisfies six constraints, with codes satisfying three constraints as bases. Extensive experiments on two real-world material datasets show that a property prediction method based on such graph representations achieves state-of-the-art performance in material property prediction tasks.
Paperid:837
Authors:Vinod Raman · Hilal Asi · Kunal Talwar
Abstract:
We design new differentially private algorithms for the problems of adversarial bandits and bandits with expert advice. For adversarial bandits, we give a simple and efficient conversion of any non-private bandit algorithm to a private bandit algorithm. Instantiating our conversion with existing non-private bandit algorithms gives a regret upper bound of $O\left(\frac{\sqrt{KT}}{\sqrt{\epsilon}}\right)$, improving upon the existing upper bound $O\left(\frac{\sqrt{KT \log(KT)}}{\epsilon}\right)$ for all $\epsilon \leq 1$. In particular, our algorithms allow for sublinear expected regret even when $\epsilon \leq \frac{1}{\sqrt{T}}$, establishing the first known separation between central and local differential privacy for this problem. For bandits with expert advice, we give the first differentially private algorithms, with expected regret $O\left(\frac{\sqrt{NT}}{\sqrt{\epsilon}}\right)$, $O\left(\frac{\sqrt{KT\log(N)}\log(KT)}{\epsilon}\right)$, and $\tilde{O}\left(\frac{N^{1/6}K^{1/2}T^{2/3}\log(NT)}{\epsilon ^{1/3}} + \frac{N^{1/2}\log(NT)}{\epsilon}\right)$, where $K$ and $N$ are the number of actions and experts respectively. These rates allow us to get sublinear regret for different combinations of small and large $K$, $N$, and $\epsilon$.
Paperid:838
Authors:Jianshu Zhang · Xuankun Rong · Kun He · Mang Ye
Abstract:
Generative replay (GR) has been extensively validated in continual learning as a mechanism to synthesize data and replay past knowledge to mitigate forgetting. By leveraging synthetic rather than real data for the replay, GR has been adopted in some federated continual learning (FCL) approaches to ensure the privacy of client-side data. While existing GR-based FCL approaches have introduced improvements, none of their enhancements specifically take into account the unique characteristics of federated learning settings. Beyond privacy constraints, what other fundamental aspects of federated learning should be explored in the context of FCL? In this work, we explore the potential benefits that come from emphasizing the role of clients throughout the process. We begin by highlighting two key observations: (a) Client Expertise Superiority, where clients, rather than the server, act as domain experts, and (b) Client Forgetting Variance, where heterogeneous data distributions across clients lead to varying levels of forgetting. Building on these insights, we propose CAN (Clients As Navigators), highlighting the pivotal role of clients in both data synthesis and data replay. Extensive evaluations demonstrate that this client-centric approach achieves state-of-the-art performance. Notably, it requires a smaller buffer size, reducing storage overhead and enhancing computational efficiency.
Paperid:839
Authors:Binghui Li · Yuanzhi Li
Abstract:
Similar to the surprising generalization performance of standard deep learning, deep networks trained by adversarial training also generalize well to unseen clean data (natural data). However, although adversarial training can achieve low robust training error, a significant robust generalization gap remains. We call this phenomenon Clean Generalization and Robust Overfitting (CGRO). In this work, we study the CGRO phenomenon in adversarial training from two views: representation complexity and training dynamics. Specifically, we consider a binary classification setting with $N$ separated training data points. First, we prove that, under the assumption that there exists a $\operatorname{poly}(D)$-size clean classifier (where $D$ is the data dimension), a ReLU net with only $O(N D)$ extra parameters is able to leverage robust memorization to achieve CGRO, while a robust classifier still requires exponential representation complexity in the worst case. Next, we focus on a structured-data case to analyze training dynamics, where we train a two-layer convolutional network with $O(N D)$ width against adversarial perturbation. We then show that a three-stage phase transition occurs during the learning process and the network provably converges to a robust memorization regime, which thereby results in CGRO. Finally, we empirically verify our theoretical analysis with experiments on real image recognition datasets.
Paperid:840
Authors:Fangchen Yu · Yanzhen Chen · Jiaxing Wei · Jianfeng Mao · Wenye Li · Qiang Sun
Abstract:
The Wasserstein distance is a widely used metric for measuring differences between distributions, but its super-cubic time complexity introduces substantial computational burdens. To mitigate this, the tree-Wasserstein distance (TWD) provides a linear-time approximation by leveraging a tree structure. However, existing TWD methods often compromise accuracy due to suboptimal tree structures and inadequately tuned edge weights. To address these limitations, we propose UltraTWD, a novel framework of unsupervised TWD methods that simultaneously optimize ultrametric trees and edge weights. UltraTWD focuses on constructing a rooted tree with an ultrametric that closely approximates the cost matrix. To achieve this, we develop algorithms based on minimum spanning trees, iterative projection, and gradient descent, ensuring high accuracy and strong performance in downstream tasks. Extensive experiments on document retrieval, ranking, and classification highlight its effectiveness, positioning UltraTWD as a practical solution for Wasserstein-distance-based applications.
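The linear-time evaluation that all TWD methods share is easy to sketch: once a rooted tree and its edge weights are fixed, the distance is a single bottom-up pass accumulating mass differences over subtrees. This is a generic sketch of the standard TWD formula that UltraTWD builds on, not the UltraTWD optimization itself; the parent-array tree encoding is an illustrative assumption.

```python
# Linear-time tree-Wasserstein distance:
#   TWD(mu, nu) = sum over edges e of  w_e * |mu(T_e) - nu(T_e)|,
# where T_e is the subtree hanging below edge e. The tree is given as a
# parent array rooted at node 0; weight[i] is the weight of the edge
# from node i up to its parent. Assumes parent[i] < i (BFS labelling),
# so a reverse sweep visits children before parents.

def tree_wasserstein(parent, weight, mu, nu):
    n = len(parent)
    diff = [mu[i] - nu[i] for i in range(n)]  # net mass at each node
    total = 0.0
    for i in range(n - 1, 0, -1):             # skip the root (node 0)
        total += weight[i] * abs(diff[i])     # imbalance crossing edge i
        diff[parent[i]] += diff[i]            # push subtree mass upward
    return total
```

Each edge contributes its weight times the net mass imbalance of the subtree below it, which is why the whole computation is linear in the number of nodes; the quality of the approximation then rests entirely on how the tree and weights are chosen, which is the part UltraTWD optimizes.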
Paperid:841
Authors:Mike Ranzinger · Greg Heinrich · Pavlo Molchanov · Bryan Catanzaro · Andrew Tao
Abstract:
The feature maps of vision encoders are fundamental to myriad modern AI tasks, ranging from core perception algorithms (e.g. semantic segmentation, object detection, depth perception, etc.) to modern multimodal understanding in vision-language models (VLMs). Currently, in computer vision, the frontier of general-purpose vision backbones is the Vision Transformer (ViT), typically trained using a contrastive loss (e.g. CLIP). A key problem with most off-the-shelf ViTs, particularly CLIP, is that these models operate at inflexibly low resolution. Most run at 224x224px, while the "high resolution" versions are around 378-448px, and remain inflexible. We introduce a novel method to coherently and cheaply upsample the feature maps of low-res vision encoders while picking up on fine-grained details that would otherwise be lost due to resolution. We demonstrate the effectiveness of this approach on core perception tasks as well as within agglomerative model (RADIO) training as a way of providing richer targets for distillation.
Paperid:842
Authors:Li Ding · Hao Zhang · Wenrui Dai · Chenglin Li · Weijia Lu · ZHIFEI YANG · xiaodong Zhang · Xiaofeng Ma · Junni Zou · Hongkai Xiong
Abstract:
Federated learning (FL) is greatly challenged by the communication bottleneck and limited computation on clients. Existing compression methods for FL cannot simultaneously reduce the uplink and downlink communication cost and mitigate the computation burden on clients. To address this problem, in this paper, we propose the first low-bit integerized federated learning (LBI-FL) framework that quantizes the weights, activations, and gradients to lower than INT8 precision to markedly reduce the communication and computational costs. Specifically, we achieve dynamic temporal bit-width allocation for weights, activations, and gradients along the training trajectory via reinforcement learning. An agent is trained to determine bit-width allocation by taking the current bit-width, training stage, and quantization loss as the state. The agent, efficiently trained on small-scale datasets, generalizes well to training varying network architectures on non-independent and identically distributed datasets. Furthermore, we prove that federated learning with gradient quantization achieves an equivalent convergence rate to FedAvg. The proposed LBI-FL can reduce the communication costs by 8 times compared to full-precision FL. Extensive experiments show that the proposed LBI-FL achieves a reduction of more than 50% BitOPs per client on average with less than 2% accuracy loss compared to low-bit training with INT8 precision.
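The quantize/dequantize primitive that such low-bit training builds on can be sketched in a few lines. This is a generic uniform symmetric quantizer; the per-tensor max scaling and the helper name are illustrative assumptions, not the paper's exact scheme, in which the bit-width per tensor and training stage would be chosen by the RL agent.

```python
# Uniform symmetric fake-quantization: map a float tensor to b-bit
# signed integers and back. Transmitting `q` (b bits/entry) plus one
# scale is what saves uplink/downlink bandwidth versus float32.
import numpy as np

def quantize_dequantize(x, bits):
    qmax = 2 ** (bits - 1) - 1                # 127 for INT8, 7 for INT4
    amax = np.max(np.abs(x))
    scale = amax / qmax if amax > 0 else 1.0  # per-tensor scale factor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, q.astype(np.float32) * scale    # integers + reconstruction
```

The round-trip error is at most half a quantization step (scale / 2), which is the quantity a bit-width controller trades off against the communication budget.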
Paperid:843
Authors:Masahiro Negishi · Thomas Gärtner · Pascal Welke
Abstract:
We investigate the distance function learned by message passing neural networks (MPNNs) in specific tasks, aiming to capture the functional distance between prediction targets that MPNNs implicitly learn. This contrasts with previous work, which links MPNN distances on arbitrary tasks to structural distances on graphs that ignore task-specific information. To address this gap, we distill the distance between MPNN embeddings into an interpretable graph distance. Our method uses optimal transport on the Weisfeiler Leman Labeling Tree (WILT), where the edge weights reveal subgraphs that strongly influence the distance between embeddings. This approach generalizes two well-known graph kernels and can be computed in linear time. Through extensive experiments, we demonstrate that MPNNs define the relative position of embeddings by focusing on a small set of subgraphs that are known to be functionally important in the domain.
Paperid:844
Authors:Joel Daniel Andersson · Boel Nelson · Rasmus Pagh · Lukas Retschmeier
Abstract:
Koufogiannis et al. (TPDP '15) showed a *gradual release* result for Laplace noise-based differentially private mechanisms: given an $\varepsilon$-DP release, a new release with privacy parameter $\varepsilon' > \varepsilon$ can be computed such that the combined privacy loss of both releases is $\varepsilon'$ and the distribution of the latter is the same as a single release with parameter $\varepsilon'$. Whitehouse et al. (NeurIPS '22) presented gradual release techniques for Gaussian noise. In this paper, we consider a more general *multiple release* setting in which an analyst is able to create private releases with different privacy parameters corresponding to different access/trust levels. These releases are determined one by one, with privacy parameters in arbitrary order. A release is *lossless* if having access to a subset $S$ of the releases has the same privacy guarantee as the least private release in $S$, and each release has the same distribution as a single release with the same privacy parameter. Our main focus is on lossless multiple release mechanisms based on Gaussian noise. We give a simple method for creating multiple releases with a short, self-contained analysis that does not require knowledge of the mathematics of Brownian motion. Finally, we consider how to efficiently do gradual release of sparse histograms, and present a mechanism with running time independent of the number of dimensions.
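For intuition, the simplest instance of the lossless idea for Gaussian noise, with two trust levels fixed in advance, is that the coarse release can be manufactured from the fine one by adding independent noise. Observing both then reveals no more than the fine release alone, since the coarse release is a post-processing of it, while each marginal matches a standalone release at its own noise level. This toy sketch covers only this two-level special case, not the paper's mechanism for arbitrarily ordered releases; all names are illustrative.

```python
# Two-level lossless Gaussian release of a scalar x: draw the accurate
# ("fine") release first, then derive the coarse one from it by adding
# independent noise so that Var(coarse) = sigma_coarse^2 exactly.
import random

def two_level_release(x, sigma_fine, sigma_coarse, rng=random.Random(0)):
    assert sigma_coarse > sigma_fine
    fine = x + rng.gauss(0.0, sigma_fine)
    # extra^2 + sigma_fine^2 = sigma_coarse^2, so the coarse marginal
    # is exactly a single Gaussian release at noise level sigma_coarse
    extra = (sigma_coarse ** 2 - sigma_fine ** 2) ** 0.5
    coarse = fine + rng.gauss(0.0, extra)
    return coarse, fine
```

A low-trust party receives only `coarse`; upgrading its access later means handing over `fine`, at no additional privacy cost beyond the fine release itself.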
Paperid:845
Authors:Eshika Saxena · Alberto Alfarano · Emily Wenger · Kristin Lauter
Abstract:
Recent work showed that ML-based attacks on Learning with Errors (LWE), a hard problem used in post-quantum cryptography, outperform classical algebraic attacks in certain settings. Although promising, ML attacks struggle to scale to more complex LWE settings. Prior work connected this issue to the difficulty of training ML models to do modular arithmetic, a core feature of the LWE problem. To address this, we develop techniques that significantly boost the performance of ML models on modular arithmetic tasks, enabling the models to sum up to $N=128$ elements modulo $q \le 974269$. Our core innovation is the use of custom training data distributions and a carefully designed loss function that better represents the problem structure. We apply an initial proof of concept of our techniques to LWE specifically and find that they allow recovery of 2x harder secrets than prior work. Our techniques also help ML models learn other well-studied problems better, including copy, associative recall, and parity, motivating further study.
Paperid:846
Authors:Awni Altabaa · John Lafferty
Abstract:
Relational reasoning is a central component of generally intelligent systems, enabling robust and data-efficient inductive generalization. Recent empirical evidence shows that many existing neural architectures, including Transformers, struggle with tasks requiring relational reasoning. In this work, we distinguish between two types of information: sensory information about the properties of individual objects, and relational information about the relationships between objects. While neural attention provides a powerful mechanism for controlling the flow of sensory information between objects, the Transformer lacks an explicit computational mechanism for routing and processing relational information. To address this limitation, we propose an architectural extension of the Transformer framework that we call the Dual Attention Transformer (DAT), featuring two distinct attention mechanisms: sensory attention for directing the flow of sensory information, and a novel relational attention mechanism for directing the flow of relational information. We empirically evaluate DAT on a diverse set of tasks ranging from synthetic relational benchmarks to complex real-world tasks such as language modeling and visual processing. Our results demonstrate that integrating explicit relational computational mechanisms into the Transformer architecture leads to significant performance gains in terms of data efficiency and parameter efficiency.
Paperid:847
Authors:Michael Sun · Orion Foo · Gang Liu · Wojciech Matusik · Jie Chen
Abstract:
Directed acyclic graphs (DAGs) are a class of graphs commonly used in practice, with examples that include electronic circuits, Bayesian networks, and neural architectures. While many effective encoders exist for DAGs, it remains challenging to decode them in a principled manner, because the nodes of a DAG can have many different topological orders. In this work, we propose a grammar-based approach to constructing a principled, compact and equivalent sequential representation of a DAG. Specifically, we view a graph as derivations over an unambiguous grammar, where the DAG corresponds to a unique sequence of production rules. Equivalently, the procedure to construct such a description can be viewed as a lossless compression of the data. Such a representation has many uses, including building a generative model for graph generation, learning a latent space for property prediction, and leveraging the sequence representational continuity for Bayesian Optimization over structured data.
Paperid:848
Authors:Jingfeng Wu · Peter Bartlett · Matus Telgarsky · Bin Yu
Abstract:
In overparameterized logistic regression, gradient descent (GD) iterates diverge in norm while converging in direction to the maximum $\ell_2$-margin solution, a phenomenon known as the implicit bias of GD. This work investigates additional regularization effects induced by early stopping in well-specified high-dimensional logistic regression. We first demonstrate that the excess logistic risk vanishes for early-stopped GD but diverges to infinity for GD iterates at convergence. This suggests that early-stopped GD is well-calibrated, whereas asymptotic GD is statistically inconsistent. Second, we show that to attain a small excess zero-one risk, polynomially many samples are sufficient for early-stopped GD, while exponentially many samples are necessary for any interpolating estimator, including asymptotic GD. This separation underscores the statistical benefits of early stopping in the overparameterized regime. Finally, we establish non-asymptotic bounds on the norm and angular differences between early-stopped GD and the $\ell_2$-regularized empirical risk minimizer, thereby connecting the implicit regularization of GD with explicit $\ell_2$-regularization.
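The norm divergence and directional convergence behind the implicit bias are easy to reproduce numerically: on linearly separable data, GD on the logistic loss never stops growing $\|w\|$, while $w/\|w\|$ stabilizes. The two-point dataset and step size below are arbitrary illustrative choices, not the paper's setting.

```python
# Gradient descent on the logistic loss (1/n) * sum log(1 + exp(-y w.x))
# in 2D, to illustrate norm divergence on separable data.
import math

def gd_logistic(xs, ys, steps, lr=0.5):
    w = [0.0, 0.0]
    for _ in range(steps):
        g = [0.0, 0.0]
        for x, y in zip(xs, ys):
            margin = y * (w[0] * x[0] + w[1] * x[1])
            s = -y / (1.0 + math.exp(margin))  # dloss/d(w.x) for one point
            g[0] += s * x[0]
            g[1] += s * x[1]
        w = [w[0] - lr * g[0] / len(xs), w[1] - lr * g[1] / len(xs)]
    return w
```

With the symmetric dataset used in the test below, the gradient always points along (1, 0.5), so the direction is exact from the first step while the norm keeps creeping upward roughly logarithmically; stopping the iteration early therefore acts as a genuine norm constraint.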
Paperid:849
Authors:Ayushi Mishra · Yang Bai · Priyadarshan Narayanasamy · Nakul Garg · Nirupam Roy
Abstract:
Integrating spatial context into large language models (LLMs) has the potential to revolutionize human-computer interaction, particularly in wearable devices. In this work, we present SING, a novel system architecture that incorporates spatial speech understanding into LLMs, enabling contextually aware and adaptive applications for wearable technologies. Our approach leverages microstructure-based spatial sensing to extract precise Direction of Arrival (DoA) information using a monaural microphone. To address the lack of existing datasets for microstructure-assisted speech recordings, we synthetically create a dataset called OmniTalk from the LibriSpeech dataset. This spatial information is fused with linguistic embeddings from OpenAI’s Whisper model, allowing each modality to learn complementary contextual representations. The fused embeddings are aligned with the input space of the LLaMA-3.2 3B model and fine-tuned with the lightweight adaptation technique LoRA to optimize for on-device processing. SING supports spatially aware automatic speech recognition (ASR), achieving a mean error of 25.72°, a substantial improvement over the 88.52° median error of existing work, with a word error rate (WER) of 5.3. SING also supports soundscaping, for example, inferring how many people are talking and their directions, with up to 5 people and a median DoA error of 16°. Our system demonstrates superior performance in spatial speech understanding while addressing the challenges of power efficiency, privacy, and hardware constraints, paving the way for advanced applications in augmented reality, accessibility, and immersive experiences.
Paperid:850
Authors:Shuo Wang · Shunyang Huang · Jinghui Yuan · Zhixiang Shen · zhao kang
Abstract:
Fusing heterogeneous information remains a persistent challenge in modern data analysis. While significant progress has been made, existing approaches often fail to account for the inherent heterogeneity of object patterns across different semantic spaces. To address this limitation, we propose the Cooperation of Experts (CoE) framework, which encodes multi-typed information into unified heterogeneous multiplex networks. By transcending modality and connection differences, CoE provides a powerful and flexible model for capturing the intricate structures of real-world complex data. In our framework, dedicated encoders act as domain-specific experts, each specializing in learning distinct relational patterns in specific semantic spaces. To enhance robustness and extract complementary knowledge, these experts collaborate through a novel large-margin mechanism supported by a tailored optimization strategy. Rigorous theoretical analyses guarantee the framework’s feasibility and stability, while extensive experiments across diverse benchmarks demonstrate its superior performance and broad applicability.
Paperid:851
Authors:Yuhang Li · Guoxu Zhou · Zhenhao Huang · Xinqi Chen · Yuning Qiu · Qibin Zhao
Abstract:
Class-Incremental Learning (CIL) has gained considerable attention due to its capacity to accommodate new classes during learning. Replay-based methods demonstrate state-of-the-art performance in CIL but suffer from high memory consumption, as a set of old exemplars must be stored for revisiting. To address this challenge, many memory-efficient replay methods have been developed by exploiting image compression techniques. However, the gains are often bittersweet when pixel-level compression methods are used. Here, we present a simple yet efficient approach that employs tensor decomposition to address these limitations. This method fully exploits the low intrinsic dimensionality and pixel correlation of images to achieve high compression efficiency while preserving sufficient discriminative information, significantly enhancing performance. We also introduce a hybrid exemplar selection strategy to improve the representativeness and diversity of stored exemplars. Extensive experiments across datasets with varying resolutions consistently demonstrate that our approach substantially boosts the performance of baseline methods, showcasing strong generalization and robustness.
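The storage saving behind this kind of method is easy to see in isolation: keep only the factors of a truncated decomposition instead of the raw pixels. The paper works with tensor decompositions of image tensors; the plain per-matrix truncated SVD below is a simplified stand-in with illustrative names, not the paper's method.

```python
# Store an exemplar as truncated SVD factors instead of raw pixels.
# For an m x n image kept at rank r, storage drops from m*n values
# to r*(m + n + 1).
import numpy as np

def compress(img, rank):
    U, s, Vt = np.linalg.svd(img, full_matrices=False)
    return U[:, :rank], s[:rank], Vt[:rank, :]   # the stored factors

def decompress(U, s, Vt):
    return (U * s) @ Vt                          # low-rank reconstruction
```

A genuinely low-rank image round-trips exactly; for natural images the retained singular directions keep the coarse, discriminative structure while the budget stays a fraction of the pixel count.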
Paperid:852
Authors:Mustapha Bounoua · Giulio Franzese · Pietro Michiardi
Abstract:
Multimodal data is a precious asset enabling a variety of downstream tasks in machine learning. However, real-world data collected across different modalities is often not paired, which poses a significant challenge to learning a joint distribution. A prominent approach to this modality coupling problem is Minimum Entropy Coupling (MEC), which seeks to minimize the joint entropy while satisfying constraints on the marginals. Existing approaches to the MEC problem focus on finite, discrete distributions, limiting their application to cases involving continuous data. In this work, we propose a novel method to solve the continuous MEC problem, using well-known generative diffusion models that learn to approximate and minimize the joint entropy through a cooperative scheme, while satisfying a relaxed version of the marginal constraints. We empirically demonstrate that our method, DDMEC, is general and can be easily used to address challenging tasks, including unsupervised single-cell multi-omics data alignment and unpaired image translation, outperforming specialized methods.
Paperid:853
Authors:Yipeng Zhang · Longlong Li · Kelin Xia
Abstract:
Graph Neural Networks (GNNs) have proven effective for learning from graph-structured data through their neighborhood-based message passing framework. Many hierarchical graph clustering pooling methods modify this framework by introducing clustering-based strategies, enabling the construction of more expressive and powerful models. However, all of these message passing frameworks rely heavily on the connectivity structure of graphs, limiting their ability to capture the rich geometric features inherent in geometric graphs. To address this, we propose Rhomboid Tiling (RT) clustering, a novel clustering method based on the rhomboid tiling structure, which performs clustering by leveraging the complex geometric information of the data and effectively extracts its higher-order geometric structures. Moreover, we design RTPool, a hierarchical graph clustering pooling model based on RT clustering for graph classification tasks. The proposed model demonstrates superior performance, outperforming 19 state-of-the-art competitors on all 7 benchmark datasets.
Paperid:854
Authors:Yiheng Li · Cunxin Fan · Chongjian GE · Seth Zhao · Chenran Li · Chenfeng Xu · Huaxiu Yao · Masayoshi Tomizuka · Bolei Zhou · Chen Tang · Mingyu Ding · Wei Zhan
Abstract:
Language models uncover unprecedented abilities in analyzing driving scenarios, owing to the vast knowledge accumulated from text-based pre-training. Naturally, they should particularly excel in analyzing rule-based interactions, such as those triggered by traffic laws, which are well documented in texts. However, such interaction analysis remains underexplored due to the lack of dedicated language datasets that address it. Therefore, we propose Waymo Open Motion Dataset-Reasoning (WOMD-Reasoning), a comprehensive large-scale Q&A dataset built on WOMD that focuses on describing and reasoning about traffic-rule-induced interactions in driving scenarios. WOMD-Reasoning is also by far the largest multi-modal Q&A dataset, with 3 million Q&As on real-world driving scenarios, covering a wide range of driving topics from map descriptions and motion status descriptions to narratives and analyses of agents' interactions, behaviors, and intentions. To showcase the applications of WOMD-Reasoning, we design Motion-LLaVA, a motion-language model fine-tuned on WOMD-Reasoning. Quantitative and qualitative evaluations are performed on the WOMD-Reasoning dataset as well as on the outputs of Motion-LLaVA, supporting the data quality and wide applicability of WOMD-Reasoning in interaction prediction, traffic rule compliance planning, etc. The dataset, its vision modal extension, and the codes & prompts to build it are available in the supplementary materials.
Paperid:855
Authors:Ayman Chaouki · Jesse Read · Albert Bifet
Abstract:
Decision Tree (DT) Learning is a fundamental problem in Interpretable Machine Learning, yet it poses a formidable optimisation challenge. Practical algorithms have recently emerged, primarily leveraging Dynamic Programming and Branch \& Bound. However, most of these approaches rely on a Depth-First-Search strategy, which is inefficient when searching for DTs at high depths and requires the definition of a maximum-depth hyperparameter. Best-First-Search was also employed by other methods to circumvent these issues. The downside of this strategy is its higher memory consumption; as such, it has to be designed in a fully efficient manner that takes full advantage of the problem's structure. We formulate the problem as an AND/OR graph search, which we solve with a novel AO*-type algorithm called Branches. We prove both optimality and complexity guarantees for Branches and show that it is more efficient than the state of the art both theoretically and on a variety of experiments. Furthermore, Branches supports non-binary features, unlike the other methods; we show that this property can further induce larger gains in computational efficiency.
Paperid:856
Authors:Junyou Zhu · Langzhou He · Chao Gao · Dongpeng Hou · Zhen Su · Philip Yu · Juergen Kurths · Frank Hellmann
Abstract:
Diffusion probabilistic models (DPMs) have recently demonstrated impressive generative capabilities, and emerging evidence suggests that their sample reconstruction ability can yield meaningful representations for recognition tasks. In this paper, we demonstrate that the objectives underlying generation and representation learning are not perfectly aligned. Through spectral analysis, we find that minimizing the mean squared error (MSE) between the original graph and its reconstructed counterpart does not necessarily optimize representations for downstream tasks. Instead, focusing on reconstructing a small subset of features, specifically those capturing global information, proves more effective for learning powerful representations. Motivated by these insights, we propose a novel framework, the Smooth Diffusion Model for Graphs (SDMG), which introduces a multi-scale smoothing loss and low-frequency information encoders to promote the recovery of global, low-frequency details while suppressing irrelevant high-frequency noise. Extensive experiments validate the effectiveness of our method, suggesting a promising direction for advancing diffusion models in graph representation learning. Our implementation is available at: https://anonymous.4open.science/r/SDMG-7FCD/.
Paperid:857
Authors:Lucy Xiaoyang Shi · brian ichter · Michael Equi · Liyiming Ke · Karl Pertsch · Quan Vuong · James Tanner · Anna Walling · Haohuan Wang · Niccolo Fusai · Adrian Li · Danny Driess · Lachy Groom · Sergey Levine · Chelsea Finn
Abstract:
Generalist robots that can perform a range of different tasks in open-world settings must be able to not only reason about the steps needed to accomplish their goals, but also process complex instructions, prompts, and even feedback during task execution. Intricate instructions (e.g., "Could you make me a vegetarian sandwich?" or "I don't like that one") require not just the ability to physically perform the individual steps, but the ability to situate complex commands and feedback in the physical world. In this work, we describe a system that uses vision-language models in a hierarchical structure, first reasoning over complex prompts and user feedback to deduce the most appropriate next step to fulfill the task, and then performing that step with low-level actions. In contrast to direct instruction-following methods that can fulfill simple commands ("pick up the cup"), our system can reason through complex prompts and incorporate situated feedback during task execution ("that's not trash"). We evaluate our system across three robotic platforms, including single-arm, dual-arm, and dual-arm mobile robots, demonstrating its ability to handle tasks such as cleaning messy tables, making sandwiches, and grocery shopping. Videos are available at https://hi-robot-vla.github.io/
Paperid:858
Authors:Yang Xin · Xingrun Li · Heng Chang · Yang jinze · xihong yang · Shengyu Tao · Maiko Shigeno · Ningkang Chang · Junfeng Wang · Dawei Yin · Erxue Min
Abstract:
Recommender systems are increasingly spreading to different areas like e-commerce or video streaming to alleviate information overload. One of the most fundamental methods for recommendation is Collaborative Filtering (CF), which leverages historical user-item interactions to infer user preferences. In recent years, Graph Neural Networks (GNNs) have been extensively studied to capture graph structures in CF tasks. Despite this remarkable progress, local structure modeling and embedding distortion still remain two notable limitations in the majority of GNN-based CF methods. Therefore, in this paper, we propose a novel Hyperbolic Graph Transformer architecture to tackle the long-tail problem in CF tasks. Specifically, the proposed framework comprises two essential modules: 1) Local Hyperbolic Graph Convolutional Network (LHGCN), which performs graph convolution entirely in the hyperbolic manifold and captures the local structure of each node; 2) Hyperbolic Transformer, which is composed of hyperbolic cross-attention mechanisms to capture global information. Furthermore, to make it feasible on large-scale data, we introduce an unbiased approximation of the cross-attention for linear computational complexity, with a theoretical guarantee on approximation errors. Empirical experiments demonstrate that our proposed model outperforms leading collaborative filtering methods and significantly mitigates the long-tail issue in CF tasks. Our implementations are available in \url{https://anonymous.4open.science/r/Hgformer-2AE0}.
Paperid:859
Authors:Andrea Bertazzi · Tim Johnston · Gareth Roberts · Alain Oliviero Durmus
Abstract:
This paper aims to provide differential privacy (DP) guarantees for Markov chain Monte Carlo (MCMC) algorithms. In the first part, we establish DP guarantees on samples output by MCMC algorithms as well as Monte Carlo estimators associated with these methods under assumptions on the convergence properties of the underlying Markov chain. In particular, our results highlight the critical condition of ensuring the target distribution is differentially private itself. In the second part, we specialise our analysis to the unadjusted Langevin algorithm and stochastic gradient Langevin dynamics and establish guarantees on their (Rényi) DP. To this end, we develop a novel methodology based on Girsanov's theorem combined with a perturbation trick to obtain bounds for an unbounded domain and in a non-convex setting. We establish: (i) uniform-in-$n$ privacy guarantees when the state of the chain after $n$ iterations is released, and (ii) bounds on the privacy of the entire chain trajectory. These findings provide concrete guidelines for privacy-preserving MCMC.
Paperid:860
Authors:Thibaut Boissin · Franck Mamalet · Thomas Fel · Agustin Picard · Thomas Massena · Mathieu Serrurier
Abstract:
Orthogonal convolutional layers are valuable components in multiple areas of machine learning, such as adversarial robustness, normalizing flows, GANs, and Lipschitz-constrained models. Their ability to preserve norms and ensure stable gradient propagation makes them valuable for a large range of problems. Despite their promise, deploying orthogonal convolutions in large-scale applications remains a significant challenge due to computational overhead and limited support for modern features like strides, dilations, group convolutions, and transposed convolutions. In this paper, we introduce \textbf{AOC} (Adaptive Orthogonal Convolution), a scalable method for constructing orthogonal convolutions, effectively overcoming these limitations. This advancement unlocks the construction of architectures that were previously considered impractical. We demonstrate through our experiments that our method produces expressive models that become increasingly efficient as they scale. To foster further advancement, we provide an open-source library implementing this method, available at https://anonymous.4open.science/r/orthogonium-DE07.
Paperid:861
Authors:Shreshth Saini · Ru-Ling Liao · Yan Ye · Alan Bovik
Abstract:
Despite recent advancements in latent diffusion models that generate high-dimensional image data and perform various downstream tasks, there has been little exploration into perceptual consistency within these models on the task of No-Reference Image Quality Assessment (NR-IQA). In this paper, we hypothesize that latent diffusion models implicitly exhibit perceptually consistent local regions within the data manifold. We leverage this insight to guide on-manifold sampling using perceptual features and input measurements. Specifically, we propose \textbf{P}erceptual \textbf{M}anifold \textbf{G}uidance (PMG), an algorithm that utilizes pretrained latent diffusion models and perceptual quality features to obtain perceptually consistent multi-scale and multi-timestep feature maps from the denoising U-Net. We empirically demonstrate that these hyperfeatures exhibit high correlation with human perception in IQA tasks. Our method can be applied to any existing pretrained latent diffusion model and is straightforward to integrate. To the best of our knowledge, this paper is the first work on guiding diffusion models with perceptual features for NR-IQA. Extensive experiments on IQA datasets show that our method, LGDM, achieves state-of-the-art performance, underscoring the superior zero-shot generalization capabilities of diffusion models for NR-IQA tasks. The source code will be made publicly available upon publication.
Paperid:862
Authors:Adrian Müller · Jon Schneider · EFSTRATIOS PANTELEIMON SKOULAKIS · Luca Viano · Volkan Cevher
Abstract:
In this paper, we investigate the existence of online learning algorithms with bandit feedback that simultaneously guarantee $O(1)$ regret compared to a given comparator strategy and $O(\sqrt{T})$ regret compared to the best strategy in hindsight, where $T$ is the number of rounds. We provide the first affirmative answer to this question. In the context of symmetric zero-sum games, both in normal and extensive form, we show that our results allow us to risk at most $O(1)$ loss while being able to gain $\Omega(T)$ from exploitable opponents, thereby combining the benefits of both no-regret algorithms and minimax play.
Paperid:863
Authors:Jisu Han · Jaemin Na · Wonjun Hwang
Abstract:
Test-time adaptation aims to adapt to realistic environments in an online manner by learning during test time. Entropy minimization has emerged as a principal strategy for test-time adaptation due to its efficiency and adaptability. Nevertheless, it remains underexplored in continual test-time adaptation, where stability is more important. We observe that the entropy minimization method often suffers from model collapse, where the model converges to predicting a single class for all images due to a trivial solution. We propose ranked entropy minimization to mitigate the stability problem of the entropy minimization method and extend its applicability to continual scenarios. Our approach explicitly structures the prediction difficulty through a progressive masking strategy. Specifically, it gradually aligns the model's probability distributions across different levels of prediction difficulty while preserving the rank order of entropy. The proposed method is extensively evaluated across various benchmarks, demonstrating its effectiveness through empirical results.
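As background for this abstract, the entropy-minimization objective underlying most test-time adaptation methods can be written in a few lines. The sketch below (plain numpy; the batch values are made up, and this is the generic objective, not the paper's ranked variant) also shows why its unconstrained minimizer invites the single-class collapse the abstract describes.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy_loss(logits):
    # Mean Shannon entropy of the softmax predictions; test-time entropy
    # minimization updates the model to reduce this on unlabeled test batches.
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

peaked = np.array([[8.0, 0.0, 0.0], [0.0, 9.0, 0.0]])  # confident batch
uniform_batch = np.zeros((2, 3))                        # maximally uncertain batch
# Confident predictions have low entropy; the degenerate minimizer is a model
# that always predicts one class, i.e. the collapse the abstract warns about.
assert entropy_loss(peaked) < entropy_loss(uniform_batch)
```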
Paperid:864
Authors:Ke Ren · Angelos Georghiou · Peyman Mohajerin Esfahani
Abstract:
We study inverse optimization (IO) where the goal is to use a parametric optimization program as the hypothesis class to infer relationships between input-decision pairs. Most of the literature focuses on learning only the objective function, as learning the constraint function (i.e., feasible regions) leads to nonconvex training programs. Motivated by this, we focus on learning feasible regions for known linear objectives, and introduce two training losses along with a hypothesis class to parameterize the constraint function. Our hypothesis class surpasses the previous objective-only method by naturally capturing discontinuous behaviors in input-decision pairs. We introduce a customized block coordinate descent algorithm with a smoothing technique to solve the training problems, while for further restricted hypothesis classes, we reformulate the training optimization as a tractable convex program or mixed integer linear program. Synthetic experiments and two power system applications including comparisons with state-of-the-art approaches showcase and validate the proposed approach.
Paperid:865
Authors:Matthew Chen · Josh Engels · Max Tegmark
Abstract:
Sparse autoencoders (SAEs) aim to decompose language model representations into a sparse set of linear latent vectors. Recent works have improved SAEs using language model gradients, but these techniques require many expensive backward passes during training and still cause a significant increase in cross entropy loss when SAE reconstructions are inserted into the model. In this work, we improve on these limitations by taking a fundamentally different approach: we use low-rank adaptation (LoRA) to finetune the *language model itself* around a previously trained SAE. We analyze our method across SAE sparsity, SAE width, language model size, LoRA rank, and model layer on the Gemma Scope family of SAEs. In these settings, our method reduces the cross entropy loss gap by 30% - 55% when SAEs are inserted during the forward pass. We also find that compared to end-to-end (e2e) SAEs, our approach achieves the same downstream cross entropy loss 3$\times$ to 20$\times$ faster on Gemma-2-2B and 2$\times$ to 10$\times$ faster on Llama-3.2-1B. We further show that our technique improves downstream metrics and can adapt multiple SAEs at once. Our results demonstrate that improving model interpretability is not limited to post-hoc SAE training; Pareto improvements can also be achieved by directly optimizing the model itself.
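The LoRA parameterization the abstract relies on is standard and easy to state: the frozen weight $W$ is augmented with a trainable low-rank product $(\alpha/r)BA$. The numpy sketch below (illustrative sizes, not Gemma's; this shows the generic mechanism, not the paper's SAE-specific recipe) verifies its two defining properties: zero-initialization makes it an exact no-op, and the learned update has rank at most $r$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 16, 12, 4, 8   # illustrative sizes

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection, zero init

def lora_forward(x, W, A, B, alpha, r):
    # y = W x + (alpha / r) * B A x, the standard LoRA parameterization.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B initialized to zero, LoRA starts as an exact no-op around the base model.
assert np.allclose(lora_forward(x, W, A, B, alpha, r), W @ x)

# After training, the effective weight update has rank at most r.
B = rng.normal(size=(d_out, r))
delta = (alpha / r) * B @ A
assert np.linalg.matrix_rank(delta) <= r
```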
Paperid:866
Authors:Li · Qianqian Xu · Zhiyong Yang · Zitai Wang · Linchao Zhang · Xiaochun Cao · Qingming Huang
Abstract:
Real-world datasets often follow a long-tailed distribution, making generalization to tail classes difficult. Recent methods resorted to long-tail variants of Sharpness-Aware Minimization (SAM), such as ImbSAM and CC-SAM, to improve generalization by flattening the loss landscape. However, these attempts face a trade-off between computational efficiency and control over the loss landscape. On the one hand, ImbSAM is efficient but offers only coarse control as it excludes head classes from the SAM process. On the other hand, CC-SAM provides fine-grained control through class-dependent perturbations but at the cost of efficiency due to multiple backpropagations. To resolve this dilemma, we introduce Focal-SAM, which assigns different penalties to class-wise sharpness, achieving fine-grained control without extra backpropagations, thus maintaining efficiency. Furthermore, we theoretically analyze Focal-SAM's generalization ability and derive a sharper generalization bound. Extensive experiments on both traditional and foundation models validate the effectiveness of Focal-SAM.
Paperid:867
Authors:Yusheng Zhao · Qixin Zhang · Xiao Luo · Junyu Luo · Wei Ju · Zhiping Xiao · Ming Zhang
Abstract:
Test-time adaptation aims to adapt a well-trained model using test data only, without accessing training data. While existing works on test-time adaptation primarily focus on Euclidean data, research on non-Euclidean graph data remains scarce. Prevalent graph neural network methods could encounter serious performance degradation in the face of test-time domain shifts. In this work, we propose a novel method named Adaptive Subgraph-based Selection and Regularized Prototype Supervision (ASSESS) for reliable test-time adaptation on graphs. Specifically, to achieve flexible selection of reliable test graphs, ASSESS adopts an adaptive selection strategy based on fine-grained individual-level subgraph mutual information. Moreover, to utilize the information from both training and test graphs, ASSESS constructs semantic prototypes from the well-trained model as prior knowledge from the unknown training graphs and optimizes the posterior given the unlabeled test graphs. We also provide a theoretical analysis of the proposed algorithm. Extensive experiments verify the effectiveness of ASSESS.
Paperid:868
Authors:Zeyu Gan · Yun Liao · Yong Liu
Abstract:
Test-time scaling, often referred to as slow-thinking, has been demonstrated to enhance multi-step reasoning in large language models (LLMs). However, despite its widespread utilization, the mechanisms underlying slow-thinking methods remain poorly understood. This paper explores the mechanisms of external slow-thinking from a theoretical standpoint. We begin by examining the snowball error effect within the LLM reasoning process and connect it to the likelihood of correct reasoning using information theory. Building on this, we show that external slow-thinking methods can be interpreted as strategies to mitigate the error probability. We further provide a comparative analysis of popular external slow-thinking approaches, ranging from simple to complex, highlighting their differences and interrelationships. Our findings suggest that the efficacy of these methods is not primarily determined by the specific framework employed, and that expanding the search scope or the model's internal reasoning capacity may yield more sustained improvements in the long term. We open-source our code through an anonymous repository https://anonymous.4open.science/r/Snowball-Errors-and-Probability.
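The snowball error effect the abstract references has a simple toy form: if each reasoning step errs independently, the chance the whole chain survives decays geometrically with length, and sampling more chains (one reading of "expanding the search scope") mitigates it. This is an illustrative independence model, not the paper's information-theoretic analysis.

```python
def p_correct_chain(per_step_error, steps):
    # Probability an entire chain is correct if each of `steps` steps
    # independently errs with probability `per_step_error`.
    return (1.0 - per_step_error) ** steps

def p_correct_best_of(per_step_error, steps, samples):
    # Best-of-n succeeds if any of `samples` independent chains is correct.
    p = p_correct_chain(per_step_error, steps)
    return 1.0 - (1.0 - p) ** samples

# Error snowballs with chain length...
assert p_correct_chain(0.05, 20) < p_correct_chain(0.05, 5)
# ...and wider search mitigates it.
assert p_correct_best_of(0.05, 20, 8) > p_correct_chain(0.05, 20)
```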
Paperid:869
Authors:Samira Abnar · Harshay Shah · Dan Busbridge · Alaaeldin Ali · Joshua M Susskind · Vimal Thilak
Abstract:
Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be primarily defined by two dimensions: the number of model parameters and the compute per example. While scaling typically involves increasing both, the precise interplay between these factors and their combined contribution to overall capacity remains not fully understood. We explore this relationship in the context of sparse Mixture-of-Experts models (MoEs), which allow scaling the number of parameters without proportionally increasing the FLOPs per example. We investigate how varying the sparsity level, i.e., the ratio of non-active to total parameters, affects model performance in terms of both pretraining and downstream performance. We find that under different constraints (e.g. parameter size and total training compute), there is an optimal level of sparsity that improves both training efficiency and model performance. These results provide a better understanding of the impact of sparsity in scaling laws for MoEs and complement existing works in this area, offering insights for designing more efficient architectures.
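The sparsity level the abstract studies, the ratio of non-active to total parameters, reduces to simple accounting for a routed MoE layer. A minimal sketch (all sizes below are made up for illustration):

```python
def moe_stats(n_experts, top_k, expert_params, shared_params):
    # Total parameters of a sparse MoE layer vs. the parameters that are
    # active per token (a rough proxy for FLOPs per example).
    total = shared_params + n_experts * expert_params
    active = shared_params + top_k * expert_params
    sparsity = (total - active) / total  # ratio of non-active to total params
    return total, active, sparsity

# 64 experts, 2 routed per token, 1M params per expert, 0.5M shared.
total, active, sparsity = moe_stats(64, 2, 1_000_000, 500_000)
assert total == 64_500_000 and active == 2_500_000
assert 0.96 < sparsity < 0.97  # parameters scale 26x faster than per-token compute
```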
Paperid:870
Authors:Jinyu Cai · Yunhe Zhang · Jicong Fan
Abstract:
Identifying anomalous graphs is essential in real-world scenarios such as molecular and social network analysis, yet anomalous samples are generally scarce or unavailable. This paper proposes a Self-Discriminative Modeling (SDM) framework that trains a deep neural network only on normal graphs to detect anomalous graphs. The neural network simultaneously learns to construct pseudo-anomalous graphs from normal graphs and learns an anomaly detector to recognize these pseudo-anomalous graphs. As a result, these pseudo-anomalous graphs interpolate between normal graphs and real anomalous graphs, which leads to a reliable decision boundary of anomaly detection. In this framework, we develop three algorithms with different computational efficiencies and stabilities for anomalous graph detection. Extensive experiments on 12 different graph benchmarks demonstrate that the three variants of SDM consistently outperform the state-of-the-art GLAD baselines. The success of our methods stems from the integration of the discriminative classifier and the well-posed pseudo-anomalous graphs, which provides new insights for graph-level anomaly detection.
Paperid:871
Authors:Jiaqi Yan · Changping Wang · De Ma · Huajin Tang · Qian Zheng · Gang Pan
Abstract:
Spiking Neural Networks (SNNs) are considered promising energy-efficient models due to their spatial-temporal information processing capability. Existing work has demonstrated that SNNs exhibit temporal heterogeneity, which leads to diverse outputs of SNNs at different time steps and has the potential to enhance their performance. Although SNNs obtained by direct training methods achieve state-of-the-art performance, current methods introduce limited temporal heterogeneity through the dynamics of spiking neurons and do not improve temporal heterogeneity from the perspective of gradients. In this paper, we first conclude that the diversity of the partial derivative of the loss function with respect to the logits in current methods is limited. This results in insufficient temporal heterogeneity and further limits the performance. Based on the above analysis, we propose a temporal model calibration method, which can be seen as a gradient rescaling mechanism of the cross-entropy loss at each time step. Experimental results show that our method can improve the temporal heterogeneity and performance of SNNs. In particular, our method achieves state-of-the-art accuracy on ImageNet, DVS-CIFAR10, and N-Caltech101, with respective accuracies of 85.83\%, 87.63\%, and 86.04\%.
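The quantity the abstract analyzes, the partial derivative of the cross-entropy loss with respect to the per-time-step logits, is the familiar softmax-minus-onehot term, and "gradient rescaling" amounts to a per-step weight on it. The numpy sketch below illustrates that idea with invented weights; it is not the paper's exact calibration scheme.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ce_grad_per_step(logits_t, onehot, weights):
    # Gradient of cross-entropy w.r.t. the logits at each time step,
    # rescaled by a per-step weight w_t. Uniform weights recover the usual
    # time-averaged loss; non-uniform weights diversify per-step gradients.
    # logits_t: (T, C), onehot: (C,), weights: (T,).
    p = softmax(logits_t)
    return weights[:, None] * (p - onehot[None, :])

T, C = 4, 3
logits = np.tile(np.array([2.0, 0.0, -1.0]), (T, 1))  # same logits every step
y = np.array([1.0, 0.0, 0.0])

g_uniform = ce_grad_per_step(logits, y, np.full(T, 1.0 / T))
# Identical logits at every step give identical (low-diversity) gradients...
assert np.allclose(g_uniform[0], g_uniform[1])
# ...whereas per-step rescaling makes them differ across time steps.
g_skewed = ce_grad_per_step(logits, y, np.array([0.1, 0.2, 0.3, 0.4]))
assert not np.allclose(g_skewed[0], g_skewed[1])
```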
Paperid:872
Authors:Oliver Sieberling · Denis Kuznedelev · Eldar Kurtic · Dan Alistarh
Abstract:
The high computational costs of large language models (LLMs) have led to a flurry of research on LLM compression, via methods such as quantization, sparsification, or structured pruning. A new frontier in this area is given by dynamic, non-uniform compression methods, which adjust the compression levels (e.g., sparsity) per-block or even per-layer in order to minimize accuracy loss, while guaranteeing a global compression threshold. Yet, current methods rely on estimating the "importance" of a given layer, implicitly assuming that layers contribute independently to the overall compression error. We begin from the motivating observation that this independence assumption does not generally hold for LLM compression: pruning a model further may even significantly recover performance. To address this, we propose EvoPress, a novel evolutionary framework for dynamic LLM compression. By formulating dynamic compression as a general optimization problem, EvoPress identifies optimal compression profiles in a highly efficient manner, and generalizes across diverse models and compression techniques. Via EvoPress, we achieve state-of-the-art performance for dynamic compression of Llama, Mistral, and Phi models, setting new benchmarks for structural pruning (block/layer dropping), unstructured sparsity, and quantization with dynamic bitwidths.
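The search problem the abstract describes, choosing a per-layer compression profile under a global budget when layers interact, can be sketched with a tiny (1+1)-style evolutionary loop. Everything below is a toy: the quadratic loss model with a coupling term stands in for real accuracy loss and merely mimics why independent per-layer importance scores mislead; it is not EvoPress itself.

```python
import random

random.seed(0)
LEVELS = [0.0, 0.25, 0.5, 0.75]  # candidate per-layer sparsity levels
N_LAYERS, BUDGET = 6, 0.5        # mean sparsity must reach the budget

def loss(profile):
    # Toy accuracy-loss model with layer *interactions* (the coupling term),
    # so layers do not contribute independently to total error.
    base = sum((i + 1) * s * s for i, s in enumerate(profile))
    coupling = sum(profile[i] * profile[i + 1] for i in range(len(profile) - 1))
    return base + 2.0 * coupling

def feasible(profile):
    return sum(profile) / len(profile) >= BUDGET

def mutate(profile):
    child = list(profile)
    child[random.randrange(N_LAYERS)] = random.choice(LEVELS)
    return child

# Evolutionary search: keep the best feasible profile seen so far.
best = [0.5] * N_LAYERS
for _ in range(2000):
    child = mutate(best)
    if feasible(child) and loss(child) < loss(best):
        best = child

assert feasible(best) and loss(best) <= loss([0.5] * N_LAYERS)
```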
Paperid:873
Authors:Casper Gyurik · Riccardo Molteni · Vedran Dunjko
Abstract:
In recent times, there have been major developments in two distinct yet connected domains of quantum information. On the one hand, substantial progress has been made in so-called randomized measurement protocols. Here, a number of properties of unknown quantum states can be deduced from surprisingly few measurement outcomes, using schemes such as classical shadows. On the other hand, significant progress has been made in quantum machine learning. For example, exponential advantages have been proven when the data consists of quantum states and quantum algorithms can coherently measure multiple copies of input states. In this work, we aim to understand the implications and limitations of combining randomized measurement protocols with quantum machine learning, although the implications are broader. Specifically, we investigate quantum machine learning algorithms that, when dealing with quantum data, can either process it entirely using quantum methods or measure the input data through a fixed measurement scheme and utilize the resulting classical information. We prove limitations for the general class of quantum machine learning algorithms that use fixed measurement schemes on the input quantum states. Our results have several implications. From the perspective of randomized measurement procedures, we show limitations of measure-first protocols in the average case, improving on the state-of-the-art which only focuses on worst-case scenarios. Additionally, previous lower bounds were only known for physically unrealizable states. We improve upon this by employing quantum pseudorandom functions to prove that a learning separation also exists when dealing with physically realizable states, which may be encountered in experiments. From a machine learning perspective, our results are crucial for defining a physically meaningful task that shows fully quantum machine learning processing is not only more efficient but also necessary for solving certain problems.
The tasks at hand are also realistic, as the algorithms and proven separations hold when working with efficiently preparable states and remain robust in the presence of measurement and preparation errors.
Paperid:874
Authors:Zi Liang · Haibo Hu · Qingqing Ye · Yaxin XIAO · RongHua Li
Abstract:
Low-rank adaptation (LoRA) has emerged as a prominent technique for fine-tuning large language models (LLMs) thanks to its superb efficiency gains over previous methods. While extensive studies have examined the performance and structural properties of LoRA, its behavior under training-time attacks remains underexplored, posing significant security risks. In this paper, we theoretically investigate the security implications of LoRA's low-rank structure during fine-tuning, in the context of its robustness against data poisoning and backdoor attacks. We propose an analytical framework that models LoRA’s training dynamics, employs the neural tangent kernel to simplify the analysis of the training process, and applies information theory to establish connections between LoRA's low-rank structure and its vulnerability against training-time attacks. Our analysis indicates that LoRA exhibits better robustness to backdoor attacks than full fine-tuning, while becoming more vulnerable to untargeted data poisoning due to its over-simplified information geometry. Extensive experimental evaluations have corroborated our theoretical findings.
Paperid:875
Authors:Jinho Chang · Jong Chul YE
Abstract:
With the emergence of diffusion models as a frontline generative model, many researchers have proposed molecule generation techniques with conditional diffusion models. However, the unavoidable discreteness of a molecule makes it difficult for a diffusion model to connect raw data with highly complex conditions like natural language. To address this, here we present a novel latent diffusion model dubbed LDMol for text-conditioned molecule generation. By recognizing that the suitable latent space design is the key to the diffusion model performance, we employ a contrastive learning strategy to extract a novel feature space from text data that embeds the unique characteristics of the molecule structure. Experiments show that LDMol outperforms the existing autoregressive baselines on the text-to-molecule generation benchmark, being one of the first diffusion models that outperforms autoregressive models in textual data generation with a better choice of the latent domain. Furthermore, we show that LDMol can be applied to downstream tasks such as molecule-to-text retrieval and text-guided molecule editing, demonstrating its versatility as a diffusion model.
Paperid:876
Authors:Yuhui Ding · Thomas Hofmann
Abstract:
Equivariant diffusion models have achieved impressive performance in 3D molecule generation. These models incorporate the Euclidean symmetry of 3D molecules by utilizing an SE(3)-equivariant denoising network. However, the specialized equivariant architecture limits the scalability and efficiency of diffusion models. In this paper, we propose an approach to relax the equivariance constraint. Specifically, our approach learns a sample-dependent SO(3) transformation for each molecule to construct an aligned latent space. A non-equivariant diffusion model is then trained on the aligned representations. Experimental results demonstrate that our approach performs significantly better than previously reported non-equivariant models. It yields sample quality comparable to state-of-the-art equivariant diffusion models and offers improved training and sampling efficiency.
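One classical way to build a sample-dependent SO(3) alignment of the kind the abstract mentions is the Kabsch algorithm, which finds the rotation best aligning one point set onto another via an SVD. The paper *learns* its transformation, so the numpy sketch below is only an analogy for the alignment idea, with randomly generated points.

```python
import numpy as np

def kabsch_rotation(P, Q):
    # Optimal rotation R (det = +1) aligning centered point set P onto Q,
    # via SVD of the cross-covariance H = P^T Q.
    H = P.T @ Q
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])  # reflection fix keeps R in SO(3)
    return Vt.T @ D @ U.T

rng = np.random.default_rng(1)
Q = rng.normal(size=(10, 3))
Q -= Q.mean(axis=0)            # center the reference point set

# Build a random rotation and check Kabsch recovers the alignment exactly.
R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(R_true) < 0:
    R_true[:, 0] *= -1
P = Q @ R_true.T               # rotated copy of Q
R = kabsch_rotation(P, Q)
assert np.allclose((R @ P.T).T, Q, atol=1e-8)
```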
Paperid:877
Authors:Ilya Kaufman · Omri Azencot
Abstract:
Deep learning models with a large number of parameters, often referred to as overparameterized models, have achieved exceptional performance across various tasks. Despite concerns about overfitting, these models frequently generalize well to unseen data, thanks to effective regularization techniques, with data augmentation being among the most widely used. While data augmentation has shown great success in classification tasks using label-preserving transformations, its application in regression problems has received less attention. Recently, a novel manifold learning approach for generating synthetic data was proposed, utilizing a first-order approximation of the data manifold. Building on this foundation, we present a theoretical framework and practical tools for approximating and sampling general data manifolds. Furthermore, we introduce the Curvature-Enhanced Manifold Sampling (CEMS) method for regression tasks. CEMS leverages a second-order representation of the data manifold to enable efficient sampling and reconstruction of new data points. Extensive evaluations across multiple datasets and comparisons with state-of-the-art methods demonstrate that CEMS delivers superior performance in both in-distribution and out-of-distribution scenarios, while introducing only minimal computational overhead.
Paperid:878
Authors:Thomas Zeng · Shuibai Zhang · Shutong Wu · Christian Classen · Daewon Chae · Ethan Ewer · Minjae Lee · Heeju Kim · Wonjun Kang · Jackson Kunde · Ying Fan · Jungtaek Kim · HYUNG IL KOO · Kannan Ramchandran · Dimitris Papailiopoulos · Kangwook Lee
Abstract:
Process Reward Models (PRMs) have proven effective at enhancing mathematical reasoning for Large Language Models (LLMs) by leveraging increased inference-time computation. However, they are predominantly trained on mathematical data and their generalizability to non-mathematical domains has not been rigorously studied. In response, this work first shows that current PRMs have poor performance in other domains. To address this limitation, we introduce VersaPRM, a multi-domain PRM trained on synthetic reasoning data generated using our novel data generation and annotation method. VersaPRM achieves consistent performance gains across diverse domains. For instance, in the MMLU-Pro category of Law, VersaPRM via weighted majority voting achieves a 7.9% performance gain over the majority voting baseline, surpassing Qwen2.5-Math-PRM's gain of 1.3%. We further contribute to the community by open-sourcing all data, code and models for VersaPRM.
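The weighted majority voting used to apply a PRM at inference time has a one-function core: sum a reward score per sampled answer instead of counting votes. A minimal sketch with made-up answers and scores:

```python
from collections import defaultdict

def weighted_majority_vote(answers, reward_scores):
    # Aggregate sampled answers by summing a reward-model score per answer,
    # rather than counting one vote per sample.
    totals = defaultdict(float)
    for ans, score in zip(answers, reward_scores):
        totals[ans] += score
    return max(totals, key=totals.get)

answers = ["42", "41", "42", "41", "41"]
# Plain majority voting picks "41"...
assert max(set(answers), key=answers.count) == "41"
# ...but a PRM that scores the "42" chains highly flips the decision.
assert weighted_majority_vote(answers, [0.9, 0.2, 0.8, 0.1, 0.2]) == "42"
```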
Paperid:879
Authors:Kaiwen Wang · Dawen Liang · Nathan Kallus · Wen Sun
Abstract:
We study risk-sensitive RL, where the goal is to learn a history-dependent policy that optimizes some risk measure of cumulative rewards. We consider a family of risks called the optimized certainty equivalents (OCE), which captures important risk measures such as conditional value-at-risk (CVaR), entropic risk, and Markowitz's mean-variance. In this setting, we propose two meta-algorithms: one grounded in optimism and another based on policy gradients, both of which can leverage the broad suite of risk-neutral RL algorithms in an augmented Markov Decision Process (MDP). Via a reductions approach, we leverage theory for risk-neutral RL to establish novel OCE bounds in complex, rich-observation MDPs. For the optimism-based algorithm, we prove bounds that generalize prior results in CVaR RL and that provide the first risk-sensitive bounds for exogenous block MDPs. For the gradient-based algorithm, we establish both monotone improvement and global convergence guarantees under a discrete reward assumption. Finally, we empirically show that our algorithms learn the optimal history-dependent policy in a proof-of-concept MDP, where all Markovian policies provably fail.
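CVaR, one member of the OCE family the abstract names, can be computed from samples via the Rockafellar-Uryasev variational form $\sup_b \{ b - \mathbb{E}[(b - X)_+]/\alpha \}$; for empirical data the supremum is attained at one of the samples. A minimal sketch with made-up reward samples:

```python
def cvar(samples, alpha):
    # Empirical CVaR_alpha of rewards via sup_b { b - E[(b - X)+] / alpha }.
    # The objective is piecewise-linear and concave in b, so the optimum
    # is attained at one of the sample values.
    n = len(samples)
    best = -float("inf")
    for b in samples:
        val = b - sum(max(b - x, 0.0) for x in samples) / (alpha * n)
        best = max(best, val)
    return best

rewards = [1.0, 2.0, 3.0, 4.0]
# With alpha = 0.5, CVaR is the mean of the worst half: (1 + 2) / 2 = 1.5.
assert abs(cvar(rewards, 0.5) - 1.5) < 1e-12
# alpha -> 1 recovers the risk-neutral mean.
assert abs(cvar(rewards, 1.0) - 2.5) < 1e-12
```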
Paperid:880
Authors:Elliot Myunghoon Kim · Avi Garg · Kenny Peng · Nikhil Garg
Abstract:
While ecosystem diversity (e.g., in training data or model architecture) is often implicitly assumed to mitigate systemic bias and enable the benefits of multi-model collaboration, we lack empirical evidence about whether different LLMs are meaningfully different. We first conduct a large-scale empirical study of correlation across LLMs, using responses from models across multiple leaderboards. We find substantial correlation in model errors: on one leaderboard dataset, for example, models on average agree on the same wrong answer on 60\% of questions on which both are incorrect. We identify factors driving model correlation, including shared base model architecture and development organization. Crucially, however, independent components are insufficient for independent generations: larger, more accurate models are highly correlated, even when both err. We further examine the downstream impact of model correlation using a dataset of job descriptions and resumes, finding again that larger models and models from the same organization produce more correlated results. Experiments on simulated matching markets illustrate the impact of this ``algorithmic monoculture.''
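The headline statistic, agreement on the same wrong answer among jointly incorrect questions, is straightforward to compute. A minimal sketch with made-up answer sheets:

```python
def joint_error_agreement(answers_a, answers_b, gold):
    # Among questions BOTH models get wrong, the fraction where they give
    # the *same* wrong answer.
    both_wrong = [(a, b) for a, b, g in zip(answers_a, answers_b, gold)
                  if a != g and b != g]
    if not both_wrong:
        return 0.0
    return sum(a == b for a, b in both_wrong) / len(both_wrong)

gold = ["A", "B", "C", "D", "A"]
m1   = ["A", "C", "C", "A", "B"]
m2   = ["A", "C", "C", "B", "B"]
# Both wrong on questions 2, 4, 5; same wrong answer on questions 2 and 5.
assert abs(joint_error_agreement(m1, m2, gold) - 2 / 3) < 1e-12
```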
Paperid:881
Authors:Lexiang Hu · Yikang Li · Zhouchen Lin
Abstract:
Symmetry is widely applied in problems such as the design of equivariant networks and the discovery of governing equations, but in complex scenarios, it is not known in advance. Most previous symmetry discovery methods are limited to linear symmetries, and recent attempts to discover nonlinear symmetries fail to explicitly obtain the Lie algebra subspace. In this paper, we propose LieNLSD, which is, to our knowledge, the first method capable of determining the number of infinitesimal generators with nonlinear terms and their explicit expressions. We specify a function library for the infinitesimal group action and aim to solve for its coefficient matrix, proving that its prolongation formula for differential equations, which govern dynamic data, is also linear with respect to the coefficient matrix. By substituting the central differences of the data and the Jacobian matrix of the trained neural network into the infinitesimal criterion, we get a system of linear equations for the coefficient matrix, which can then be solved using SVD. On top quark tagging and a series of dynamic systems, LieNLSD shows qualitative advantages over existing methods and improves the long rollout accuracy of neural PDE solvers by over $20\%$ while also applying to guide data augmentation.
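The final step the abstract describes, solving a homogeneous linear system for the coefficient matrix via SVD, amounts to extracting a null-space basis: the right singular vectors with (numerically) zero singular values solve $Mc = 0$, and the null-space dimension counts the generators. The matrix below is a generic example, not the paper's actual linearized criterion.

```python
import numpy as np

def nullspace(M, tol=1e-10):
    # Basis for {c : M c = 0} via SVD: right singular vectors whose
    # singular values are below `tol`. Columns of the result span the
    # solution space; its width is the null-space dimension.
    _, s, Vt = np.linalg.svd(M)
    rank = int((s > tol).sum())
    return Vt[rank:].T  # shape (n, n - rank)

# A rank-2 map on R^3 has a one-dimensional null space.
M = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])  # third row = first + second
N = nullspace(M)
assert N.shape[1] == 1
assert np.allclose(M @ N, 0.0, atol=1e-9)
```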
Paperid:882
Authors:Yufei Huang · Changhu Wang · Junjie Tang · Weichi Wu · Ruibin Xi
Abstract:
Traditional network inference methods, such as Gaussian Graphical Models, which are built on continuity and homogeneity, face challenges when modeling discrete data and heterogeneous frameworks. Furthermore, under high-dimensionality, the parameter estimation of such models can be hindered by the notorious intractability of high-dimensional integrals. In this paper, we introduce a new and flexible device for graphical models, which accommodates diverse data types, including Gaussian, Poisson log-normal, and latent Gaussian copula models. The new device is driven by a new marginally recoverable parametric family, which can be effectively estimated without evaluating high-dimensional integrals, thanks to the marginal recoverability. We further introduce a mixture of marginally recoverable models to capture ubiquitous heterogeneous structures. We show the validity of the desirable properties of the models and the effective estimation methods, and demonstrate their advantages over the state-of-the-art network inference methods via extensive simulation studies and a gene regulatory network analysis of real single-cell RNA sequencing data.
Paperid:883
Authors:Anne Wu · Laurent Mazaré · Neil Zeghidour · Alexandre Défossez
Abstract:
We propose a novel preference alignment framework for improving spoken dialogue models on real-time conversations from user interactions. Current preference learning methods primarily focus on text-based language models, and are not directly suited to the complexities of real-time speech interactions, with richer dynamics (e.g. interruption, interjection) and no explicit segmentation between speaker turns. We create a large-scale dataset of more than 150,000 preference pairs from raw multi-turn speech conversations, annotated with AI feedback, to cover preferences over both linguistic content and temporal context variations. We leverage offline alignment methods to finetune a full-duplex autoregressive speech-to-speech model. Extensive experiments demonstrate that feedback on generic conversations can be consistently effective in improving spoken dialogue models to produce more factual, safer and more contextually aligned interactions. We deploy the finetuned model and conduct holistic human evaluations to assess the impact beyond single-turn conversations. Our findings shed light on the importance of a well-calibrated balance among various dynamics, crucial for natural real-time speech dialogue systems.
Paperid:884
Authors:Andi Han · Wei Huang · Zhanpeng Zhou · Gang Niu · Wuyang Chen · Junchi Yan · Akiko Takeda · Taiji Suzuki
Abstract:
Deep learning with noisy labels presents significant challenges. In this work, we theoretically characterize the role of label noise from a feature learning perspective. Specifically, we consider a signal-noise data distribution, where each sample comprises a label-dependent signal and label-independent noise, and rigorously analyze the training dynamics of a two-layer convolutional neural network under this data setup, along with the presence of label noise. Our analysis identifies two key stages. In Stage I, the model perfectly fits all the clean samples (i.e., samples without label noise) while ignoring the noisy ones (i.e., samples with noisy labels). During this stage, the model learns the signal from the clean samples, which generalizes well on unseen data. In Stage II, as the training loss converges, the gradient in the direction of noise surpasses that of the signal, leading to overfitting on noisy samples. Eventually, the model memorizes the noise present in the noisy samples and degrades its generalization ability. Furthermore, our analysis provides a theoretical basis for two widely used techniques for tackling label noise: early stopping and sample selection. Experiments on both synthetic and real-world setups validate our theory.
Paperid:885
Authors:Tzu-Tao (Tommy) Chang · Shivaram Venkataraman
Abstract:
Cross-attention is commonly adopted in multimodal large language models (MLLMs) for integrating visual information into the language backbone. However, in applications with large visual inputs, such as video understanding, processing a large number of visual tokens in cross-attention layers leads to high memory demands and often necessitates distributed computation across multiple GPUs. Existing distributed attention mechanisms face significant communication overheads, making cross-attention layers a critical bottleneck for efficient training and inference of MLLMs. To address this, we propose LV-XAttn, a distributed, exact cross-attention mechanism with minimal communication overhead. We observe that in applications involving large visual inputs the size of the query block is typically much smaller than that of the key-value blocks. Thus, in LV-XAttn we keep the large key-value blocks locally on each GPU and exchange smaller query blocks across GPUs. We also introduce an efficient activation recomputation technique enabling support for longer visual context. We theoretically analyze the communication benefits of LV-XAttn and show that it can achieve speedups for a wide range of models. Our evaluations with mPLUG-Owl3 and OpenFlamingo models find that LV-XAttn achieves up to 5.58$\times$ end-to-end speedup compared to existing approaches.
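Exactness when key-value blocks stay sharded across devices rests on the standard online-softmax merge: each shard returns its partial attention output plus log-sum-exp statistics, and these combine into exactly the full-sequence result. The numpy sketch below demonstrates that merge rule on random data; it illustrates the enabling identity, not LV-XAttn's communication schedule.

```python
import numpy as np

def partial_attn(q, K, V):
    # Attention of one query against a local key-value shard, returning the
    # normalized output plus the statistics (max logit m, softmax mass s)
    # needed to merge shards exactly.
    logits = K @ q
    m = logits.max()
    w = np.exp(logits - m)
    s = w.sum()
    return w @ V / s, m, s

def merge(parts):
    # Exact combination of per-shard partial attentions (online-softmax merge):
    # rescale each shard's mass to a common max logit, then mix the outputs.
    m_star = max(m for _, m, _ in parts)
    scale = [s * np.exp(m - m_star) for _, m, s in parts]
    total = sum(scale)
    return sum(c * o for (o, _, _), c in zip(parts, scale)) / total

rng = np.random.default_rng(2)
q = rng.normal(size=4)
K = rng.normal(size=(12, 4))
V = rng.normal(size=(12, 4))

full, _, _ = partial_attn(q, K, V)               # single-device reference
merged = merge([partial_attn(q, K[:5], V[:5]),   # shard resident on "GPU 0"
                partial_attn(q, K[5:], V[5:])])  # shard resident on "GPU 1"
assert np.allclose(merged, full)
```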
Paperid:886
Authors:Zongqian Wu · Baoduo Xu · Ruochen Cui · Mengmeng Zhan · Xiaofeng Zhu · Lei Feng
Abstract:
Chain-of-thought (CoT) reasoning has emerged as an effective approach for activating latent capabilities in LLMs. Interestingly, we observe that both CoT reasoning and self-training share the same core objective: iteratively leveraging model-generated information to progressively reduce prediction uncertainty. Building on this insight, we propose a novel CoT framework to improve reasoning performance. Our framework integrates two key components: (i) a task-specific prompt module that optimizes initial reasoning processes, and (ii) an adaptive reasoning iteration module that dynamically refines the reasoning process and addresses the limitations of previous CoT approaches, i.e., over-reasoning and high similarity between consecutive reasoning iterations. Extensive experiments demonstrate that the proposed method achieves significant advantages in both performance and computational efficiency.
Paperid:887
Authors:Aryan Gulati · Brando Miranda · Eric Chen · Emily Xia · Kai Fronsdal · Bruno de Moraes Dumont · Sanmi Koyejo
Abstract:
As large language models (LLMs) continue to advance, many existing benchmarks designed to evaluate their reasoning capabilities are becoming saturated. Therefore, we present the Putnam-AXIOM Original benchmark consisting of 522 mathematical problems from the William Lowell Putnam Mathematical Competition. To preserve the Putnam-AXIOM benchmark's validity and mitigate potential data contamination, we created the Putnam-AXIOM Variation benchmark with functional variations of 100 problems. By programmatically altering problem elements like variables and constants, we can generate unlimited novel, equally challenging problems not found online. We see that almost all models have significantly lower accuracy on the variations than on the original problems. Our results reveal that OpenAI's o1-preview, the best performing model, achieves 41.94\% accuracy on Putnam-AXIOM Original but experiences a nearly 20\% reduction in accuracy on the variation dataset compared to the corresponding original problems. Moreover, we explore metrics beyond boxed accuracy to assess models on complex tasks like natural language theorem proving, crucial for evaluating reasoning capabilities in depth, opening the possibility for open-ended evaluation of reasoning strings.
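The idea of programmatically altering problem elements can be sketched as follows; the template, constants, and answer function are hypothetical illustrations, not actual Putnam-AXIOM problems.

```python
import random

def variation(seed):
    # Hypothetical "functional variation": resample the constants of a problem
    # template so the surface form changes while difficulty is preserved, and
    # recompute the ground-truth answer programmatically.
    rng = random.Random(seed)
    n = rng.randint(10, 99)
    problem = f"Compute the sum 1 + 2 + ... + {n}."
    answer = n * (n + 1) // 2
    return problem, answer

for s in range(3):
    print(variation(s))  # three novel, equally difficult instances
```

Because the answer is recomputed from the sampled constants, every variation has a verifiable ground truth even though the exact problem text never appeared online.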
Paperid:888
Authors:Aditya Taparia · Som Sagar · Ransalu Senanayake
Abstract:
Understanding the inner representation of a neural network helps users improve models. Concept-based methods have become a popular choice for explaining deep neural networks post-hoc because, unlike most other explainable AI techniques, they can be used to test high-level visual "concepts" that are not directly related to feature attributes. For instance, the concept of "stripes" is important to classify an image as a zebra. Concept-based explanation methods, however, require practitioners to guess and manually collect multiple candidate concept image sets, making the process labor-intensive and prone to overlooking important concepts. Addressing this limitation, in this paper, we frame concept image set creation as an image generation problem. However, since naively using a standard generative model does not result in meaningful concepts, we devise a reinforcement learning-based preference optimization (RLPO) algorithm that fine-tunes a vision-language generative model from approximate textual descriptions of concepts. Through a series of experiments, we demonstrate our method's ability to efficiently and reliably articulate diverse concepts that are otherwise challenging to craft manually.
Paperid:889
Authors:Fabiola Ricci · Lorenzo Bardone · Sebastian Goldt
Abstract:
Deep neural networks learn structured features from complex, non-Gaussian inputs, but the mechanisms behind this process remain poorly understood. Our work is motivated by the observation that the first-layer filters learnt by deep convolutional neural networks from natural images resemble those learnt by independent component analysis (ICA), a simple unsupervised method that seeks the most non-Gaussian projections of its inputs. This similarity suggests that ICA provides a simple, yet principled model for studying feature learning. Here, we leverage this connection to investigate the interplay between data structure and optimisation in feature learning for the most popular ICA algorithm, FastICA, and stochastic gradient descent (SGD), which is used to train deep networks. We rigorously establish that FastICA requires at least $n\gtrsim d^4$ samples to recover a single non-Gaussian direction from $d$-dimensional inputs on a simple synthetic data model. We show that vanilla online SGD outperforms FastICA, and prove that the optimal sample complexity $n\gtrsim d^2$ can be reached by smoothing the loss, albeit in a data-dependent way. We finally demonstrate the existence of a search phase for FastICA on ImageNet, and discuss how the strong non-Gaussianity of said images compensates for the poor sample complexity of FastICA.
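The one-unit FastICA fixed-point iteration analyzed above can be sketched on synthetic whitened data containing a single non-Gaussian (uniform) direction hidden among Gaussian ones; this toy setup is an illustrative assumption, not the paper's data model or experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20000
# One non-Gaussian (uniform, unit-variance) direction among Gaussian ones.
S = rng.normal(size=(n, d))
S[:, 0] = rng.uniform(-np.sqrt(3), np.sqrt(3), size=n)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random rotation mixes the sources
X = S @ Q.T                                    # whitened by construction
target = Q[:, 0]                               # direction to recover

def fastica_one_unit(X, iters=50):
    # Fixed-point iteration: w <- E[x g(w.x)] - E[g'(w.x)] w, then renormalize,
    # with the common g = tanh nonlinearity.
    g, dg = np.tanh, lambda t: 1 - np.tanh(t) ** 2
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(iters):
        t = X @ w
        w = (X * g(t)[:, None]).mean(axis=0) - dg(t).mean() * w
        w /= np.linalg.norm(w)
    return w

w = fastica_one_unit(X)
print(abs(w @ target))  # close to 1 when the non-Gaussian direction is recovered
```

With far fewer samples the iteration wanders, which is the sample-complexity bottleneck the abstract quantifies.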
Paperid:890
Authors:Hongli Zhan · Muneeza Azmat · Raya Horesh · Junyi Jessy Li · Mikhail Yurochkin
Abstract:
Aligning Large Language Models to integrate and reflect human values, especially for tasks that demand intricate human oversight, is arduous since it is resource-intensive and time-consuming to depend on human expertise for context-specific guidance. Prior work has utilized predefined sets of rules or principles to steer the behavior of models (Bai et al., 2022; Sun et al., 2023). However, these principles tend to be generic, making it challenging to adapt them to each individual input query or context. In this work, we present Situated-PRInciples (SPRI), a framework requiring minimal or no human effort that is designed to automatically generate guiding principles in real-time for each input query and utilize them to align each response. We evaluate SPRI on three tasks, and show that 1) SPRI can derive principles in a complex domain-specific task, leading to performance on par with expert-crafted principles; 2) SPRI-generated principles lead to instance-specific rubrics that outperform prior LLM-as-a-judge frameworks; 3) using SPRI to generate synthetic SFT data leads to substantial improvement on truthfulness.
Paperid:891
Authors:An Vo · Mohammad Reza Taesiri · Anh Nguyen · Daeyoung Kim
Abstract:
Large language models (LLMs) were found to contain strong gender biases (e.g., against females) or numerical biases (e.g., for the number 7). We test whether LLMs are able to output less biased answers when allowed to observe their prior answers to the same question in a multi-turn conversation. For a thorough evaluation of LLM biases across different question types, we propose a set of questions spanning 9 topics across 4 categories: questions that ask for Subjective opinions; Random answers; or objective answers to real-world Easy or Hard questions. Interestingly, LLMs are able to "de-bias" themselves in multi-turn settings in response to Random questions but not the other categories. Furthermore, we propose B-score, a novel metric that is effective in detecting biases to Subjective, Random, Easy, and Hard questions. On MMLU, HLE, and CSQA, leveraging B-score substantially improves the verification accuracy of LLM answers (i.e., accepting LLM correct answers and rejecting incorrect ones) compared to using verbalized confidence scores or single-turn probabilities alone.
Paperid:892
Authors:Lars van der Laan · Ahmed Alaa
Abstract:
Ensuring model calibration is critical for reliable predictions, yet popular distribution-free methods like histogram binning and isotonic regression provide only asymptotic guarantees. We introduce a unified framework for Venn and Venn-Abers calibration, generalizing Vovk's binary classification approach to arbitrary prediction tasks and loss functions. Venn calibration leverages binning calibrators to construct prediction sets that contain at least one marginally perfectly calibrated prediction in finite samples, while these sets generally converge asymptotically to conditionally calibrated point predictions, effectively capturing epistemic uncertainty. Furthermore, we propose Venn multicalibration, a novel methodology for finite-sample calibration across subpopulations. For quantile loss, Mondrian and multicalibrated conformal prediction arise as special cases of Venn calibration, and multicalibration produces intervals achieving quantile-conditional coverage.
Paperid:893
Authors:Xiuyi Jia · Jinchi Li · Yunan Lu · Weiwei Li
Abstract:
Label distribution learning (LDL) is a novel machine learning paradigm that can handle label ambiguity. This paper focuses on the interpretability issue of label distribution learning. Existing local interpretability models are mainly designed for single-label learning problems and are difficult to directly apply to interpreting label distribution learning models. In response to this situation, we propose an improved LIME algorithm that can effectively interpret any black-box model in label distribution learning. To address the label dependency problem, we introduce the feature attribution distribution matrix and derive the solution formula for explanations under the label distribution form. Meanwhile, to enhance the transparency and trustworthiness of the explanation algorithm, we provide an analytical solution and derive the boundary conditions for explanation convergence and stability. In addition, we design a feature selection scoring function and a fidelity metric for the explanation task of label distribution learning. A series of numerical experiments and human experiments were conducted to validate the performance of the proposed algorithm in practical applications. The experimental results demonstrate that the proposed algorithm achieves high fidelity, consistency, and trustworthiness in explaining LDL models.
Paperid:894
Authors:Andres Guzman Cordero · Floor Eijkelboom · Jan-Willem van de Meent
Abstract:
While denoising diffusion and flow matching have driven major advances in generative modeling, their application to tabular data remains limited, despite its ubiquity in real-world applications. To this end, we develop TabbyFlow, a variational Flow Matching (VFM) method for tabular data generation. To apply VFM to data with mixed continuous and discrete features, we introduce Exponential Family Variational Flow Matching (EF-VFM), which represents heterogeneous data types using a general exponential family distribution. We hereby obtain an efficient, data-driven objective based on moment matching, enabling principled learning of probability paths over mixed continuous and discrete variables. We also establish a connection between variational flow matching and generalized flow matching objectives based on Bregman divergences. Evaluation on tabular data benchmarks demonstrates state-of-the-art performance compared to baselines.
Paperid:895
Authors:CAN CHEN · Jun-Kun Wang
Abstract:
Developing algorithms to differentiate between machine-generated texts and human-written texts has garnered substantial attention in recent years. Existing methods in this direction typically concern an offline setting where a dataset containing a mix of real and machine-generated texts is given upfront, and the task is to determine whether each sample in the dataset is from a large language model (LLM) or a human. However, in many practical scenarios, sources such as news websites, social media accounts, or other forums publish content in a streaming fashion. Therefore, in this online scenario, how to quickly and accurately determine whether the source is an LLM with strong statistical guarantees is crucial for these media or platforms to function effectively and prevent the spread of misinformation and other potential misuse of LLMs. To tackle the problem of online detection, we develop an algorithm based on the techniques of sequential hypothesis testing by betting that not only builds upon and complements existing offline detection techniques but also enjoys statistical guarantees, which include a controlled false positive rate and the expected time to correctly identify a source as an LLM. Experiments were conducted to demonstrate the effectiveness of our method.
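Sequential testing by betting can be illustrated with a minimal sketch: a wealth process that multiplies by a bet on each incoming observation and stops once it exceeds $1/\alpha$, which controls the false positive rate via Ville's inequality. The detector scores and fixed betting fraction below are illustrative assumptions, not the paper's method.

```python
import numpy as np

def betting_test(xs, alpha=0.05, lam=0.5):
    # Test H0: E[X] <= 0.5 for observations in [0, 1]. Under H0 the wealth
    # process is a nonnegative supermartingale, so by Ville's inequality
    # P(wealth ever reaches 1/alpha) <= alpha: a controlled false positive rate.
    wealth = 1.0
    for t, x in enumerate(xs, 1):
        wealth *= 1.0 + lam * (x - 0.5)  # bet a fixed fraction each round
        if wealth >= 1.0 / alpha:
            return t                     # stopping time: declare the source an LLM
    return None                          # never enough evidence to reject

rng = np.random.default_rng(0)
llm_scores = rng.uniform(0.3, 1.0, size=2000)    # hypothetical detector scores, mean 0.65
human_scores = rng.uniform(0.0, 1.0, size=2000)  # hypothetical detector scores, mean 0.5

print(betting_test(llm_scores))    # rejects after a modest number of observations
print(betting_test(human_scores))  # usually None: H0 is not rejected
```

Practical schemes adapt the betting fraction `lam` to the data stream rather than fixing it, which shortens the expected time to detection.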
Paperid:896
Authors:Hojoon Lee · Youngdo Lee · Takuma Seno · Donghu Kim · Peter Stone · Jaegul Choo
Abstract:
Scaling up the model size and computation has brought consistent performance improvements in supervised learning. However, this lesson often fails to apply to reinforcement learning (RL) because training the model on non-stationary data easily leads to overfitting and unstable optimization. In response, we introduce SimbaV2, a novel RL architecture designed to stabilize optimization by (i) constraining the growth of weight and feature norms through hyperspherical normalization; and (ii) using distributional value estimation with reward scaling to maintain stable gradients under varying reward magnitudes. Using the soft actor-critic as a base algorithm, SimbaV2 scales up effectively with larger models and greater compute, achieving state-of-the-art performance on 57 continuous control tasks across 4 domains.
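Component (i), hyperspherical normalization, can be sketched as a projection applied after each parameter update; the toy loop below is an assumption-laden illustration of the norm constraint, not the SimbaV2 architecture.

```python
import numpy as np

def hypersphere_project(W, scale=1.0):
    # Constrain each weight vector (row) to a fixed-radius hypersphere,
    # preventing the unbounded norm growth that destabilizes optimization
    # on non-stationary RL data.
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return scale * W / np.maximum(norms, 1e-8)

# Toy update loop: gradient steps would grow the norms; projection keeps them fixed.
rng = np.random.default_rng(0)
W = hypersphere_project(rng.normal(size=(4, 16)))
for _ in range(100):
    grad = rng.normal(size=W.shape)
    W = hypersphere_project(W - 0.1 * grad)

print(np.linalg.norm(W, axis=1))  # every row stays at norm 1.0
```

Because the norms are pinned, the effective learning rate no longer drifts as weights grow, one plausible reading of why the constraint stabilizes training.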
Paperid:897
Authors:Cheng Zhang · Hanna Foerster · Robert Mullins · Yiren Zhao · Ilia Shumailov
Abstract:
It is now a common business practice to buy access to large language model (LLM) inference rather than self-host, because of significant upfront hardware infrastructure and energy costs. However, as a buyer, there is no mechanism to verify the authenticity of the advertised service, including the serving hardware platform, e.g. that it is actually being served using an NVIDIA H100. Furthermore, there are reports suggesting that model providers may deliver models that differ slightly from the advertised ones, often to make them run on less expensive hardware. That way, a client pays a premium for access to a capable model on more expensive hardware, yet ends up being served by a (potentially less capable) cheaper model on cheaper hardware. In this paper we introduce hardware and software platform inference (HSPI), a method for identifying the underlying GPU architecture and software stack of a (black-box) machine learning model solely based on its input-output behavior. Our method leverages the inherent differences of various GPU architectures and compilers to distinguish between different GPU types and software stacks. By analyzing the numerical patterns in the model's outputs, we propose a classification framework capable of accurately identifying the GPU used for model inference as well as the underlying software configuration. Our findings demonstrate the feasibility of inferring GPU type from black-box models. We evaluate HSPI against models served on different real hardware and find that in a white-box setting we can distinguish between different GPUs with between 83.9% and 100% accuracy. Even in a black-box setting we are able to achieve results that are up to three times higher than random guess accuracy.
Paperid:898
Authors:RISHI JINKA · Venkata Sai Mothish Gonugunta · Deepak N. Subramani
Abstract:
Time-series forecasting finds application across domains such as finance, climate science, and energy systems. We introduce the Conditional Diffusion with Nonlinear Data Transformation Model (CN-Diff), a generative framework that employs novel nonlinear transformations and learnable conditions in the forward process for time series forecasting. A new loss formulation for training is proposed, along with a detailed derivation of both the forward and reverse processes. The new additions improve the diffusion model's capacity to capture complex time series patterns, thus simplifying the reverse process. Our novel condition facilitates learning an efficient prior distribution. This also reduces the gap between the true negative log-likelihood and its variational approximation. CN-Diff is shown to perform better than other leading time series models on nine real-world datasets. Ablation studies are conducted to elucidate the role of each component of CN-Diff.
Paperid:899
Authors:Kosuke Sugiyama · Masato Uchida
Abstract:
In many real-world applications, predictive tasks inevitably involve low-quality input features (Weak Features; WFs) which arise due to factors such as misobservations, missingness, or partial observations. While several methods have been proposed to estimate the true values of specific types of WFs and to solve a downstream task, a unified theoretical framework that comprehensively addresses these methods remains underdeveloped. In this paper, we propose a unified framework called Weak Features Learning (WFL), which accommodates arbitrary discrete WFs and a broad range of learning algorithms, and we demonstrate its validity. Furthermore, we introduce a class of algorithms that learn both the estimation model for WFs and the predictive model for a downstream task and perform a generalization error analysis under finite-sample conditions. Our results elucidate the interdependencies between the estimation errors of WFs and the prediction error of a downstream task, as well as the theoretical conditions necessary for the learning approach to achieve consistency. This work establishes a unified theoretical foundation, providing generalization error analysis and performance guarantees, even in scenarios where WFs manifest in diverse forms.
Paperid:900
Authors:Maria Despoina Siampou · Jialiang Li · John Krumm · Cyrus Shahabi · Hua Lu
Abstract:
Encoding geospatial objects is fundamental for geospatial artificial intelligence (GeoAI) applications, which leverage machine learning (ML) models to analyze spatial information. Common approaches transform each object into known formats, like image and text, for compatibility with ML models. However, this process often discards crucial spatial information, such as the object's position relative to the entire space, reducing downstream task effectiveness. Alternative encoding methods that preserve some spatial properties are often devised for specific data objects (e.g., point encoders), making them unsuitable for tasks that involve different data types (i.e., points, polylines, and polygons). To this end, we propose Poly2Vec, a polymorphic Fourier-based encoding approach that unifies the representation of geospatial objects, while preserving the essential spatial properties. Poly2Vec incorporates a learned fusion module that adaptively integrates the magnitude and phase of the Fourier transform for different tasks and geometries. We evaluate Poly2Vec on five diverse tasks, organized into two categories. The first empirically demonstrates that Poly2Vec consistently outperforms object-specific baselines in preserving three key spatial relationships: topology, direction, and distance. The second shows that integrating Poly2Vec into a state-of-the-art GeoAI workflow improves the performance in two popular tasks: population prediction and land use inference.
Paperid:901
Authors:Minh Hieu Nong · Antoine Ledent
Abstract:
Contrastive Representation Learning (CRL) has achieved impressive success in various domains in recent years. Nevertheless, the theoretical understanding of the generalization behavior of CRL is limited. Moreover, to the best of our knowledge, the current literature only analyzes generalization bounds under the assumption that the data tuples used for contrastive learning are independently and identically distributed. However, in practice, we are often limited to a fixed pool of reusable labeled data points, making it inevitable to recycle data across tuples to create sufficiently large datasets. Therefore, the tuple-wise independence condition imposed by previous works is invalidated. In this paper, we provide a generalization analysis for the CRL framework under non-i.i.d. settings that adheres to practice more realistically. Drawing inspiration from the literature on U-statistics, we derive generalization bounds which indicate that the required number of samples in each class scales as the logarithm of the covering number of the class of learnable feature representations associated with each class. Next, we apply our main results to derive excess risk bounds for common function classes such as linear functions and neural networks.
Paperid:902
Authors:Yiwei Wu · Atticus Geiger · Raphaël Millière
Abstract:
Variable binding -- the ability to associate variables with values -- is fundamental to symbolic computation and cognition. While typically implemented via addressable memory in classical architectures, we are only beginning to uncover how modern neural networks without built-in binding operations can acquire this capacity. We investigate further by training a Transformer on the task of dereferencing queried variables in synthetic programs where variables are assigned either numerical constants or other variables. Each program requires following chains of variable assignments up to four steps deep to find the queried value, and also contains irrelevant chains of assignments acting as distractors. Our analysis reveals a developmental trajectory with three distinct phases during training: (1) random prediction of numerical constants, (2) a shallow heuristic prioritizing early variable assignments, and (3) the emergence of a systematic mechanism for dereferencing. Using causal interventions, we find that the model learns to exploit the residual stream as an addressable memory space, with specialized attention heads routing information across token positions. This mechanism allows the model to dynamically track variable bindings across layers, culminating in accurate dereferencing. Notably, the network does not store the full program state, suggesting the development of an efficient strategy optimized for the task. Our results demonstrate that Transformer models can learn to implement key aspects of symbolic computation, including systematic variable binding, without explicit architectural support. This work bridges connectionist and symbolic approaches, offering new insights into how neural networks might acquire structured reasoning capabilities.
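The synthetic dereferencing task described above can be sketched as a small generator plus a reference resolver; the program format, variable names, and chain depths below are illustrative assumptions, not the paper's exact setup.

```python
import random

def make_program(depth=3, n_distractors=2, seed=0):
    # Build a chain of assignments ending at a constant, plus irrelevant
    # distractor assignments, then shuffle line order.
    rng = random.Random(seed)
    names = iter("abcdefghijklmnopqrstuvwxyz")
    lines, prev = [], None
    for _ in range(depth + 1):
        v = next(names)
        lines.append(f"{v}={rng.randint(0, 9)}" if prev is None else f"{v}={prev}")
        prev = v
    for _ in range(n_distractors):  # distractor chains of length one
        v = next(names)
        lines.append(f"{v}={rng.randint(0, 9)}")
    rng.shuffle(lines)
    return lines, prev              # program lines + the queried variable

def dereference(lines, query):
    # Ground-truth resolver: follow the assignment chain to a constant.
    env = dict(line.split("=") for line in lines)
    while query in env:
        query = env[query]
    return int(query)

prog, q = make_program()
print(prog, "query:", q)
print(dereference(prog, q))
```

A model trained on such programs must implement exactly the chain-following behavior that `dereference` spells out explicitly.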
Paperid:903
Authors:Konstantinos Oikonomidis · Jan Quan · Emanuel Laude · Panagiotis Patrinos
Abstract:
We analyze nonlinearly preconditioned gradient methods for solving smooth minimization problems. We introduce a generalized smoothness property, based on the notion of abstract convexity, that is broader than Lipschitz smoothness, and provide sufficient first- and second-order conditions. Notably, our framework encapsulates algorithms associated with the clipping gradient method and brings out novel insights for the class of $(L_0,L_1)$-smooth functions that has received widespread interest recently, thus allowing us to go beyond already established methods. We investigate the convergence of the proposed method in both the convex and nonconvex setting.
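Clipping-style step sizes for $(L_0,L_1)$-smooth objectives can be illustrated with a minimal sketch; the particular rule $\eta = 1/(L_0 + L_1\|g\|)$ and the toy objective below are standard illustrations from this literature, not necessarily the preconditioners the paper analyzes.

```python
import numpy as np

def clipped_gd(grad_f, x0, L0, L1, iters=500):
    # Step size eta = 1/(L0 + L1*||g||): behaves like 1/L0 gradient descent
    # when gradients are small and like normalized descent when they are
    # large, matching the (L0, L1)-smoothness model ||Hess f|| <= L0 + L1*||grad f||.
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad_f(x)
        x = x - g / (L0 + L1 * np.linalg.norm(g))
    return x

# Toy objective f(x) = exp(x[0]) + x[1]**2: (L0, L1)-smooth but not
# globally Lipschitz-smooth, since the Hessian grows with exp(x[0]).
grad = lambda x: np.array([np.exp(x[0]), 2 * x[1]])
x = clipped_gd(grad, [5.0, 3.0], L0=2.0, L1=1.0)
print(np.linalg.norm(grad(x)))  # gradient norm driven toward zero
```

With a fixed step size $1/L$ this objective would require $L$ to exceed the largest curvature along the trajectory; the clipped rule adapts automatically.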
Paperid:904
Authors:Lixin Yuan · Yirui Wu · WENXIAO ZHANG · Minglei Yuan · Jun Liu
Abstract:
For data with intra-class Asymmetric instance Distribution or Multiple High-density Clusters (ADMHC), outliers are real and have specific patterns for data classification, where the class body is necessary and difficult to identify. Previous Feature Selection (FS) methods score features based on all training instances or rarely target intra-class ADMHC. In this paper, we propose a supervised FS method, Stray Intrusive Outliers-based FS (SIOFS), for data classification with intra-class ADMHC. By focusing on Stray Intrusive Outliers (SIOs), SIOFS modifies the skewness coefficient and fuses the threshold in the 3$\sigma$ principle to identify the class body, scoring features based on the intrusion degree of SIOs. In addition, the refined density-mean center is proposed to represent the general characteristics of the class body reasonably. Mathematical formulations, proofs, and logical exposition ensure the rationality and universality of the settings in the proposed SIOFS method. Extensive experiments on 15 diverse benchmark datasets demonstrate the superiority of SIOFS over 12 state-of-the-art FS methods in terms of classification accuracy, normalized mutual information, and confusion matrix.
Paperid:905
Authors:Felix Hofstätter · Teun van der Weij · Jayden Teoh · Henning Bartsch · Rada Djoneva · Francis Rhys Ward
Abstract:
Capability evaluations are required to understand and regulate AI systems that may be deployed or further developed. Therefore, it is important that evaluations provide an accurate estimation of an AI system’s capabilities. However, in numerous cases, previously latent capabilities have been elicited from models, sometimes long after initial release. Accordingly, substantial efforts have been made to develop methods for eliciting latent capabilities from models. In this paper, we evaluate the effectiveness of capability elicitation techniques by intentionally training model organisms – language models with hidden capabilities that are revealed by a password. We introduce a novel method for training model organisms, based on circuit-breaking, which is more robust to elicitation techniques than standard password-locked models. We focus on elicitation techniques based on prompting and activation steering, and compare these to fine-tuning methods. Prompting techniques can elicit the actual capability of both password-locked and circuit-broken model organisms in an MCQA setting, while steering fails to do so. For a code-generation task, only fine-tuning can elicit the hidden capabilities of our novel model organism. Additionally, our results suggest that combining techniques improves elicitation. Still, if possible, fine-tuning should be the method of choice to improve the trustworthiness of capability evaluations.
Paperid:906
Authors:Laines Schmalwasser · Niklas Penzel · Joachim Denzler · Julia Niebling
Abstract:
Concepts such as objects, patterns, and shapes are how humans understand the world. Building on this intuition, concept-based explainability methods aim to study representations learned by deep neural networks in relation to human-understandable concepts. Here, Concept Activation Vectors (CAVs) are an important tool and can identify whether a model learned a concept or not. However, the computational cost and time requirements of existing CAV computation pose a significant challenge, particularly in large-scale, high-dimensional architectures. To address this limitation, we introduce FastCAV, a novel approach that accelerates the extraction of CAVs by up to 63.6× (on average 46.4×). We provide a theoretical foundation for our approach and give concrete assumptions under which it is equivalent to established SVM-based methods. Our empirical results demonstrate that CAVs calculated with FastCAV maintain similar performance while being more efficient and stable. In downstream applications, i.e., concept-based explanation methods, we show that FastCAV can act as a replacement leading to equivalent insights. Hence, our approach enables previously infeasible investigations of deep models, which we demonstrate by tracking the evolution of concepts during model training.
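A CAV is a direction in activation space that separates concept examples from random ones. The sketch below uses a difference-of-means direction as a simple stand-in; FastCAV's exact procedure, and the assumptions under which it matches SVM-based CAVs, are given in the paper, and the synthetic "activations" here are illustrative.

```python
import numpy as np

def mean_difference_cav(concept_acts, random_acts):
    # A cheap CAV estimate: the normalized difference of class means in
    # activation space (the classic approach instead fits a linear SVM and
    # takes the hyperplane normal, which is far more expensive at scale).
    v = concept_acts.mean(axis=0) - random_acts.mean(axis=0)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
true_dir = np.array([1.0, 0.0, 0.0, 0.0])     # planted "concept" direction
concept = rng.normal(size=(200, 4)) + 3.0 * true_dir
random_acts = rng.normal(size=(200, 4))

cav = mean_difference_cav(concept, random_acts)
print(cav @ true_dir)  # close to 1: the concept direction is recovered
```

The appeal of such closed-form directions is that they avoid an iterative SVM solve per concept, which is the kind of speedup the abstract reports.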
Paperid:907
Authors:Benjamin Eyre · David Madras
Abstract:
The availability of machine learning systems that can effectively perform arbitrary tasks has led to synthetic labels from these systems being used in applications of statistical inference, such as data analysis or model evaluation. The Prediction Powered Inference (PPI) framework provides a way of leveraging both a large pool of pseudo-labelled data and a small sample with real, high-quality labels to produce a low-variance, unbiased estimate of the quantity being evaluated. Most work on PPI considers a relatively sizable set of labelled samples, which can be resource intensive to obtain. However, we find that when labelled data is scarce, the PPI++ method can perform even worse than classical inference. We analyze this phenomenon by relating PPI++ to ordinary least squares regression, which also experiences high variance with small sample sizes, and use this regression framework to better understand the efficacy of PPI. Motivated by this, we present two new PPI-based techniques that leverage robust regressors to produce even lower variance estimators in the few-label regime.
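The PPI mean estimator combines predictions on a large pseudo-labelled pool with a bias correction from the small labelled sample; a minimal sketch with synthetic data (the `lam` parameter is the knob PPI++ tunes, and the predictor here is an illustrative assumption):

```python
import numpy as np

def ppi_mean(fx_unlab, y_lab, fx_lab, lam=1.0):
    # Prediction-powered estimate of E[Y]: average predictions on the big
    # unlabelled pool, corrected by the labelled sample's prediction error.
    # lam=1 is classic PPI; lam=0 recovers classical inference on labels only.
    return lam * fx_unlab.mean() + (y_lab - lam * fx_lab).mean()

rng = np.random.default_rng(0)
N, n = 100000, 50                              # many pseudo-labels, few real labels
x_unlab, x_lab = rng.normal(size=N), rng.normal(size=n)
f = lambda x: x + 0.2                          # systematically biased predictor
y_lab = x_lab + rng.normal(scale=0.1, size=n)  # true E[Y] = 0

est = ppi_mean(f(x_unlab), y_lab, f(x_lab))
print(est)  # near 0: the predictor's +0.2 bias is cancelled by the correction
```

With only `n = 50` labels, the variance of the correction term dominates, which is exactly the few-label regime where the abstract argues PPI++ can underperform classical inference.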
Paperid:908
Authors:Shahaf Bassan · Guy Amir · Meirav Zehavi · Guy Katz
Abstract:
Ensemble models are widely recognized in the ML community for their limited interpretability. For instance, while a single decision tree is considered interpretable, ensembles of trees (e.g., boosted trees) are often treated as black boxes. Despite this folklore recognition, there remains a lack of rigorous mathematical understanding of what particularly makes an ensemble (un)-interpretable, including how fundamental factors like the (1) *number*, (2) *size*, and (3) *type* of base models influence its interpretability. In this work, we seek to bridge this gap by applying concepts from computational complexity theory to study the challenges of generating explanations for various ensemble configurations. Our analysis uncovers nuanced complexity patterns influenced by various factors. For example, we demonstrate that under standard complexity assumptions like P$\neq$NP, interpreting ensembles remains intractable even when base models are of constant size. Surprisingly, the complexity changes drastically with the number of base models: small ensembles of decision trees are efficiently interpretable, whereas ensembles of linear models remain intractable, even with a constant number of models. We believe that our findings provide a more robust foundation for understanding the interpretability of ensembles, emphasizing the benefits of examining it through a computational complexity lens.
Paperid:909
Authors:Sumeet Batra · Gaurav Sukhatme
Abstract:
Generalizing vision-based reinforcement learning (RL) agents to novel environments remains a difficult and open challenge. Current trends are to collect large-scale datasets or use data augmentation techniques to prevent overfitting and improve downstream generalization. However, the computational and data collection costs increase exponentially with the number of task variations and can destabilize the already difficult task of training RL agents. In this work, we take inspiration from recent advances in computational neuroscience and propose a model, Associative Latent DisentAnglement (ALDA), that builds on standard off-policy RL towards zero-shot generalization. Specifically, we revisit the role of latent disentanglement in RL and show how combining it with a model of associative memory achieves zero-shot generalization on difficult task variations without relying on data augmentation. Finally, we formally show that data augmentation techniques are a form of weak disentanglement and discuss the implications of this insight.
Paperid:910
Authors:Zuchao Li · Yonghua Hei · Qiwei Li · Lefei Zhang · Ping Wang · hai zhao · qi baoyuan · Liu Guoming
Abstract:
Large Language Models excel in generation tasks. Recent studies show that causal attention mechanisms limit their performance on embedding tasks and that bidirectional modeling may enhance it. However, merely fine-tuning unidirectional models bidirectionally leads to a sharp decline in generative performance. To investigate the reasons for this performance drop, we treat attention weights as dependence and observe that after bidirectional fine-tuning, the model's dependence on subsequent tokens increases, impairing its unidirectional generative capabilities. We tested various modules of the Transformer architecture, and the results show that the FNN layer is least affected by subsequent dependence under bidirectional fine-tuning. Based on this finding, we propose UBMoE-LLM, a novel uni-bi-directional mixture-of-experts Large Language Model, to solve this problem. It combines the original FNN layer of the unidirectional model with a bidirectional FNN layer trained by unsupervised contrastive learning through a mixture-of-experts design, which improves the embedding performance of instruction-tuned models while maintaining robust generative performance. We conducted experiments on diverse datasets and on models of different scales and types. The experimental results verify the soundness of our proposed attention dependence indicator and demonstrate UBMoE-LLM's excellent generative ability and resistance to hallucination.
Paperid:911
Authors:Yoann Boget
Abstract:
Discrete Diffusion and Flow Matching models have significantly advanced generative modeling for discrete structures, including graphs. However, the time dependencies in the noising process of these models lead to error accumulation and propagation during the backward process. This issue, particularly pronounced in mask diffusion, is a known limitation in sequence modeling and, as we demonstrate, also impacts discrete diffusion models for graphs. To address this problem, we propose a novel framework called Iterative Denoising, which simplifies discrete diffusion and circumvents the issue by assuming conditional independence across time. Additionally, we enhance our model by incorporating a Critic, which during generation selectively retains or corrupts elements in an instance based on their likelihood under the data distribution. Our empirical evaluations demonstrate that the proposed method significantly outperforms existing discrete diffusion baselines in graph generation tasks.
Paperid:912
Authors:Sheyang Tang · xiaoyu xu · Jiayan Qiu · Zhou Wang
Abstract:
Implicit Neural Representations (INRs) represent data as continuous functions using the parameters of a neural network, where data information is encoded in the parameter space. Therefore, modeling the distribution of such parameters is crucial for building generalizable INRs. Existing approaches learn a joint distribution of these parameters via a latent vector to generate new data, but such a flat latent often fails to capture the inherent hierarchical structure of the parameter space, leading to entangled data semantics and limited control over the generation process. Here, we propose a Controllable Hierarchical Implicit Neural Representation (CHINR) framework, which explicitly models conditional dependencies across layers in the parameter space. Our method consists of two stages: In Stage-1, we construct a Layers-of-Experts (LoE) network, where each layer modulates distinct semantics through a unique latent vector, enabling disentangled and expressive representations. In Stage-2, we introduce a Hierarchical Conditional Diffusion Model (HCDM) to capture conditional dependencies across layers, allowing for controllable and hierarchical data generation at various semantic granularities. Extensive experiments across different modalities demonstrate that CHINR improves generalizability and offers flexible hierarchical control over the generated content.
Paperid:913
Authors:Ziqiao Wang · Cheng Long · Yongyi Mao
Abstract:
Federated Learning (FL) is a widely adopted privacy-preserving distributed learning framework, yet its generalization performance remains less explored compared to centralized learning. In FL, the generalization error consists of two components: the out-of-sample gap, which measures the gap between the empirical and true risk for participating clients, and the participation gap, which quantifies the risk difference between participating and non-participating clients. In this work, we apply an information-theoretic analysis via the conditional mutual information (CMI) framework to study FL's two-level generalization. Beyond the traditional supersample-based CMI framework, we introduce a superclient construction to accommodate the two-level generalization setting in FL. We derive multiple CMI-based bounds, including hypothesis-based CMI bounds, illustrating how privacy constraints in FL can imply generalization guarantees. Furthermore, we propose fast-rate evaluated CMI bounds that recover the best-known convergence rate for two-level FL generalization in the small empirical risk regime. For specific FL model aggregation strategies and structured loss functions, we refine our bounds to achieve improved convergence rates with respect to the number of participating clients. Empirical evaluations confirm that our evaluated CMI bounds are non-vacuous and accurately capture the generalization behavior of FL algorithms.
Paperid:914
Authors:Rickard K.A. Karlsson · Jesse H. Krijthe
Abstract:
A major challenge in estimating treatment effects in observational studies is the reliance on untestable conditions such as the assumption of no unmeasured confounding. In this work, we propose an algorithm that can falsify the assumption of no unmeasured confounding in a setting with observational data from multiple heterogeneous sources, which we refer to as environments. Our proposed falsification strategy leverages a key observation that unmeasured confounding can cause observed causal mechanisms to appear dependent. Building on this observation, we develop a novel two-stage procedure that detects these dependencies with high statistical power while controlling false positives. The algorithm does not require access to randomized data and, in contrast to other falsification approaches, functions even under transportability violations when the environment has a direct effect on the outcome of interest. To showcase the practical relevance of our approach, we show that our method is able to efficiently detect confounding on both simulated and real-world data.
Paperid:915
Authors:Ronak Mehta · Zaid Harchaoui
Abstract:
A clever, modern approach to machine learning and AI takes a peculiar yet effective learning path involving two stages: from an upstream pretraining task using unlabeled multimodal data (foundation modeling), to a downstream task using prompting in natural language as a replacement for training data (zero-shot prediction). We cast this approach in a theoretical framework that allows us to identify the key quantities driving both its success and its pitfalls. We obtain risk bounds identifying the residual dependence lost between modalities, the number and nature of prompts necessary for zero-shot prediction, and the discrepancy of this approach with classical single-stage machine learning.
Paperid:916
Authors:Mozhi Zhang · Howe Tissue · Lu Wang · Xipeng Qiu
Abstract:
In this paper, we introduce *Domain2Vec*, a novel approach that decomposes any dataset into a linear combination of several "Meta-Domains", a new concept designed to capture the key underlying features of datasets. *Domain2Vec* maintains a vocabulary of Meta-Domains and uses a Meta-Domain Classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary. These domain vectors enable the identification of optimal data mixture ratios for LM pretraining in a training-free manner under the ***D**istribution **A**lignment **A**ssumption* (DA$^{2}$), which suggests that when the data distributions of the training set and the validation set are more aligned, a lower validation loss is achieved. Moreover, previous work can use *Domain2Vec* to model the relationship between domain vectors and LM performance, greatly enhancing the scalability of previous methods without retraining as new datasets are introduced. Extensive experiments demonstrate that *Domain2Vec* finds data mixture ratios that enhance downstream task performance with minimal computational overhead. Specifically, *Domain2Vec* achieves the same validation loss on Pile-CC using only $51.5$\% of the compute required when training on the original mixture of The Pile dataset. Under an equivalent compute budget, *Domain2Vec* improves downstream performance by an average of $2.83$\%.
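The Distribution Alignment Assumption lends itself to a small illustration: if each candidate dataset has a domain vector (a distribution over Meta-Domains), mixture weights can be chosen so that the weighted combination matches the validation set's vector. The sketch below is a hypothetical toy version (random search over the simplex, L1 distance); the paper's actual procedure is not specified in the abstract.

```python
import numpy as np

# Toy sketch of the Distribution Alignment idea (DA^2): choose mixture weights w
# over candidate datasets so that the mixed domain vector is close to the
# validation set's domain vector. Random simplex search and L1 distance are
# illustrative choices here, not Domain2Vec's actual algorithm.
def mixture_by_alignment(dataset_vecs, val_vec, n_trials=200, seed=0):
    rng = np.random.default_rng(seed)
    best_w, best_d = None, np.inf
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(len(dataset_vecs)))  # random point on the simplex
        mixed = w @ dataset_vecs                       # linear combination of domain vectors
        d = np.abs(mixed - val_vec).sum()              # L1 misalignment with validation
        if d < best_d:
            best_w, best_d = w, d
    return best_w, best_d

# Two hypothetical datasets concentrated on different Meta-Domains.
vecs = np.array([[0.9, 0.1], [0.1, 0.9]])
w, d = mixture_by_alignment(vecs, np.array([0.3, 0.7]))
```

Under the assumption above, the returned weights approximately reproduce the validation distribution; a real system would replace the random search with a proper optimizer.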
Paperid:917
Authors:Yu Chen · Nathalia Céspedes · Payam Barnaghi
Abstract:
Time-series forecasting is crucial across various domains, including finance, healthcare, and energy. Transformer models, originally developed for natural language processing, have demonstrated significant potential in addressing challenges associated with time-series data. These models utilize different tokenization strategies, point-wise, patch-wise, and variate-wise, to represent time-series data, each resulting in a different scope of attention. Despite the emergence of sophisticated architectures, simpler transformers consistently outperform their more complex counterparts in widely used benchmarks. This study examines why point-wise transformers are generally less effective, why intra- and inter-variate attention mechanisms yield similar outcomes, and which architectural components drive the success of simpler models. By analyzing mutual information and evaluating models on synthetic datasets, we demonstrate that intra-variate dependencies are the primary contributors to prediction performance on benchmarks, while inter-variate dependencies have a minor impact. Additionally, techniques such as Z-score normalization and skip connections are also crucial. However, these results are largely influenced by the self-dependent and stationary nature of benchmark datasets. By validating our findings on real-world healthcare data, we provide insights for designing more effective transformers for practical applications.
Paperid:918
Authors:Boyan Li · Jiayi Zhang · Ju Fan · Yanwei XU · Chong Chen · Nan Tang · Yuyu Luo
Abstract:
Text-to-SQL, which enables natural language interaction with databases, serves as a pivotal method across diverse industries. With new, more powerful large language models (LLMs) emerging every few months, fine-tuning has become incredibly costly, labor-intensive, and error-prone. As an alternative, zero-shot Text-to-SQL, which leverages the growing knowledge and reasoning capabilities encoded in LLMs without task-specific fine-tuning, presents a promising and more challenging direction. To address this challenge, we propose Alpha-SQL, a novel approach that leverages a Monte Carlo Tree Search (MCTS) framework to iteratively infer SQL construction actions based on partial SQL query states. To enhance the framework’s reasoning capabilities, we introduce LLM-as-Action-Model to dynamically generate SQL construction actions during the MCTS process, steering the search toward more promising SQL queries. Moreover, Alpha-SQL employs a self-supervised reward function to evaluate the quality of candidate SQL queries, ensuring more accurate and efficient query generation. Experimental results show that Alpha-SQL achieves 69.7% execution accuracy on the BIRD development set, using a 32B open-source LLM without fine-tuning. Alpha-SQL outperforms the best previous zero-shot approach based on GPT-4o by 2.5% on the BIRD development set.
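For readers unfamiliar with MCTS, the selection step typically uses the UCB1 rule to trade off exploiting high-reward actions against exploring rarely tried ones. The sketch below is a textbook UCT selection function for context, not Alpha-SQL's actual action model or reward function; the action names are illustrative.

```python
import math

# Generic UCT (UCB1) selection used in the MCTS selection phase: prefer actions
# with high average reward, plus an exploration bonus for rarely visited ones.
# A textbook sketch, not Alpha-SQL's implementation.
def uct_select(children, c=1.4):
    total = sum(ch["visits"] for ch in children)
    def score(ch):
        if ch["visits"] == 0:
            return float("inf")                # always expand unvisited actions first
        exploit = ch["value"] / ch["visits"]   # mean reward of this construction action
        explore = c * math.sqrt(math.log(total) / ch["visits"])
        return exploit + explore
    return max(children, key=score)

# Hypothetical partial-SQL construction actions at one tree node.
children = [
    {"action": "ADD SELECT clause", "visits": 10, "value": 7.0},
    {"action": "ADD WHERE clause",  "visits": 2,  "value": 1.9},
    {"action": "ADD JOIN clause",   "visits": 0,  "value": 0.0},
]
best = uct_select(children)  # the unvisited action is chosen first
```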
Paperid:919
Authors:Jessica Dai · Nika Haghtalab · Eric Zhao
Abstract:
A canonical desideratum for prediction problems is that performance guarantees should hold not just on average over the population, but also for meaningful subpopulations within the overall population. But what constitutes a meaningful subpopulation? In this work, we take the perspective that relevant subpopulations should be defined with respect to the clusters that naturally emerge from the distribution of individuals for which predictions are being made. In this view, a population refers to a mixture model whose components constitute the relevant subpopulations. We suggest two formalisms for capturing per-subgroup guarantees: first, by attributing each individual to the component from which they were most likely drawn, given their features; and second, by attributing each individual to all components in proportion to their relative likelihood of having been drawn from each component. Using online calibration as a case study, we study a multi-objective algorithm that provides guarantees for each of these formalisms by handling all plausible underlying subpopulation structures simultaneously, and achieves an $O(T^{1/2})$ rate even when the subpopulations are not well-separated. In comparison, the more natural cluster-then-predict approach, which first recovers the structure of the subpopulations and then makes predictions, suffers from an $O(T^{2/3})$ rate and requires the subpopulations to be separable. Along the way, we prove that providing per-subgroup calibration guarantees for underlying clusters can be easier than learning the clusters: separation between median subgroup features is required for the latter but not the former.
Paperid:920
Authors:Shiyuan Feng · Ying Feng · George Li · Zhao Song · David Woodruff · Lichen Zhang
Abstract:
Recently, differential privacy has been used for a number of streaming, data structure, and dynamic graph problems as a means of hiding the internal randomness of the data structure, so that multiple possibly adaptive queries can be made without sacrificing the correctness of the responses. Although these works use differential privacy to show that for some problems it is possible to tolerate $T$ queries using $\widetilde{O}(\sqrt{T})$ copies of a data structure, such results only apply to {\it numerical estimation problems}, and only return the {\it cost} of an optimization problem rather than the solution itself. In this paper we investigate the use of differential privacy for adaptive queries to {\it search} problems, which are significantly more challenging since the responses to queries can reveal much more about the internal randomness than a single numerical query. We focus on two classical search problems: nearest neighbor queries and regression with arbitrary turnstile updates. We identify key parameters of these problems, such as the number of $c$-approximate near neighbors and the matrix condition number, and use different differential privacy techniques to design algorithms returning the solution point or solution vector with memory and time depending on these parameters. We give algorithms for each of these problems that achieve similar tradeoffs.
Paperid:921
Authors:Gauthier Thurin · Kimia Nadjahi · Claire Boyer
Abstract:
Conformal Prediction (CP) is a principled framework for quantifying uncertainty in black-box learning models by constructing prediction sets with finite-sample coverage guarantees. Traditional approaches rely on scalar nonconformity scores, which fail to fully exploit the geometric structure of multivariate outputs, such as in multi-output regression or multiclass classification. Recent methods addressing this limitation impose predefined convex shapes for the prediction sets, potentially misaligning with the intrinsic data geometry. We introduce a novel CP procedure handling multivariate score functions through the lens of optimal transport. Specifically, we leverage Monge-Kantorovich vector ranks and quantiles to construct prediction regions with flexible, potentially non-convex shapes, better suited to the complex uncertainty patterns encountered in multivariate learning tasks. We prove that our approach ensures finite-sample, distribution-free coverage properties, similar to typical CP methods. We then adapt our method for multi-output regression and multiclass classification, and also propose simple adjustments to generate adaptive prediction regions with asymptotic conditional coverage guarantees. Finally, we evaluate our method on practical regression and classification problems, illustrating its advantages in terms of (conditional) coverage and efficiency.
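For context, the scalar-score baseline that this work generalizes is standard split conformal prediction: compute nonconformity scores on a held-out calibration set and take a finite-sample-corrected quantile as the prediction-set radius. A minimal sketch of that classical baseline (the paper's multivariate optimal-transport construction is a different, richer object):

```python
import numpy as np

# Standard split conformal prediction with a scalar nonconformity score
# (here, the absolute residual). This is the classical baseline the abstract
# refers to, not the paper's optimal-transport method.
def conformal_radius(cal_scores, alpha=0.1):
    n = len(cal_scores)
    # Finite-sample correction: the ceil((n+1)(1-alpha))/n empirical quantile.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(cal_scores, level, method="higher")

rng = np.random.default_rng(0)
cal = np.abs(rng.normal(size=1000))   # |y - y_hat| on held-out calibration data
test = np.abs(rng.normal(size=1000))  # scores on exchangeable test points
q_hat = conformal_radius(cal, alpha=0.1)
coverage = np.mean(test <= q_hat)     # empirical coverage, close to 1 - alpha
```

The marginal guarantee says prediction intervals of radius `q_hat` cover new exchangeable points with probability at least 1 - alpha; the abstract's contribution replaces the scalar quantile with Monge-Kantorovich vector quantiles.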
Paperid:922
Authors:Zhenting Wang · Chen Chen · Vikash Sehwag · Minzhou Pan · Lingjuan Lyu
Abstract:
The popularity of visual generative AI models like DALL-E 3, Stable Diffusion XL, Stable Video Diffusion, and Sora has been increasing. Through extensive evaluation, we discovered that state-of-the-art visual generative models can generate content bearing a striking resemblance to characters protected by intellectual property rights held by major entertainment companies (such as Sony, Marvel, and Nintendo), which raises potential legal concerns. This happens when the input prompt contains the character's name or even just descriptive details about their characteristics. To mitigate such IP infringement problems, we propose a defense method. In detail, we develop a revised generation paradigm that can identify potentially infringing generated content and mitigate such infringement by employing guidance techniques throughout the diffusion process, without retraining or fine-tuning the pretrained models. Experiments on well-known character IPs like Spider-Man, Iron Man, and Superman demonstrate the effectiveness of the proposed defense method.
Paperid:923
Authors:Martin Andrews · Sam Witteveen
Abstract:
Cryptic crossword clues are challenging language tasks for which new test sets are released daily by major newspapers on a global basis. Each cryptic clue contains both the definition of the answer to be placed in the crossword grid (in common with regular crosswords), and ‘wordplay’ that proves that the answer is correct (i.e. a human solver can be confident that an answer is correct without needing crossing words as confirmation). This work describes an LLM-based reasoning system built from open-licensed components that solves cryptic clues by (i) hypothesising answers; (ii) proposing wordplay explanations; and (iii) using a verifier system that operates on codified reasoning steps. Overall, this system establishes a new state-of-the-art performance on the challenging Cryptonite dataset of clues from The Times and The Telegraph newspapers in the UK. Because each proved solution is expressed in Python, interpretable wordplay reasoning for proven answers is available for inspection.
Paperid:924
Authors:Anqi Tang · Youming Chen · Shuchen Xue · Zhaoqiang Liu
Abstract:
Diffusion models (DMs) have demonstrated a remarkable ability to generate diverse and high-quality images by efficiently modeling complex data distributions. They have also been explored as powerful generative priors for signal recovery, resulting in a substantial improvement in the quality of reconstructed signals. However, existing research on signal recovery with diffusion models either focuses on specific reconstruction problems or is unable to handle nonlinear measurement models with discontinuous or unknown link functions. In this work, we focus on using DMs to achieve accurate recovery from semi-parametric single index models, which encompass a variety of popular nonlinear models that may have {\em discontinuous} and {\em unknown} link functions. We propose an efficient reconstruction method that only requires one round of unconditional sampling and (partial) inversion of DMs. Theoretical analysis on the effectiveness of the proposed method has been established under appropriate conditions. We perform numerical experiments on image datasets for different nonlinear measurement models. We observe that, compared to competing methods, our approach can yield more accurate reconstructions while utilizing significantly fewer neural function evaluations.
Paperid:925
Authors:Zhuoran Zhang · Yongxiang Li · Zijian Kan · Keyuan Cheng · Lijie Hu · Di Wang
Abstract:
The locate-then-edit paradigm has shown significant promise for knowledge editing (KE) in Large Language Models (LLMs). While previous methods perform well on single-hop fact recall tasks, they consistently struggle with multi-hop factual recall tasks involving newly edited knowledge. In this paper, leveraging tools in mechanistic interpretability, we first identify that in multi-hop tasks, LLMs tend to retrieve knowledge with implicit subject information from deeper MLP layers, unlike single-hop tasks, which rely on shallow layers. This distinction explains the poor performance of current methods on multi-hop queries, as they primarily focus on editing shallow layers with single-hop edit prompts, leaving deeper layers unchanged. To address this, we propose IFMET, a novel locate-then-edit KE approach designed to edit both shallow and deep MLP layers. Beyond single-hop editing prompts, IFMET further incorporates multi-hop editing prompts to locate and modify knowledge across different stages of reasoning. Experimental results demonstrate that IFMET significantly improves performance on multi-hop factual recall tasks, overcoming the limitations of previous locate-then-edit methods.
Paperid:926
Authors:Undral Byambadalai · Tomu Hirata · Tatsushi Oka · Shota Yasui
Abstract:
This paper focuses on the estimation of distributional treatment effects in randomized experiments that use covariate-adaptive randomization (CAR). These include designs such as Efron's biased-coin design and stratified block randomization, where participants are first grouped into strata based on baseline covariates and assigned treatments within each stratum to ensure balance across groups. In practice, datasets often contain additional covariates beyond the strata indicators. We propose a flexible distribution regression framework that leverages off-the-shelf machine learning methods to incorporate these additional covariates, enhancing the precision of distributional treatment effect estimates. We establish the asymptotic distribution of the proposed estimator and introduce a valid inference procedure. Furthermore, we derive the semiparametric efficiency bound for distributional treatment effects under CAR and demonstrate that our regression-adjusted estimator attains this bound. Simulation studies and analyses of real-world datasets highlight the practical advantages of our method.
Paperid:927
Authors:Shiqi Chen · Tongyao Zhu · Ruochen Zhou · Jinghan Zhang · Siyang Gao · Juan Carlos Niebles · Mor Geva · Junxian He · Jiajun Wu · Manling Li
Abstract:
Large Vision Language Models (VLMs) have long struggled with spatial reasoning tasks. Surprisingly, even simple spatial reasoning tasks, such as recognizing “under” or “behind” relationships between only two objects, pose significant challenges for current VLMs. We believe it is crucial to use the lens of mechanistic interpretability, opening up the model and diving into its internal states to examine the interactions between image and text tokens during spatial reasoning. Our analysis of attention behaviors reveals significant differences in how VLMs allocate attention to image versus text. By tracing the areas of images that receive the highest attention scores throughout intermediate layers, we observe a notable pattern: errors often coincide with attention being misdirected towards irrelevant objects within the image. Moreover, such attention patterns exhibit substantial differences between familiar (e.g., “on the left side of”) and unfamiliar (e.g., “in front of”) spatial relationships. Motivated by these findings, we propose ADAPTVIS, which uses inference-time confidence scores to sharpen the attention on highly relevant regions when the model exhibits high confidence, while smoothing and broadening the attention window to consider a wider context when confidence is lower. This training-free decoding method shows significant improvement (e.g., up to a 50-point absolute improvement) on spatial reasoning benchmarks such as WhatsUp and VSR with negligible additional cost.
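The sharpening/smoothing mechanism can be pictured as confidence-conditioned temperature scaling of the attention softmax. The sketch below illustrates that general idea only; the threshold and temperature values are made-up, not ADAPTVIS's actual schedule.

```python
import numpy as np

# Confidence-conditioned attention temperature: a low temperature sharpens the
# softmax (concentrating attention on top-scoring regions) when the model is
# confident; a high temperature flattens it (widening the attended context)
# when it is not. All constants here are illustrative assumptions.
def scaled_attention(scores, confidence, threshold=0.5, t_sharp=0.5, t_smooth=2.0):
    t = t_sharp if confidence >= threshold else t_smooth
    z = scores / t
    z = z - z.max()          # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()       # normalized attention weights

scores = np.array([2.0, 1.0, 0.5])                 # raw attention logits for 3 regions
sharp = scaled_attention(scores, confidence=0.9)   # confident -> peaked distribution
smooth = scaled_attention(scores, confidence=0.1)  # unsure -> spread-out distribution
```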
Paperid:928
Authors:Tung Luu Minh · Younghwan Lee · Donghoon Lee · Sunho Kim · MinJun Kim · Chang Yoo
Abstract:
Designing effective reward functions remains a fundamental challenge in reinforcement learning (RL), as it often requires extensive human effort and domain expertise. While RL from human feedback has been successful in aligning agents with human intent, acquiring high-quality feedback is costly and labor-intensive, limiting its scalability. Recent advancements in foundation models present a promising alternative: leveraging AI-generated feedback to reduce reliance on human supervision in reward learning. Building on this paradigm, we introduce ERL-VLM, an enhanced rating-based RL method that effectively learns reward functions from AI feedback. Unlike prior methods that rely on pairwise comparisons, ERL-VLM queries large vision-language models (VLMs) for absolute ratings of individual trajectories, enabling more expressive feedback and improved sample efficiency. Additionally, we propose key enhancements to rating-based RL, addressing instability issues caused by data imbalance and noisy labels. Through extensive experiments across both low-level and high-level control tasks, we demonstrate that ERL-VLM significantly outperforms existing VLM-based reward generation methods. Our results demonstrate the potential of AI feedback for scaling RL with minimal human intervention, paving the way for more autonomous and efficient reward learning.
Paperid:929
Authors:Atefeh Sohrabizadeh · Jialin Song · Mingjie Liu · Rajarshi Roy · Chankyu Lee · Jonathan Raiman · Bryan Catanzaro
Abstract:
Large Language Models (LLMs) have demonstrated significant potential in code generation by following natural language instructions. Unfortunately, crucial real-world software engineering tasks, such as debugging or repository-level feature implementation, involve processing extensive contexts beyond current LLM context sizes and performing complex logic reasoning that is brittle under standard autoregressive decoding. Enhancing LLMs' performance in these scenarios requires careful consideration of the contextual information provided to the model, optimizing how the model leverages this information, and identifying tools that enable more effective navigation of the development environment. To address these challenges, we introduce CORTEXA, an agentic system designed to enhance LLMs' ability to navigate and reason in complex software engineering contexts. Specifically, we develop a novel code embedding model that retrieves the most relevant files with greater precision, along with a localization agent that refines the granularity of the retrieval process. Additionally, we demonstrate that providing diverse contextual information and utilizing different prompt formats enables the model to identify and resolve issues more efficiently. We evaluate CORTEXA using the SWE-bench dataset \cite{swebench}, a benchmark derived from real-world GitHub issues. Compared to the widely used Agentless framework \cite{xia2024agentless}, CORTEXA achieves a higher issue resolution rate at a lower cost, highlighting its practical impact in addressing real-world software engineering challenges.
Paperid:930
Authors:Mohammad Hosseini · Maryam Shanechi
Abstract:
High-dimensional imaging of neural activity, such as widefield calcium and functional ultrasound imaging, provides a rich source of information for understanding the relationship between brain activity and behavior. Accurately modeling neural dynamics in these modalities is crucial for understanding this relationship but is hindered by the high dimensionality, complex spatiotemporal dependencies, and prevalent behaviorally irrelevant dynamics of these modalities. Existing dynamical models often employ preprocessing steps to obtain low-dimensional representations from neural image modalities. However, this process can discard behaviorally relevant information and miss spatiotemporal structure. We propose SBIND, a novel data-driven deep learning framework to model spatiotemporal dependencies in neural images and disentangle their behaviorally relevant dynamics from other neural dynamics. We validate SBIND on widefield imaging datasets, and show its extension to functional ultrasound imaging, a recent modality whose dynamical modeling has largely remained unexplored. We find that our model effectively identifies both local and long-range spatial dependencies across the brain while also dissociating behaviorally relevant neural dynamics. In doing so, SBIND outperforms existing models in neural-behavioral prediction. Overall, SBIND provides a versatile tool for investigating the neural mechanisms underlying behavior using imaging modalities.
Paperid:931
Authors:Kaito Ariu · Alexandre Proutiere · Se-Young Yun
Abstract:
In this paper, we investigate the problem of recovering hidden communities in the Labeled Stochastic Block Model (LSBM) with a finite number of clusters whose sizes grow linearly with the total number of nodes. We derive the necessary and sufficient conditions under which the expected number of misclassified nodes is less than $ s $, for any number $ s = o(n) $. To achieve this, we propose IAC (Instance-Adaptive Clustering), the first algorithm whose performance matches the instance-specific lower bounds both in expectation and with high probability. IAC is a novel two-phase algorithm that consists of a one-shot spectral clustering step followed by iterative likelihood-based cluster assignment improvements. This approach is based on the instance-specific lower bound and notably does not require any knowledge of the model parameters, including the number of clusters. By performing the spectral clustering only once, IAC maintains an overall computational complexity of $ \mathcal{O}(n\, \text{polylog}(n)) $, making it scalable and practical for large-scale problems.
Paperid:932
Authors:Jiacheng Du · Zhibo Wang · Jie Zhang · Xiaoyi Pang · Jiahui Hu · Kui Ren
Abstract:
Language Models (LMs) are prone to ''memorizing'' training data, including substantial sensitive user information. To mitigate privacy risks and safeguard the right to be forgotten, machine unlearning has emerged as a promising approach for enabling LMs to efficiently ''forget'' specific texts. However, despite the good intentions, is textual unlearning really as effective and reliable as expected? To address the concern, we first propose Unlearning Likelihood Ratio Attack+ (ULiRA+), a rigorous textual unlearning auditing method, and find that unlearned texts can still be detected with very high confidence after unlearning. Further, we conduct an in-depth investigation on the privacy risks of textual unlearning mechanisms in deployment and present the Textual Unlearning Leakage Attack (TULA), along with its variants in both black- and white-box scenarios. We show that textual unlearning mechanisms could instead reveal more about the unlearned texts, exposing them to significant membership inference and data reconstruction risks. Our findings highlight that existing textual unlearning actually gives a false sense of unlearning, underscoring the need for more robust and secure unlearning mechanisms.
Paperid:933
Authors:kejia chen · Jiawen Zhang · Jiacong Hu · Yu Wang · Mingli Song · Zunlei Feng · Jian Lou
Abstract:
Quantized large language models (LLMs) have gained increasing attention and significance for enabling deployment in resource-constrained environments. However, emerging studies on a few calibration-dataset-free quantization methods suggest that quantization may compromise the safety capabilities of LLMs, underscoring the urgent need for systematic safety evaluations and effective mitigation strategies. In this paper, we present comprehensive safety evaluations across various mainstream quantization techniques and diverse calibration datasets, utilizing widely accepted safety benchmarks. To address the identified safety vulnerabilities, we propose a quantization-aware safety patching framework, Q-resafe, to efficiently restore the safety capabilities of quantized LLMs while minimizing any adverse impact on utility. Extensive experiment results demonstrate that Q-resafe successfully re-aligns the safety of quantized LLMs with their pre-quantization counterparts, even under challenging evaluation scenarios. Project page: https://anonymous.4open.science/r/Qresafe-code.
Paperid:934
Authors:Guan Huang · Tao Shu
Abstract:
Personalized Federated Learning (PFL) has become a promising learning paradigm, enabling the training of high-quality personalized models through multiple communication rounds between clients and a central server. However, directly applying traditional PFL in real-world environments where communication is expensive, limited, or infeasible is challenging, as seen in Low Earth Orbit (LEO) satellite constellations, which face severe communication constraints due to their high mobility and limited contact windows. To address these issues, we introduce Federated Oriented Learning (FOL), a novel multi-stage one-shot PFL algorithm designed to enhance local model performance by leveraging neighboring models within stringent communication constraints. FOL comprises model pretraining, model collection, fine-tuning, pruning, post fine-tuning, ensemble refinement, and knowledge distillation stages. We establish two theoretical guarantees, on the empirical risk discrepancy between student and teacher models and on the convergence of the distillation process. Extensive experiments on the Wildfire, Hurricane, CIFAR-10, and CIFAR-100 datasets demonstrate that FOL consistently outperforms state-of-the-art OFL methods, e.g., achieving accuracy improvements of up to 39.24\% on the Wildfire dataset.
Paperid:935
Authors:Ning Shang · Li Lyna Zhang · Siyuan Wang · Gaokai Zhang · Gilsinia Lopez · Fan Yang · Weizhu Chen · Mao Yang
Abstract:
LongRoPE2 is a novel approach that extends the effective context window of pretrained large language models (LLMs) to the target length, while preserving the performance on the original shorter context window. This is achieved by three contributions: (1) a hypothesis that insufficient training in higher RoPE dimensions contributes to the persistent out-of-distribution (OOD) issues observed in existing methods; (2) an effective RoPE rescaling algorithm that adopts evolutionary search guided by "needle-driven" perplexity to address the insufficient training problem; (3) a mixed context window training approach that fine-tunes model weights to adopt rescaled RoPE for long-context sequences while preserving the short-context performance with the original RoPE. Extensive experiments on LLaMA3-8B and Phi3-mini-3.8B across various benchmarks validate the hypothesis and demonstrate the effectiveness of LongRoPE2. Remarkably, LongRoPE2 extends LLaMA3-8B to achieve a 128K effective context length while retaining over 98.5% of short-context performance, using only 10B tokens -- 80x fewer than Meta's approach, which fails to reach the target effective context length.
Paperid:936
Authors:Fan Li · Xuan Wang · Min Qi · Zhaoxiang Zhang · yuelei xu
Abstract:
Domain Generalized Semantic Segmentation (DGSS) trains a model on a labeled source domain to generalize to unseen target domains with consistent contextual distribution and varying visual appearance. Most existing methods rely on domain randomization or data generation but struggle to capture the underlying scene distribution, resulting in the loss of useful semantic information. Inspired by the diffusion model's capability to generate diverse variations within a given scene context, we consider harnessing its rich prior knowledge of scene distribution to tackle the challenging DGSS task. In this paper, we propose a novel agent \textbf{Query}-driven learning framework based on \textbf{Diff}usion model guidance for DGSS, named QueryDiff. Our recipe comprises three key ingredients: (1) generating agent queries from segmentation features to aggregate semantic information about instances within the scene; (2) learning the inherent semantic distribution of the scene through agent queries guided by diffusion features; (3) refining segmentation features using optimized agent queries for robust mask predictions. Extensive experiments across various settings demonstrate that our method significantly outperforms previous state-of-the-art methods. Notably, it enhances the model's ability to generalize effectively to extreme domains, such as cubist art styles.
Paperid:937
Authors:Keyu Duan · Yiran Zhao · Zhili Feng · Jinjie Ni · Tianyu Pang · Qian Liu · Tianle Cai · Longxu Dou · Kenji Kawaguchi · Anirudh Goyal · Zico Kolter · Michael Shieh
Abstract:
Large Language Models (LLMs) have been observed to process non-human-readable text sequences, such as jailbreak prompts, often viewed as a bug for aligned LLMs. In this work, we present a systematic investigation challenging this perception, demonstrating that unnatural languages - strings that appear incomprehensible to humans but maintain semantic meanings for LLMs - contain latent features usable by models. Notably, unnatural languages possess latent features that can be generalized across different models and tasks during inference. Furthermore, models fine-tuned on unnatural versions of instruction datasets perform on par with those trained on natural language, achieving win rates of 49.71 on Length-controlled AlpacaEval 2.0 on average across various base models. In addition, through comprehensive analysis, we demonstrate that LLMs process unnatural languages by filtering noise and inferring contextual meaning from filtered words.
Paperid:938
Authors:Bairu Hou · Qibin Chen · Jianyu Wang · Guoli Yin · Chong Wang · Nan Du · Ruoming Pang · Shiyu Chang · Tao Lei
Abstract:
With the rapid scaling of large language models (LLMs), structured pruning has become a widely used technique to learn efficient, smaller models from larger ones, delivering superior performance compared to training similarly sized models from scratch. In this paper, we move beyond the traditional static pruning approach of determining a fixed pruning mask for a model, and propose a dynamic approach to structured pruning. In our method, the pruning mask is input-dependent and adapts dynamically based on the information described in a user instruction. Our approach, termed "instruction-following pruning", introduces a sparse mask predictor that takes the user instruction as input and dynamically selects the most relevant model parameters for the given task. To identify and activate effective parameters, we jointly optimize the sparse mask predictor and the LLM, leveraging both instruction-following data and the pre-training corpus. Experimental results demonstrate the effectiveness of our approach on a wide range of evaluation benchmarks. For example, our 3B activated model improves over the 3B dense model by 5-8 points of absolute margin on domains such as math and coding, and rivals the performance of a 9B model.
Paperid:939
Authors:Yuan Deng · Amin Karbasi · Vahab Mirrokni · Renato Leme · Grigoris Velegkas · Song Zuo
Abstract:
We study the problem of procurement auctions, in which an auctioneer seeks to acquire services from a group of strategic sellers with private costs. The quality of the services is measured through some submodular function that is known to the auctioneer. Our goal is to design computationally efficient procurement auctions that (approximately) maximize the difference between the quality of the acquired services and the total cost of the sellers, in a way that is incentive compatible (IC) and individually rational (IR) for the sellers, and generates non-negative surplus (NAS) for the auctioneer. Our contribution is twofold: \textbf{i)} we provide an improved analysis of existing algorithms for non-positive submodular function maximization, and \textbf{ii)} we design computationally efficient frameworks that transform submodular function optimization algorithms into mechanisms that are IC and IR for the sellers, NAS for the auctioneer, and approximation-preserving. Our frameworks are general and work both in the offline setting, where the auctioneer can observe the bids and the services of all the sellers simultaneously, and in the online setting, where the sellers arrive in an adversarial order and the auctioneer has to make an irrevocable decision whether to purchase their service or not. We further investigate whether it is possible to convert state-of-the-art submodular optimization algorithms into descending auctions. We focus on the adversarial setting, meaning that the schedule of the descending prices is determined by an adversary. We show that a submodular optimization algorithm satisfying a bi-criteria $(1/2,1)$-approximation in welfare can be effectively converted to a descending auction in this setting. We further establish a connection between descending auctions and online submodular optimization.
Finally, we demonstrate the practical applications of our frameworks by instantiating them with different state-of-the-art submodular optimization algorithms and comparing their welfare performance through empirical experiments on publicly available datasets that consist of thousands of sellers.
Paperid:940
Authors:Marcin Sendera · Łukasz Struski · Kamil Książek · Kryspin Musiol · Jacek Tabor · Dawid Rymarczyk
Abstract:
While the capabilities of generative foundational models have advanced rapidly in recent years, methods to prevent harmful and unsafe behaviors remain underdeveloped. Among the pressing challenges in AI safety, machine unlearning (MU) has become increasingly critical to meet upcoming safety regulations. Most existing MU approaches focus on altering the most significant parameters of the model. However, these methods often require fine-tuning substantial portions of the model, resulting in high computational costs and training instabilities, which are typically mitigated by access to the original training dataset. In this work, we address these limitations by leveraging Singular Value Decomposition (SVD) to create a compact, low-dimensional projection that enables the selective forgetting of specific data points. We propose Singular Value Decomposition for Efficient Machine Unlearning (SEMU), a novel approach designed to optimize MU in two key aspects. First, SEMU minimizes the number of model parameters that need to be modified, effectively removing unwanted knowledge while making only minimal changes to the model's weights. Second, SEMU eliminates the dependency on the original training dataset, preserving the model's previously acquired knowledge without additional data requirements. Extensive experiments demonstrate that SEMU achieves competitive performance while significantly improving efficiency in terms of both data usage and the number of modified parameters.
Paperid:941
Authors:Olga Ovcharenko · Philip Toma · Imant Daunhawer · Julia Vogt · Sebastian Schelter · Florian Barkmann · Valentina Boeva
Abstract:
Self-supervised learning (SSL) has emerged as a powerful paradigm for extracting biologically meaningful representations from single-cell data, yet systematic guidelines for single-cell SSL applications remain lacking. Here, we present a comprehensive benchmark, scSSL-Bench, evaluating twelve SSL methods across eight datasets and three critical downstream tasks: batch correction, cell-type annotation, and missing modality prediction. We furthermore systematically assess various data augmentation strategies. Our analysis reveals task-specific trade-offs: the specialized single-cell frameworks, scVI and CLAIRE, excel at uni-modal batch correction, while general SSL methods, VICReg and SimCLR, demonstrate superior performance in the cell-typing task and in multi-modal data integration. Random masking emerges as the most effective augmentation technique across all tasks. Notably, our results indicate the need for a specialized single-cell multi-modal data integration framework. scSSL-Bench provides a standardized evaluation platform and concrete recommendations for applying SSL to single-cell analysis, advancing the convergence of deep learning and single-cell genomics.
Paperid:942
Authors:Hao Zeng · Kangdao Liu · Bingyi Jing · Hongxin Wei
Abstract:
Conformal prediction is a popular framework of uncertainty quantification that constructs prediction sets with coverage guarantees. To uphold the exchangeability assumption, many conformal prediction methods necessitate an additional holdout set for parameter tuning. Yet, the impact of violating this principle on coverage remains underexplored, creating ambiguity in practical applications. In this work, we empirically find that the tuning bias, i.e., the coverage gap introduced by leveraging the same dataset for tuning and calibration, is negligible for simple parameter tuning in many conformal prediction methods. In particular, we observe a scaling law of the tuning bias: the bias increases with parameter space complexity and decreases with calibration set size. Formally, we establish a theoretical framework to quantify the tuning bias and provide rigorous proof for the scaling law by deriving its upper bound. Finally, we discuss how to reduce the tuning bias, guided by the theories we developed.
Paperid:943
Authors:Zixing Song · Ziqiao Meng · Jose Miguel Hernandez-Lobato
Abstract:
Proteolysis-targeting chimeras (PROTACs) are a groundbreaking technology for targeted protein degradation, but designing effective linkers that connect two molecular fragments to form a drug-candidate PROTAC molecule remains a key challenge. While diffusion models show promise in molecular generation, current diffusion models for PROTAC linker design are typically trained on small molecule datasets, introducing distribution mismatches in the chemical space between small molecules and target PROTACs. Direct fine-tuning on limited PROTAC datasets often results in overfitting and poor generalization. In this work, we propose DAD-PROTAC, a domain-adapted diffusion model for PROTAC linker design, which addresses this distribution mismatch in chemical space through density ratio estimation to bridge the gap between small-molecule and PROTAC domains. By decomposing the target score estimator into a pre-trained score function and a lightweight score correction term, DAD-PROTAC achieves efficient fine-tuning without full retraining. Experimental results demonstrate its superior ability to generate high-quality PROTAC linkers.
Paperid:944
Authors:Hao Zhou · Xu Yang · Mingyu Fan · Lu Qi · Xiangtai Li · Ming-Hsuan Yang · Fei Luo
Abstract:
With the growing interest in embodied and spatial intelligence, accurately predicting trajectories in 3D environments has become increasingly critical. However, no datasets have been explicitly designed to study 3D trajectory prediction. To this end, we contribute a 3D motion trajectory (3DMoTraj) dataset collected from unmanned underwater vehicles (UUVs) operating in oceanic environments. Mathematically, trajectory prediction becomes significantly more complex when transitioning from 2D to 3D. To tackle this challenge, we analyze the prediction complexity of 3D trajectories and propose a new method consisting of two key components: decoupled trajectory prediction and correlated trajectory refinement. The former decouples inter-axis correlations, thereby reducing prediction complexity and generating coarse predictions. The latter refines the coarse predictions by modeling their inter-axis correlations. Extensive experiments show that our method significantly improves 3D trajectory prediction accuracy and outperforms state-of-the-art methods. Both the 3DMoTraj dataset and the code of our method will be made publicly available.
Paperid:945
Authors:Nikolaj Jensen · Jamie Vicary
Abstract:
In this work, we propose a new architecture-agnostic method for training idempotent neural networks. An idempotent operator satisfies $f(x) = f(f(x))$, meaning it can be applied iteratively with no effect beyond the first application. Some neural networks used in data transformation tasks, such as image generation and augmentation, can represent non-linear idempotent projections. Using methods from perturbation theory, we derive the recurrence relation ${\mathbf{K}' \leftarrow 3\mathbf{K}^2 - 2\mathbf{K}^3}$ for iteratively projecting a real-valued matrix $\mathbf{K}$ onto the manifold of idempotent matrices. Our analysis shows that for linear, single-layer MLP networks this projection 1) has idempotent fixed points, and 2) is attracting only around idempotent points. We extend the approach to non-linear networks by treating it as a substitute for the gradient of the canonical loss function, yielding an architecture-agnostic training scheme. We provide experimental results for MLP- and CNN-based architectures with significant improvement in idempotent error over the canonical gradient-based approach. Finally, we demonstrate practical applications of the method as we successfully train a generative network using only a simple reconstruction loss paired with our method.
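The recurrence quoted in this abstract can be tried directly on a small matrix; the sketch below (an illustration of the stated formula, not the authors' training code) perturbs an idempotent projector and iterates $\mathbf{K}' \leftarrow 3\mathbf{K}^2 - 2\mathbf{K}^3$ until the idempotency error vanishes:

```python
import numpy as np

# The map 3K^2 - 2K^3 keeps idempotent matrices fixed: if K @ K = K,
# then 3K^2 - 2K^3 = 3K - 2K = K. Near such points it is attracting.
def idempotent_step(K: np.ndarray) -> np.ndarray:
    K2 = K @ K
    return 3.0 * K2 - 2.0 * (K2 @ K)

# Start from a small perturbation of a projector onto the x-axis.
P = np.array([[1.0, 0.0],
              [0.0, 0.0]])
K = P + 0.05 * np.random.default_rng(0).standard_normal((2, 2))

for _ in range(20):
    K = idempotent_step(K)

idempotent_error = np.linalg.norm(K @ K - K)
print(idempotent_error)  # shrinks toward 0: K is numerically idempotent
```

On the eigenvalues of $\mathbf{K}$ the map acts as $\lambda \mapsto 3\lambda^2 - 2\lambda^3$, whose derivative vanishes at the fixed points 0 and 1, so small perturbations are pulled back rapidly.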
Paperid:946
Authors:Harry Amad · Nicolás Astorga · M van der Schaar
Abstract:
Digital twins are virtual representations of real-world systems that simulate dynamics, enabling principled decision-making across potential action policies. In complex settings, the state and action variables, and surrounding data and knowledge relevant to a system can constantly change, requiring digital twins to continuously update with these changes to remain relevant. Current approaches struggle in this regard, as they require well-defined modelling environments, and they cannot adapt to novel variables without re-designs, or incorporate new information without re-training. To address this, we frame digital twinning as an in-context learning problem using large language models, enabling seamless updates to the twin at inference time. We develop CALM-DT, a Context-Adaptive Language Model-based Digital Twin that permits accurate simulation across diverse state-action spaces with in-context learning alone by utilising fine-tuned encoders for sample retrieval. We empirically demonstrate CALM-DT's competitive performance with existing digital twin approaches, and its unique ability to adapt to changes in its modelling environment without parameter updates.
Paperid:947
Authors:Yanbei Liu · Chongxu Wang · Xiao Wang · Lei Geng · Yanwei Pang · Zhitao Xiao
Abstract:
Heterogeneous Graph Neural Networks (HGNNs) have demonstrated excellent capabilities in processing heterogeneous information networks. Self-supervised learning on heterogeneous graphs, especially the contrastive self-supervised strategy, shows great potential when no labels are available. However, this approach requires carefully designed graph augmentation strategies and the selection of positive and negative samples, and determining the exact level of similarity between sample pairs is non-trivial. To solve this problem, we propose HGOT, a novel self-supervised heterogeneous graph neural network with optimal transport, designed to facilitate self-supervised learning for heterogeneous graphs without graph augmentation strategies. Different from traditional contrastive self-supervised learning, HGOT employs an optimal transport self-supervised mechanism. Specifically, we design an aggregating view (central view) to integrate the semantic information contained in the views represented by different meta-paths (branch views). Then, we introduce an optimal transport plan to identify the transport relationship between the semantics contained in the branch views and the central view. Aligning this transport plan with the representations forces the encoder to learn node representations that better reflect the graph space and are of higher quality. Extensive experiments on four real-world datasets demonstrate that our proposed HGOT model achieves state-of-the-art performance on various downstream tasks. In particular, in the node classification task, HGOT achieves an average improvement of more than 6\% in accuracy compared with state-of-the-art methods.
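The transport plan between a branch view and the central view can be computed with standard entropic optimal transport (Sinkhorn iterations). The sketch below is a generic illustration of that building block, not the HGOT implementation; all names, sizes, and the regularization value are assumptions for the example:

```python
import numpy as np

# Entropic OT plan between two sets of node embeddings with uniform
# marginals, via Sinkhorn iterations. Illustrative only.
def sinkhorn_plan(cost, reg=0.5, n_iters=200):
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # match column marginals
        u = a / (K @ v)              # match row marginals
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
branch = rng.standard_normal((6, 8))   # branch-view embeddings, 6 nodes
central = rng.standard_normal((6, 8))  # central-view embeddings

# Squared-Euclidean cost between the two sets, normalized for stability.
cost = ((branch[:, None, :] - central[None, :, :]) ** 2).sum(-1)
cost /= cost.max()
plan = sinkhorn_plan(cost)
print(plan.sum())  # a valid coupling: total mass 1, uniform marginals
```

The resulting `plan[i, j]` quantifies how much of branch-view node `i`'s semantics is matched to central-view node `j`, which is the kind of correspondence an OT-based alignment objective would push the encoder to respect.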
Paperid:948
Authors:Wen-Bo Du · Hao-Yi Lei · Lue Tao · Tian-Zuo Wang · Zhi-Hua Zhou
Abstract:
In the field of machine learning (ML), an essential type of decision-related problem is known as AUF (Avoiding Undesired Future): if an ML model predicts an undesired outcome, how can decisions be made to prevent it? Recently, a novel framework called rehearsal learning has been proposed to address the AUF problem. Despite its utility in modeling uncertainty for decision-making, it remains unclear under what conditions and how optimal actions that maximize the AUF probability can be identified. In this paper, we propose CARE (CAnonical REctangle), a condition under which the maximum AUF probability can be achieved. Under the CARE condition, we present a projection-Newton algorithm to select actions and prove that the algorithm achieves superlinear convergence to the optimal one. Besides, we provide a generalization method for adapting the algorithm to AUF scenarios beyond the CARE condition. Finally, we demonstrate that a closed-form solution exists when the outcome is a singleton variable, substantially reducing the time complexity of decision-making. Experiments validate the effectiveness and efficiency of our method.
Paperid:949
Authors:Yunhak Oh · Junseok Lee · Yeongmin Kim · Sangwoo Seo · Namkyeong Lee · Chanyoung Park
Abstract:
Spatially Resolved Transcriptomics (SRT) is a cutting-edge technique that captures the spatial context of cells within tissues, enabling the study of complex biological networks. Recent graph-based methods leverage both gene expression and spatial information to identify relevant spatial domains. However, these approaches fall short in obtaining meaningful spot representations, especially for spots near spatial domain boundaries, as they heavily emphasize adjacent spots that have minimal feature differences from an anchor node. To address this, we propose Spotscape, a novel framework that introduces the Similarity Telescope module to capture global relationships between multiple spots. Additionally, we propose a similarity scaling strategy to regulate the distances between intra- and inter-slice spots, facilitating effective multi-slice integration. Extensive experiments demonstrate the superiority of Spotscape in various downstream tasks, including single-slice and multi-slice scenarios.
Paperid:950
Authors:Zichen Wang · Chuanhao Li · Huazheng Wang
Abstract:
We investigate the problem of identifying the optimal scoring rule within the principal-agent framework for the online information acquisition problem. We focus on the principal's perspective, seeking to determine the desired scoring rule through interactions with the agent. To address this challenge, we propose two algorithms, OIAFC and OIAFB, tailored to the fixed-confidence and fixed-budget settings, respectively. Our theoretical analysis demonstrates that OIAFC can extract the desired $(\epsilon, \delta)$-scoring rule with an efficient instance-dependent sample complexity or an instance-independent sample complexity. Our analysis also shows that OIAFB matches the instance-independent performance bound of OIAFC, while both algorithms share the same complexity across the fixed-confidence and fixed-budget settings.
Paperid:951
Authors:Shiyu Wang · Mariam Avagyan · Yihan Shen · Arnaud Lamy · Tingran Wang · Szabolcs Marka · Zsuzsanna Marka · John Wright
Abstract:
Learned denoisers play a fundamental role in various signal generation (e.g., diffusion models) and reconstruction (e.g., compressed sensing) architectures, whose success derives from their ability to leverage low-dimensional structure in data. Existing denoising methods, however, either rely on local approximations that require a linear scan of the entire dataset or treat denoising as a generic function approximation problem, often sacrificing efficiency and interpretability. We consider the problem of efficiently denoising new noisy data sampled from an unknown manifold $\mathcal M$, relying only on noisy samples. This work proposes a framework for test-time efficient manifold denoising, by framing the concept of "learning-to-denoise" as *"learning-to-optimize"*. We have two technical innovations: (i) *online learning* methods which learn to optimize over the manifold of clean signals using only noisy data, effectively "growing" an optimizer one sample at a time; (ii) *mixed-order* methods which guarantee that the learned optimizers achieve global optimality, ensuring both efficiency and near-optimal denoising performance. We corroborate these claims with theoretical analyses of both the complexity and denoising performance of mixed-order traversal. Our experiments on synthetic manifolds as well as scientific and imagery data demonstrate significantly improved complexity-performance tradeoffs compared to baseline methods.
Paperid:952
Authors:Haotian Zhai · Connor Lawless · Ellen Vitercik · Liu Leqi
Abstract:
A fundamental problem in combinatorial optimization is identifying equivalent formulations, which can lead to more efficient solution strategies and deeper insights into a problem's computational complexity. The need to automatically identify equivalence between problem formulations has grown as optimization copilots---systems that generate problem formulations from natural language descriptions---have proliferated. However, existing approaches to checking formulation equivalence lack grounding, relying on simple heuristics that are insufficient for rigorous validation. Inspired by Karp reductions, in this work we introduce quasi-Karp equivalence, a formal criterion for determining when two optimization formulations are equivalent based on the existence of a mapping between their decision variables. We propose EquivaMap, a framework that leverages large language models to automatically discover such mappings, enabling scalable and reliable equivalence verification. To evaluate our approach, we construct the first open-source dataset of equivalent optimization formulations, generated by applying transformations such as adding slack variables or valid inequalities to existing formulations. Empirically, EquivaMap significantly outperforms existing methods, achieving substantial improvements in correctly identifying formulation equivalence.
Paperid:953
Authors:Rogelio A. Mancisidor · Robert Jenssen · Shujian Yu · Michael Kampffmeyer
Abstract:
Multimodal learning with variational autoencoders (VAEs) requires estimating joint distributions to evaluate the evidence lower bound (ELBO). Current methods, the product and mixture of experts, aggregate single-modality distributions assuming independence for simplicity, which is an overoptimistic assumption. This research introduces a novel methodology for aggregating single-modality distributions by exploiting the principle of consensus of dependent experts (CoDE), which circumvents the aforementioned assumption. Utilizing the CoDE method, we propose a novel ELBO that approximates the joint likelihood of the multimodal data by learning the contribution of each subset of modalities. The resulting CoDE-VAE model demonstrates better performance in balancing the trade-off between generative coherence and generative quality, as well as generating more precise log-likelihood estimations. CoDE-VAE further minimizes the generative quality gap as the number of modalities increases. In certain cases, it reaches a generative quality similar to that of unimodal VAEs, a desirable property lacking in most current methods. Finally, the classification accuracy achieved by CoDE-VAE is comparable to that of state-of-the-art multimodal VAE models.
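The independence-assuming baseline this abstract contrasts against, the product of experts (PoE), has a simple closed form for Gaussian experts: precisions add. The sketch below illustrates that standard rule and why it is "overoptimistic" (it is not the paper's CoDE aggregation):

```python
import numpy as np

# Product of independent Gaussian experts N(mu_i, sigma_i^2):
# combined precision = sum of precisions; mean = precision-weighted mean.
def poe_gaussian(mus, sigmas):
    precisions = 1.0 / np.square(sigmas)
    var = 1.0 / precisions.sum(axis=0)
    mu = var * (precisions * mus).sum(axis=0)
    return mu, np.sqrt(var)

# Two unit-variance experts that disagree: PoE lands between them, and
# because independence is assumed the aggregate is *more* confident
# (smaller sigma) than either expert -- the overoptimism CoDE avoids.
mus = np.array([[0.0], [2.0]])
sigmas = np.array([[1.0], [1.0]])
mu, sigma = poe_gaussian(mus, sigmas)
print(mu, sigma)  # mean 1.0, std ~0.707
```

When the experts' errors are in fact correlated (as modality encoders' errors typically are), this shrinking variance overstates the evidence, which is the assumption the consensus-of-dependent-experts rule is designed to drop.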
Paperid:954
Authors:Jianyu Wang · Zhiqiang Hu · Lidong Bing
Abstract:
We propose a novel prompt design paradigm that challenges conventional wisdom in large language model (LLM) prompting. While conventional wisdom prioritizes well-crafted instructions and demonstrations for in-context learning (ICL), we show that pruning random demonstrations into seemingly incoherent "gibberish" can remarkably improve performance across diverse tasks. Notably, the "gibberish" always matches or surpasses state-of-the-art automatic prompt optimization techniques, achieving substantial gains regardless of LLM alignment. Nevertheless, discovering an effective pruning strategy is non-trivial: existing attribution methods and prompt compression algorithms fail to deliver robust results, let alone human intuition. To this end, we propose a self-discovering prompt optimization framework, PromptQuine, an evolutionary search framework that automatically searches for the pruning strategy by itself using only low-shot regimes. Much like the emergent complexity in biological systems—such as symbiosis, mutualism, and self-organization—arising in response to resource constraints, our framework evolves and refines unconventional yet highly effective prompts by leveraging only the tokens present within the context (i.e., in-context optimization). We demonstrate its effectiveness across classification, multiple-choice question answering, generation, and math reasoning tasks across LLMs, while optimizing significantly faster than existing automatic optimization methods on several tasks. We hope our findings can guide mechanistic studies of in-context learning and serve as a call to action, paving the way for more open-ended search algorithms for more effective LLM prompting.
Paperid:955
Authors:Luca M. Schulze Buschoff · Konstantinos Voudouris · Elif Akata · Matthias Bethge · Josh Tenenbaum · Eric Schulz
Abstract:
Pretrained vision language models still fall short of human visual cognition. In an effort to improve visual cognition and align models with human behavior, we introduce visual stimuli and human judgments on visual cognition tasks, allowing us to systematically evaluate performance across cognitive domains under a consistent environment. We fine-tune models on ground truth data for intuitive physics and causal reasoning and find that this improves model performance in the respective fine-tuning domain. Furthermore, it can improve model alignment with human behavior. However, we find that fine-tuning does not contribute to robust human-like generalization to data with other visual characteristics or to tasks in other cognitive domains.
Paperid:956
Authors:Jiahao Chen · Bin Qin · Jiangmeng Li · Hao Chen · Bing Su
Abstract:
Long-tailed learning has garnered increasing attention due to its practical significance. Among the various approaches, the fine-tuning paradigm has gained considerable interest with the advent of foundation models. However, most existing methods primarily focus on leveraging knowledge from these models, overlooking the inherent biases introduced by the imbalanced training data they rely on. In this paper, we examine how such imbalances from pre-training affect long-tailed downstream tasks. Specifically, we characterize the imbalance biases inherited by foundation models on downstream tasks as parameter imbalance and data imbalance. During fine-tuning, we observe that parameter imbalance plays a more critical role, while data imbalance can be mitigated using existing re-balancing strategies. Moreover, unlike data imbalance, parameter imbalance cannot be effectively addressed by current re-balancing techniques, such as adjusting the logits, during training. To tackle both imbalances simultaneously, we build our method on causal learning and view the incomplete semantic factor as a confounder, which introduces spurious correlations between input samples and labels. To resolve its negative effects, we propose a novel backdoor adjustment method that learns the true causal effect between input samples and labels, rather than merely fitting the correlations in the data. Notably, we achieve an average performance increase of about 1.67% on each dataset.
Paperid:957
Authors:Slobodan Jenko · Niels Mündler · Jingxuan He · Mark Vero · Martin Vechev
Abstract:
Modern code completion engines, powered by large language models (LLMs), assist millions of developers with their strong capabilities to generate functionally correct code. Due to this popularity, it is crucial to investigate the security implications of relying on LLM-based code completion. In this work, we demonstrate that state-of-the-art black-box LLM-based code completion engines can be stealthily biased by adversaries to significantly increase their rate of insecure code generation. We present the first attack, named INSEC, that achieves this goal. INSEC works by injecting an attack string as a short comment in the completion input. The attack string is crafted through a query-based optimization procedure starting from a set of carefully designed initialization schemes. We demonstrate INSEC's broad applicability and effectiveness by evaluating it on various state-of-the-art open-source models and black-box commercial services (e.g., OpenAI API and GitHub Copilot). On a diverse set of security-critical test cases, covering 16 CWEs across 5 programming languages, INSEC increases the rate of generated insecure code by more than 50\%, while maintaining the functional correctness of generated code. INSEC is highly practical -- it requires low resources and costs less than 10 US dollars to develop on commodity hardware. Moreover, we showcase the attack's real-world deployment, by developing an IDE plug-in that stealthily injects INSEC into the GitHub Copilot extension.
Paperid:958
Authors:Saurabh Jha · Rohan Arora · Yuji Watanabe · Takumi Yanagawa · Yinfang Chen · Jackson Clark · Bhavya Bhavya · Mudit Verma · Harshit Kumar · Hirokuni Kitahara · Noah Zheutlin · Saki Takano · Divya Pathak · Felix George · Xinbo Wu · Bekir Turkkan · Gerard Vanloo · Michael Nidd · Ting Dai · Oishik Chatterjee · Pranjal Gupta · Suranjana Samanta · Pooja Aggarwal · Rong Lee · Jae-wook Ahn · Debanjana Kar · Amit Paradkar · Yu Deng · Pratibha Moogi · Prateeti Mohapatra · Naoki Abe · Chandrasekhar Narayanaswami · Tianyin Xu · Lav Varshney · Ruchi Mahindru · Anca Sailer · Laura Shwartz · Daby Sow · Nicholas Fuller · Ruchir Puri
Abstract:
Realizing the vision of using AI agents to automate critical IT tasks depends on the ability to measure and understand the effectiveness of proposed solutions. We introduce ITBench, a framework that offers a systematic methodology for benchmarking AI agents to address real-world IT automation tasks. Our initial release targets three key areas: site reliability engineering (SRE), compliance and security operations (CISO), and financial operations (FinOps). The design enables AI researchers to understand the challenges and opportunities of AI agents for IT automation with push-button workflows and interpretable metrics. ITBench includes an initial set of 94 real-world scenarios, which can be easily extended by community contributions. Our results show that agents powered by state-of-the-art models resolve only 13.8% of SRE scenarios, 25.2% of CISO scenarios, and 0% of FinOps scenarios. We expect ITBench to be a key enabler of AI-driven IT automation that is correct, safe, and fast.
Paperid:959
Authors:Lee Cohen · Connie Hong · Jack Hsieh · Judy Hanwen Shen
Abstract:
In an era of increasingly capable foundation models, job seekers are turning to generative AI tools to enhance their application materials. However, unequal access to and knowledge about generative AI tools can harm both employers and candidates by reducing the accuracy of hiring decisions and introducing disparities between equally qualified candidates. To address these challenges, we introduce a new variant of the strategic classification framework tailored to manipulations performed using large language models, accommodating varying levels of manipulations and stochastic outcomes. We propose a ``two-ticket'' scheme, where the hiring algorithm applies an additional manipulation to each submitted resume and considers this manipulated version together with the original submitted resume. We establish theoretical guarantees for this scheme, showing improvements for both the fairness and accuracy of hiring decisions when the true positive rate is maximized subject to a no false positives constraint. Finally, we empirically validate our framework and the performance of our two-ticket scheme on real resumes using an open-source resume screening tool.
Paperid:960
Authors:Ilias Diakonikolas · Daniel Kane · Sushrut Karmalkar · Jasper C.H. Lee · Thanasis Pittas
Abstract:
We study the complexity of learning $k$-mixtures of Gaussians ($k$-GMMs) on $\mathbb R^d$. This task is known to have complexity $d^{\Omega(k)}$ in full generality. To circumvent this exponential lower bound on the number of components, research has focused on learning families of GMMs satisfying additional structural properties. A natural assumption posits that the component weights are not exponentially small and that the components have the same unknown covariance. Recent work gave a $d^{O(\log(1/w_{\min}))}$-time algorithm for this class of GMMs, where $w_{\min}$ is the minimum weight. Our first main result is a Statistical Query (SQ) lower bound showing that this quasi-polynomial upper bound is essentially best possible, even for the special case of uniform weights. Specifically, we show that it is SQ-hard to distinguish between such a mixture and the standard Gaussian. We further explore how the distribution of weights affects the complexity of this task. Our second main result is a quasi-polynomial upper bound for the aforementioned testing task when most of the weights are uniform while a small fraction of the weights are potentially arbitrary.
Paperid:961
Authors:Chinmaya Kausik · Kevin Tan · Ambuj Tewari
Abstract:
Leveraging offline data is an attractive way to accelerate online sequential decision-making. However, it is crucial to account for latent states in users or environments in the offline data, and latent bandits form a compelling model for doing so. In this light, we design end-to-end latent bandit algorithms capable of handling uncountably many latent states. We focus on a linear latent contextual bandit — a linear bandit where each user has its own high-dimensional reward parameter in $\mathbb{R}^{d_A}$, but reward parameters across users lie in a low-rank latent subspace of dimension $d_K \ll d_A$. First, we provide an offline algorithm to learn this subspace with provable guarantees. We then present two online algorithms that utilize the output of this offline algorithm to accelerate online learning. The first enjoys $\tilde O(\min(d_A\sqrt{T}, d_K\sqrt{T}(1+\sqrt{d_AT/d_KN})))$ regret guarantees, so that the effective dimension is lower when the size $N$ of the offline dataset is larger. We prove a matching lower bound on regret, showing that our algorithm is minimax optimal. The second is a practical algorithm that enjoys only a slightly weaker guarantee, but is computationally efficient. We also establish the efficacy of our methods using experiments on both synthetic data and real-life movie recommendation data from MovieLens. Finally, we theoretically establish the generality of the latent bandit model by proving a de Finetti theorem for stateless decision processes.
Paperid:962
Authors:Long Ma · Fangwei Zhong · Yizhou Wang
Abstract:
The ability to adapt to new environments with noisy dynamics and unseen objectives is crucial for AI agents. In-context reinforcement learning (ICRL) has emerged as a paradigm to build adaptive policies, employing a context trajectory of the test-time interactions to infer the true task and the corresponding optimal policy efficiently without gradient updates. However, ICRL policies heavily rely on context trajectories, making them vulnerable to distribution shifts from training to testing and degrading performance, particularly in offline settings where the training data is static. In this paper, we highlight that most existing offline ICRL methods are trained for approximate Bayesian inference based on the training distribution, rendering them vulnerable to distribution shifts at test time and resulting in poor generalization. To address this, we introduce Behavior-agnostic Task Inference (BATI) for ICRL, a model-based maximum-likelihood solution to infer the task representation robustly. In contrast to previous methods that rely on a learned encoder as the approximate posterior, BATI focuses purely on dynamics, thus insulating itself against the behavior of the context collection policy. Experiments on MuJoCo environments demonstrate that BATI effectively interprets out-of-distribution contexts and outperforms other methods, even in the presence of significant environmental noise.
Paperid:963
Authors:Artem Moskalev · Mangal Prakash · Junjie Xu · Tianyu Cui · Rui Liao · Tommaso Mansi
Abstract:
Processing global geometric context while preserving equivariance is crucial when modeling biological, chemical, and physical systems. Yet, this is challenging due to the computational demands of equivariance and global context at scale. Standard methods such as equivariant self-attention suffer from quadratic complexity, while local methods such as distance-based message passing sacrifice global information. Inspired by the recent success of state-space and long-convolutional models, we introduce Geometric Hyena, the first equivariant long-convolutional model for geometric systems. Geometric Hyena captures global geometric context at sub-quadratic complexity while maintaining equivariance to rotations and translations. Evaluated on all-atom property prediction of large RNA molecules and full protein molecular dynamics, Geometric Hyena outperforms existing equivariant models while requiring significantly less memory and compute than equivariant self-attention. Notably, our model processes the geometric context of $30k$ tokens $20 \times$ faster than the equivariant transformer and allows $72 \times$ longer context within the same budget.
Paperid:964
Authors:Kunwoong Kim · Jihu Lee · Sangchul Park · Yongdai Kim
Abstract:
Algorithmic fairness in clustering aims to balance the proportions of instances assigned to each cluster with respect to a given sensitive attribute. While recently developed fair clustering algorithms optimize clustering objectives under specific fairness constraints, their inherent complexity or approximation often results in suboptimal clustering utility or numerical instability in practice. To resolve these limitations, we propose a new fair clustering algorithm based on a novel decomposition of the fair $K$-means clustering objective function. The proposed algorithm, called Fair Clustering via Alignment (FCA), operates by alternately (i) finding a joint probability distribution to align the data from different protected groups, and (ii) optimizing cluster centers in the aligned space. A key advantage of FCA is that it theoretically guarantees approximately optimal clustering utility for any given fairness level without complex constraints, thereby enabling high-utility fair clustering in practice. Experiments show that FCA outperforms existing methods by (i) attaining a superior trade-off between fairness level and clustering utility, and (ii) achieving near-perfect fairness without numerical instability.
Paperid:965
Authors:Yanchao Tan · Hang Lv · Yunfei Zhan · Guofang Ma · Bo Xiong · Carl Yang
Abstract:
Language Models (LMs) have advanced diagnosis prediction by leveraging the semantic understanding of medical concepts in Electronic Health Records (EHRs). Despite these advancements, existing LM-based methods often fail to capture the structures of medical concepts (e.g., the hierarchy structure from domain knowledge). In this paper, we propose BoxLM, a novel framework that unifies the structures and semantics of medical concepts for diagnosis prediction. Specifically, we propose a structure-semantic fusion mechanism via box embeddings, which integrates both ontology-driven and EHR-driven hierarchical structures with LM-based semantic embeddings, enabling interpretable medical concept representations. Furthermore, in the box-aware diagnosis prediction module, an evolve-and-memorize patient box learning mechanism is proposed to model the temporal dynamics of patient visits, and a volume-based similarity measurement enables accurate diagnosis prediction. Extensive experiments demonstrate that BoxLM consistently outperforms state-of-the-art baselines, especially achieving strong performance in few-shot learning scenarios, showcasing its practical utility in real-world clinical settings.
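The volume-based similarity the abstract mentions can be illustrated with a generic box-embedding computation. The sketch below is a hypothetical example of such a measure (the paper's exact formula is not given in the abstract): the overlap volume of two axis-aligned boxes, normalized by the smaller box's volume.

```python
import numpy as np

def box_volume(lo, hi):
    # Volume of an axis-aligned box; empty if any side collapses.
    return float(np.prod(np.clip(hi - lo, 0.0, None)))

def box_similarity(a_lo, a_hi, b_lo, b_hi):
    # Intersection volume normalized by the smaller box's volume:
    # 1.0 when one box contains the other, 0.0 when they are disjoint.
    i_lo = np.maximum(a_lo, b_lo)
    i_hi = np.minimum(a_hi, b_hi)
    inter = box_volume(i_lo, i_hi)
    return inter / min(box_volume(a_lo, a_hi), box_volume(b_lo, b_hi))

# Two overlapping 2D concept boxes: [0,2]^2 and [1,3]^2
a_lo, a_hi = np.zeros(2), np.full(2, 2.0)
b_lo, b_hi = np.ones(2), np.full(2, 3.0)
sim = box_similarity(a_lo, a_hi, b_lo, b_hi)  # intersection [1,2]^2 -> 0.25
```

A similarity of 1.0 for nested boxes is what makes such measures natural for hierarchical concepts: a child diagnosis box inside its parent box scores a perfect match.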
Paperid:966
Authors:Stelios Triantafyllou · Aleksa Sukovic · Yasaman Zolfimoselo · Goran Radanovic
Abstract:
We address the challenge of explaining counterfactual outcomes in multiagent Markov decision processes. In particular, we aim to explain the total counterfactual effect of an agent's action on the outcome of a realized scenario through its influence on the environment dynamics and the agents' behavior. To achieve this, we introduce a novel causal explanation formula that decomposes the counterfactual effect by attributing to each agent and state variable a score reflecting their respective contributions to the effect. First, we show that the total counterfactual effect of an agent's action can be decomposed into two components: one measuring the effect that propagates through all subsequent agents' actions and another related to the effect that propagates through the state transitions. Building on recent advancements in causal contribution analysis, we further decompose these two effects as follows. For the former, we consider agent-specific effects -- a causal concept that quantifies the counterfactual effect of an agent's action that propagates through a subset of agents. Based on this notion, we use Shapley value to attribute the effect to individual agents. For the latter, we consider the concept of structure-preserving interventions and attribute the effect to state variables based on their "intrinsic" contributions. Through extensive experimentation, we demonstrate the interpretability of our approach in a Gridworld environment with LLM-assisted agents and a sepsis management simulator.
Paperid:967
Authors:Cheng-Yen Hsieh · Xinyou Wang · Daiheng Zhang · Dongyu Xue · Fei YE · Shujian Huang · Zaixiang Zheng · Quanquan Gu
Abstract:
Multimodal protein language models (PLMs) integrate sequence and token-based structural information, serving as a powerful foundation for protein modeling, generation, and design. However, the reliance on tokenizing 3D structures into discrete tokens causes substantial loss of fidelity about fine-grained structural details and correlations. In this paper, we systematically elucidate the design space of multimodal PLMs to overcome their limitations. We identify tokenization loss and inaccurate structure token predictions by the PLMs as major bottlenecks. To address these, our proposed design space covers improved generative modeling, structure-aware architectures and representation learning, and data exploration. Our advancements enable finer-grained supervision, demonstrating that token-based multimodal PLMs can achieve robust structural modeling. The effective design methods dramatically improve structure generation diversity and, notably, the folding abilities of our 650M model, reducing the RMSD from 5.52 to 2.36 on the PDB test set, even outperforming 3B baselines and on par with specialized folding models.
Paperid:968
Authors:Rei Higuchi · Taiji Suzuki
Abstract:
Aligning large language models (LLMs) with human preferences is crucial for safe deployment, yet existing methods assume specific preference models such as the Bradley-Terry model. This assumption leads to statistical inconsistency, where more data doesn't guarantee convergence to true human preferences. To address this critical gap, we introduce a novel alignment method, Direct Density Ratio Optimization (DDRO). DDRO directly estimates the density ratio between preferred and unpreferred output distributions, circumventing the need for explicit human preference modeling. We theoretically prove that DDRO is statistically consistent, ensuring convergence to the true preferred distribution as the data size grows, regardless of the underlying preference structure. Experiments demonstrate that DDRO achieves superior performance compared to existing methods, showcasing its effectiveness and potential for significant improvement. DDRO unlocks the potential for truly data-driven alignment, paving the way for more reliable and human-aligned LLMs.
Paperid:969
Authors:Satoshi Hayakawa · Yuhta Takida · Masaaki Imaizumi · Hiromi Wakaki · Yuki Mitsufuji
Abstract:
Diffusion models have demonstrated exceptional performance in various fields of generative modeling, but suffer from slow sampling speed due to their iterative nature. While this issue is being addressed in continuous domains, discrete diffusion models face unique challenges, particularly in capturing dependencies between elements (e.g., pixel relationships in images, sequential dependencies in language), mainly due to the computational cost of processing high-dimensional joint distributions. In this paper, (i) we propose "mixture" models for discrete diffusion that are capable of treating dimensional correlations while remaining scalable, and (ii) we provide a set of loss functions for distilling the iterations of existing models. Two primary theoretical insights underpin our approach: First, conventional models with element-wise independence can well approximate the data distribution, but essentially require many sampling steps. Second, our loss functions enable the mixture models to distill such many-step conventional models into just a few steps by learning the dimensional correlations. Our experimental results show the effectiveness of the proposed method in distilling pretrained discrete diffusion models across image and language domains.
Paperid:970
Authors:Yiran Qin · Zhelun Shi · Jiwen Yu · Xijun Wang · Enshen Zhou · Lijun Li · Zhenfei Yin · Xihui Liu · Lu Sheng · Jing Shao · LEI BAI · Ruimao Zhang
Abstract:
Recent advancements in predictive models have demonstrated exceptional capabilities in predicting the future state of objects and scenes. However, the lack of categorization based on inherent characteristics continues to hinder the progress of predictive model development. Additionally, existing benchmarks are unable to effectively evaluate higher-capability, highly embodied predictive models from an embodied perspective. In this work, we classify the functionalities of predictive models into a hierarchy and take the first step in evaluating World Simulators by proposing a dual evaluation framework called WorldSimBench. WorldSimBench includes Explicit Perceptual Evaluation and Implicit Manipulative Evaluation, encompassing human preference assessments from the visual perspective and action-level evaluations in embodied tasks, covering three representative embodied scenarios: Open-Ended Embodied Environment, Autonomous Driving, and Robot Manipulation. In the Explicit Perceptual Evaluation, we introduce the HF-Embodied Dataset, a video assessment dataset based on fine-grained human feedback, which we use to train a Human Preference Evaluator that aligns with human perception and explicitly assesses the visual fidelity of World Simulators. In the Implicit Manipulative Evaluation, we assess the video-action consistency of World Simulators by evaluating whether the generated situation-aware video can be accurately translated into the correct control signals in dynamic environments. Our comprehensive evaluation offers key insights that can drive further innovation in video generation models, positioning World Simulators as a pivotal advancement toward embodied artificial intelligence.
Paperid:971
Authors:Kiwhan Song · Boyuan Chen · Max Simchowitz · Yilun Du · Russ Tedrake · Vincent Sitzmann
Abstract:
Classifier-free guidance (CFG) is a key technique for improving conditional generation in diffusion models, enabling more accurate control while enhancing sample quality. It is natural to extend this technique to video diffusion, which generates video conditioned on a variable number of context frames, collectively referred to as history. However, we find two key challenges to guiding with variable-length history: architectures that only support fixed-size conditioning, and the empirical observation that CFG-style history dropout performs poorly. To address this, we propose the Diffusion Forcing Transformer (DFoT), a video diffusion architecture and theoretically grounded training objective that jointly enable conditioning on a flexible number of history frames. We then introduce History Guidance, a family of guidance methods uniquely enabled by DFoT. We show that its simplest form, vanilla history guidance, already significantly improves video generation quality and temporal consistency. A more advanced method, history guidance across time and frequency, further enhances motion dynamics, enables compositional generalization to out-of-distribution history, and can stably roll out extremely long videos. Website: anonymous URL
Paperid:972
Authors:Carles Balsells-Rodas · Xavier Sumba · Tanmayee Narendra · Ruibo Tu · Gabriele Schweikert · Hedvig Kjellström · Yingzhen Li
Abstract:
Causal discovery, i.e., inferring underlying causal relationships from observational data, is highly challenging for AI systems. In a time series modeling context, traditional causal discovery methods mainly consider constrained scenarios with fully observed variables and/or data from stationary time series. We develop a causal discovery approach to handle a wide class of nonstationary time series that are conditionally stationary, where the nonstationary behaviour is modeled as stationarity conditioned on a set of latent state variables. Named State-Dependent Causal Inference (SDCI), our approach is able to recover the underlying causal dependencies, with provable identifiability for the state-dependent causal structures. Empirical experiments on nonlinear particle interaction data and gene regulatory networks demonstrate SDCI's superior performance over baseline causal discovery methods. Improved results over non-causal RNNs on modeling NBA player movements demonstrate the potential of our method and motivate the use of causality-driven methods for forecasting.
Paperid:973
Authors:Xi Weng · Jianing An · Xudong Ma · Binhang Qi · Jie Luo · Xi Yang · Jin Song Dong · Lei Huang
Abstract:
Self-supervised learning (SSL) methods via joint embedding architectures have proven remarkably effective at capturing semantically rich representations with strong clustering properties, even in the absence of label supervision. Despite this, few of them have explored leveraging these untapped properties to improve themselves. In this paper, we provide evidence through various metrics that the encoder's output encoding exhibits superior and more stable clustering properties compared to other components. Building on this insight, we propose a novel positive-feedback SSL method, termed Representation Soft Assignment (ReSA), which leverages the model's clustering properties to promote learning in a self-guided manner. Extensive experiments on standard SSL benchmarks reveal that models pretrained with ReSA outperform other state-of-the-art SSL methods by a significant margin. Finally, we analyze how ReSA facilitates better clustering properties, demonstrating that it effectively enhances clustering performance at both fine-grained and coarse-grained levels, shaping representations that are inherently more structured and semantically meaningful.
Paperid:974
Authors:Peihua Mai · Youlong Ding · Ziyan Lyu · Minxin Du · Yan (James) Pang
Abstract:
Federated recommender system (FedRec) has emerged as a solution to protect user data through collaborative training techniques. A typical FedRec involves transmitting the full model and entire weight updates between edge devices and the server, causing significant burdens to edge devices with limited bandwidth and computational power. The sparsity of embedding updates provides an opportunity for payload optimization, while existing sparsity-aware federated protocols generally sacrifice privacy for efficiency. A key challenge in designing a secure sparsity-aware efficient protocol is to protect the rated item indices from the server. In this paper, we propose a lossless secure recommender system with sparse embedding updates (SecEmb). SecEmb reduces user payload while ensuring that the server learns no information about both rated item indices and individual updates except the aggregated model. The protocol consists of two correlated modules: (1) a privacy-preserving embedding retrieval module that allows users to download relevant embeddings from the server, and (2) an update aggregation module that securely aggregates updates at the server. Empirical analysis demonstrates that SecEmb reduces both download and upload communication costs by up to 90x and decreases user-side computation time by up to 70x compared with secure FedRec protocols. Additionally, it offers non-negligible utility advantages compared with lossy message compression methods.
Paperid:975
Authors:Jesse Farebrother · Matteo Pirotta · Andrea Tirinzoni · REMI MUNOS · Alessandro Lazaric · Ahmed Touati
Abstract:
Predictive models of the future are fundamental for an agent's ability to reason and plan. A common strategy learns a dynamics or "world" model and unrolls it step-by-step at inference, where small errors can rapidly compound. Geometric Horizon Models (GHMs) offer a compelling alternative by directly making predictions of future states, avoiding cumulative inference errors. While GHMs can be conveniently learned by a generative analog to temporal difference (TD) learning, existing methods are negatively affected by bootstrapping predictions at train time and struggle to generate high-quality predictions for longer horizons. This paper introduces Temporal Difference Flows (TD-Flow), which leverages the structure of a novel Bellman equation on probability paths alongside flow-matching techniques to learn accurate GHMs at over 5x the horizon length of prior GHMs. Theoretically, we establish a new convergence result for the broader class of flow-matching, score-matching, and diffusion-based generative TD methods and primarily attribute TD-Flow's efficacy to reduced gradient variance during training. Empirically, we validate TD-Flow across a diverse set of domains on both generative metrics and downstream tasks including policy evaluation. Moreover, integrating TD-Flow with recent behavior foundation models for planning over pre-trained policies demonstrates substantial performance gains, underscoring its promise for long-horizon decision-making.
Paperid:976
Authors:Joyentanuj Das · Suranjan De · He Sun
Abstract:
Graph clustering is an important algorithmic technique for analysing massive graphs, and has been widely applied in many research fields of data science. While the objective of most graph clustering algorithms is to find a vertex set of low conductance, there has been a sequence of recent studies that highlight the importance of the interconnection between clusters when analysing real-world datasets. Following this line of research, in this work we study bipartite-like clusters and present efficient and online algorithms that find such clusters in both undirected graphs and directed ones. We conduct experimental studies on both synthetic and real-world datasets, and show that our algorithms significantly speed up the running time of existing clustering algorithms while preserving their effectiveness.
Paperid:977
Authors:Omri Hemo · Alon Zolfi · Oryan Yehezkel · Omer Hofman · Roman Vainshtein · Hisashi Kojima · Yuval Elovici · Asaf Shabtai
Abstract:
Federated learning (FL) enables privacy-preserving distributed machine learning by sharing gradients instead of raw data. However, FL remains vulnerable to gradient inversion attacks, in which shared gradients can reveal sensitive training data. Prior research has mainly concentrated on unimodal tasks, particularly image classification, examining the reconstruction of single-modality data and analyzing privacy vulnerabilities in these relatively simple scenarios. As multi-modal models are increasingly used to address complex vision and language tasks, it becomes essential to assess the privacy risks inherent in these architectures. In this paper, we explore gradient inversion attacks targeting multi-modal Document Visual Question Answering (DQA) models and propose GI-DQA, a novel method that reconstructs private document content from gradients. Through extensive evaluation on state-of-the-art DQA models, our approach exposes critical privacy vulnerabilities and highlights the urgent need for robust defenses to secure multi-modal FL systems.
Paperid:978
Authors:Thien Hang Nguyen · Huy Nguyen
Abstract:
We introduce two complementary techniques for efficient optimization that reduce memory requirements while accelerating training of large-scale neural networks. The first technique, Subset-Norm step size, generalizes AdaGrad-Norm and AdaGrad(-Coordinate) through step-size sharing. Subset-Norm (SN) reduces AdaGrad's memory footprint from $O(d)$ to $O(\sqrt{d})$, where $d$ is the model size. For non-convex smooth objectives under coordinate-wise sub-gaussian noise, we show a noise-adapted high-probability convergence guarantee with improved dimensional dependence of SN over existing methods. Our second technique, Subspace-Momentum, reduces the momentum state's memory footprint by restricting momentum to a low-dimensional subspace while performing SGD in the orthogonal complement. We prove high-probability convergence rates for Subspace-Momentum under standard assumptions. Empirical evaluation on pre-training and fine-tuning LLMs demonstrates the effectiveness of our methods. For instance, combining Subset-Norm with Subspace-Momentum achieves Adam's validation perplexity for LLaMA 1B in approximately half the training tokens (6.8B vs 13.1B) while reducing Adam's optimizer-states memory footprint by more than 80\% with minimal additional hyperparameter tuning.
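The step-size-sharing idea behind Subset-Norm can be sketched roughly as follows (this is an illustrative reading of the abstract, not the authors' implementation): partition the $d$ coordinates into about $\sqrt{d}$ contiguous subsets and keep one AdaGrad-style accumulator per subset, so optimizer state is $O(\sqrt{d})$ instead of $O(d)$.

```python
import numpy as np

def subset_norm_update(param, grad, accum, lr=0.1, eps=1e-8):
    # One Subset-Norm step: coordinates are split into len(accum)
    # contiguous subsets, each sharing a single adaptive step size.
    # `accum` holds the running sum of squared gradient norms per
    # subset, so the optimizer state is O(num_subsets), not O(d).
    d = param.size
    k = accum.size
    bounds = np.linspace(0, d, k + 1).astype(int)
    for i in range(k):
        s = slice(bounds[i], bounds[i + 1])
        accum[i] += np.sum(grad[s] ** 2)           # shared second moment
        param[s] -= lr * grad[s] / (np.sqrt(accum[i]) + eps)
    return param, accum

# Toy objective: minimize 0.5 * ||x||^2, so grad = x.
d = 16
x = np.ones(d)
accum = np.zeros(int(np.sqrt(d)))   # 4 accumulators for 16 coordinates
for _ in range(200):
    x, accum = subset_norm_update(x, x.copy(), accum)
```

With `k = d` this reduces to coordinate-wise AdaGrad, and with `k = 1` to AdaGrad-Norm; the square-root grouping sits between the two, trading per-coordinate adaptivity for memory.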
Paperid:979
Authors:Maohao Shen · Guangtao Zeng · Zhenting Qi · Zhang-Wei Hong · Zhenfang Chen · Wei Lu · Gregory Wornell · Subhro Das · David Cox · Chuang Gan
Abstract:
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. This typically involves extensive sampling at inference time guided by an external LLM verifier, resulting in a two-player system. Despite external guidance, the effectiveness of this system demonstrates the potential of a single LLM to tackle complex tasks. Thus, we pose a new research problem: Can we internalize the searching capabilities to fundamentally enhance the reasoning abilities of a single LLM? This work explores an orthogonal direction focusing on post-training LLMs for autoregressive searching (i.e., an extended reasoning process with self-reflection and self-exploration of new strategies). To achieve this, we propose the Chain-of-Action-Thought (COAT) reasoning and a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning. Our approach results in Satori, a 7B LLM trained on open-source models and data. Extensive empirical evaluations demonstrate that Satori achieves state-of-the-art performance on mathematical reasoning benchmarks while exhibiting strong generalization to out-of-domain tasks. Code, data, and models will be fully open-sourced.
Paperid:980
Authors:Mike Heddes · Adel Javanmard · Kyriakos Axiotis · Thomas Fu · MohammadHossein Bateni · Vahab Mirrokni
Abstract:
Transformer networks have achieved remarkable success across diverse domains, leveraging a variety of architectural innovations, including residual connections. However, traditional residual connections, which simply sum the outputs of previous layers, can dilute crucial information. This work introduces DeepCrossAttention (DCA), an approach that enhances residual learning in transformers. DCA employs learnable, input-dependent weights to dynamically combine layer outputs, enabling the model to selectively focus on the most relevant information in any of the previous layers. Furthermore, DCA incorporates depth-wise cross-attention, allowing for richer interactions between layers at different depths. Our language modeling experiments show that DCA achieves improved perplexity for a given training time. Moreover, DCA obtains the same model quality up to 3x faster while adding a negligible number of parameters (e.g., 0.2\%). Theoretical analysis confirms that DCA provides an improved trade-off between accuracy and model size when the ratio of collective layer ranks to the ambient dimension falls below a critical threshold.
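The core idea of dynamically weighting previous layer outputs with an input-dependent gate can be sketched as follows; `W_gate` and the softmax gating form are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1D logit vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def dca_combine(layer_outputs, W_gate):
    # Score each previous layer with a gate conditioned on the latest
    # output, then mix layers by the resulting softmax weights instead
    # of the plain unweighted sum of a standard residual path.
    H = np.stack(layer_outputs)     # (L, d): one row per layer output
    scores = H[-1] @ W_gate         # (L,): one logit per layer
    return softmax(scores) @ H      # (d,): input-dependent mixture

# With a zero gate the weights are uniform, recovering a simple average.
layers = [np.full(4, 1.0), np.full(4, 2.0), np.full(4, 3.0)]
mixed = dca_combine(layers, np.zeros((4, 3)))   # -> [2., 2., 2., 2.]
```

The gate adds only a small `d x L` weight matrix per block, which is consistent with the abstract's claim of a negligible parameter overhead.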
Paperid:981
Authors:Cun-Yuan Xing · Meng-Zhang Qian · Wu-Yang Chen · Wei Gao · Zhi-Hua Zhou
Abstract:
Feature evolvable learning studies the scenario where old features will vanish and new features will emerge when learning with data streams, and various methods have been developed by utilizing some useful relationships from old features to new features, rather than retraining from scratch. This work focuses on two fundamental problems: How to characterize relationships between two different feature spaces, and how to exploit those relationships for feature evolvable learning. We introduce the Kernel Ortho-Mapping (KOM) discrepancy to characterize relationships between two different feature spaces via kernel functions, and correlate it with the optimal classifiers learned from different feature spaces. Based on such discrepancy, we develop the one-pass algorithm for feature evolvable learning, which requires going through all instances only once without storing the entire or partial training data. Our basic idea is to take online kernel learning with random Fourier features and incorporate some feature and label information for feature evolvable learning. We verify the effectiveness of our proposed method both theoretically and empirically.
Paperid:982
Authors:Dario Coscia · Max Welling · Nicola Demo · Gianluigi Rozza
Abstract:
Autoregressive and recurrent networks have achieved remarkable progress across various fields, from weather forecasting to molecular generation and Large Language Models. Despite their strong predictive capabilities, these models lack a rigorous framework for addressing uncertainty, which is key in scientific applications such as PDE solving, molecular generation, and machine learning force fields. To address this shortcoming we present BARNN: a variational Bayesian Autoregressive and Recurrent Neural Network. BARNN aims to provide a principled way to turn any autoregressive or recurrent model into its Bayesian version. BARNN is based on the variational dropout method, allowing it to be applied to large recurrent neural networks as well. We also introduce a temporal version of the "Variational Mixtures of Posteriors" prior (tVAMP-prior) to make Bayesian inference efficient and well-calibrated. Extensive experiments on PDE modelling and molecular generation demonstrate that BARNN not only achieves comparable or superior accuracy compared to existing methods, but also excels in uncertainty quantification and modelling long-range dependencies.
Paperid:983
Authors:Taekyung Lee · Jaemoo Choi · Jaewoong Choi · Myungjoo Kang
Abstract:
Unpaired point cloud completion is crucial for real-world applications, where ground-truth data for complete point clouds are often unavailable. By learning a completion map from unpaired incomplete and complete point cloud data, this task avoids the reliance on paired datasets. In this paper, we propose the \textit{Unbalanced Optimal Transport Map for Unpaired Point Cloud Completion (\textbf{UOT-UPC})} model, which formulates the unpaired completion task as an Unbalanced Optimal Transport (UOT) problem. Our method employs a Neural OT model that learns the UOT map using neural networks. Our model is the first attempt to leverage UOT for unpaired point cloud completion, achieving competitive or superior performance on both single-category and multi-category benchmarks. In particular, our approach is especially robust under the class imbalance problem, which is frequently encountered in real-world unpaired point cloud completion scenarios.
Paperid:984
Authors:Renhao Lu
Abstract:
Recent advancements in deep neural networks have significantly enhanced the performance of semantic segmentation. However, class imbalance and instance imbalance remain persistent challenges, particularly in biomedical image analysis, where smaller instances and thin boundaries are often overshadowed by larger structures. To address the multiscale nature of segmented objects, various models have incorporated mechanisms such as spatial attention and feature pyramid networks. Despite these advancements, most loss functions are still primarily pixel-wise, while regional and boundary-focused loss functions often incur high computational costs or are restricted to small-scale regions. To address this limitation, we propose complex wavelet mutual information (CWMI) loss, a novel loss function that leverages mutual information from subband images decomposed by a complex steerable pyramid. Unlike discrete wavelet transforms, the complex steerable pyramid captures features across multiple orientations, remains robust to small translations, and preserves structural similarity across scales. Moreover, mutual information is well-suited for capturing high-dimensional directional features and exhibits greater noise robustness compared to prior wavelet-based loss functions that rely on distance or angle metrics. Extensive experiments on diverse segmentation datasets demonstrate that CWMI loss achieves significant improvements in both pixel-wise accuracy and topological metrics compared to state-of-the-art methods, while introducing minimal computational overhead.
Paperid:985
Authors:Kazuto Fukuchi
Abstract:
We address the regression problem under the constraint of demographic parity, a commonly used fairness definition. Recent studies have revealed fair minimax optimal regression algorithms—the most accurate algorithms that adhere to the fairness constraint. However, these analyses are tightly coupled with specific data generation models. In this paper, we provide metatheorems that can be applied to various situations to validate the fair minimax optimality of the corresponding regression algorithms. Furthermore, we demonstrate that fair minimax optimal regression can be achieved through post-processing methods, allowing researchers and practitioners to focus on improving conventional regression techniques, which can then be efficiently adapted for fair regression.
Paperid:986
Authors:Katarzyna Kobalczyk · M van der Schaar
Abstract:
Reward modelling from preference data is a crucial step in aligning large language models (LLMs) with human values, requiring robust generalisation to novel prompt-response pairs. In this work, we propose to frame this problem in a causal paradigm, providing the rich toolbox of causality to identify the persistent challenges, such as causal misidentification, preference heterogeneity, and confounding due to user-specific factors. Inheriting from the literature of causal inference, we identify key assumptions necessary for reliable generalisation and contrast them with common data collection practices. We illustrate failure modes of naive reward models and demonstrate how causally-inspired approaches can improve model robustness. Finally, we outline desiderata for future research and practices, advocating targeted interventions to address inherent limitations of observational data.
Paperid:987
Authors:Jianqing Liang · Zhiqiang Li · Xinkai Wei · Yuan Liu · Zhiqiang Wang
Abstract:
Graph contrastive learning has attracted great interest as a dominant and promising self-supervised representation learning approach in recent years. While existing works follow the basic principle of pulling positive pairs closer while pushing negative pairs far away, they still suffer from several critical problems, such as the underlying semantic disturbance brought by augmentation strategies, the failure of GCNs to capture long-range dependence, and the rigidity and inefficiency of node sampling techniques. To address these issues, we propose Manifold Learning Inspired Lightweight Graph Contrastive Learning (ML$^2$-GCL), which inherits the merits of both manifold learning and GCNs. ML$^2$-GCL avoids the potential risks of semantic disturbance with only a single view. It achieves global nonlinear structure recovery from locally linear fits, which can make up for the defects of GCNs. A key advantage is its lightweight design, owing to a closed-form solution for positive pair weights and the removal of pairwise distance calculations. Theoretical analysis proves the existence of the optimal closed-form solution. Extensive empirical results on various benchmarks and evaluation protocols demonstrate the effectiveness and lightweight nature of ML$^2$-GCL.
Paperid:988
Authors:Chi Zhang · Ziying Jia · George Atia · Sihong He · Yue Wang
Abstract:
Transfer reinforcement learning aims to derive a near-optimal policy for a target environment with limited data by leveraging abundant data from related source domains. However, it faces two key challenges: the lack of performance guarantees for the transferred policy, which can lead to undesired actions, and the risk of negative transfer when multiple source domains are involved. We propose a novel framework based on the pessimism principle, which constructs and optimizes a conservative estimation of the target domain’s performance. Our framework effectively addresses the two challenges by providing an optimized lower bound on target performance, ensuring safe and reliable decisions, and by exhibiting monotonic improvement with respect to the quality of the source domains, thereby avoiding negative transfer. We construct two types of conservative estimations, rigorously characterize their effectiveness, and develop efficient distributed algorithms with convergence guarantees. Our framework provides a theoretically sound and practically robust solution for transfer learning in reinforcement learning.
Paperid:989
Authors:Delong Chen · Samuel Cahyawijaya · Jianfeng Liu · Baoyuan Wang · Pascale FUNG
Abstract:
Patch-based image tokenization ignores the morphology of the visual world, limiting effective and efficient learning of image understanding. Inspired by subword tokenization, we introduce subobject-level adaptive token segmentation and explore several approaches, including superpixel, SAM, and a proposed Efficient and PanOptiC (EPOC) image tokenizer. Our EPOC combines boundary detection--a simple task that can be handled well by a compact model--with watershed segmentation, which inherently guarantees no pixels are left unsegmented. Intrinsic evaluations across 5 datasets demonstrate that EPOC's segmentation aligns well with human annotations of both object- and part-level visual morphology, producing more monosemantic tokens and offering substantial efficiency advantages. For extrinsic evaluation, we designed a token embedding that handles arbitrary-shaped tokens, and trained VLMs with different tokenizers on 4 datasets of object recognition and detailed captioning. The results reveal that subobject tokenization enables faster convergence and better generalization while using fewer visual tokens. Code and models will be made publicly available.
Paperid:990
Authors:Guibin Zhang · Luyang Niu · Junfeng Fang · Kun Wang · LEI BAI · Xiang Wang
Abstract:
Large Language Model (LLM)-empowered multi-agent systems extend the cognitive boundaries of individual agents through disciplined collaboration and interaction, while constructing these systems often requires labor-intensive manual designs. Despite the availability of methods to automate the design of agentic workflows, they typically seek to identify a static, complex, one-size-fits-all system, which, however, fails to dynamically allocate inference resources based on the difficulty and domain of each query. To address this challenge, we shift away from the pursuit of a monolithic agentic system, instead optimizing the \textbf{agentic supernet}, a probabilistic and continuous distribution of agentic architectures. We introduce \textbf{MaAS}, an automated framework that samples query-dependent agentic systems from the supernet, delivering high-quality solutions and tailored resource allocation (\textit{e.g.}, LLM calls, tool calls, token cost). Comprehensive evaluation across six benchmarks demonstrates that MaAS \textbf{(I)} requires only $6\sim45\%$ of the inference costs of existing handcrafted or automated multi-agent systems, \textbf{(II)} surpasses them by $0.54\%\sim11.82\%$, and \textbf{(III)} enjoys superior cross-dataset and cross-LLM-backbone transferability.
Paperid:991
Authors:Mason Kamb · Surya Ganguli
Abstract:
We obtain an analytic, interpretable and predictive theory of creativity in convolutional diffusion models. Indeed, score-based diffusion models can generate highly creative images that lie far from their training data. But optimal score-matching theory suggests that these models should only be able to produce memorized training examples. To reconcile this theory-experiment gap, we identify two simple inductive biases, locality and equivariance, that: (1) induce a form of combinatorial creativity by preventing optimal score-matching; (2) result in a fully analytic, completely mechanistically interpretable, equivariant local score (ELS) machine that, (3) without any training can quantitatively predict the outputs of trained convolution-only diffusion models (like ResNets and UNets) with high accuracy (median $r^2$ of 0.90, 0.91, and 0.94 on CIFAR10, FashionMNIST, and MNIST). Our ELS machine reveals a locally consistent patch mosaic model of creativity, in which diffusion models create exponentially many novel images by mixing and matching different local training set patches in different image locations. Our theory also partially predicts the outputs of pre-trained self-attention enabled UNets (median $r^2$ of 0.75 on CIFAR10), revealing an intriguing role for attention in carving out semantic coherence from local patch mosaics.
Paperid:992
Authors:Mark Vero · Niels Mündler · Viktor Chibotaru · Veselin Raychev · Maximilian Baader · Nikola Jovanović · Jingxuan He · Martin Vechev
Abstract:
The automatic generation of programs has long been a fundamental challenge in computer science. Recent benchmarks have shown that large language models (LLMs) can effectively generate code at the function level, make edits given code context, and solve algorithmic coding tasks. To achieve full automation, however, LLMs should be able to generate production-quality, self-contained application modules. To evaluate the capabilities of LLMs in solving this challenge, we introduce BaxBench, a novel evaluation benchmark consisting of 392 challenging tasks for the generation of backend applications. We focus on backends for three critical reasons: (i) they are practically relevant, building the core components of most modern web and cloud software, (ii) they are difficult to get right, requiring multiple functions and files to achieve the desired functionality, and (iii) they are security-critical, as they are exposed to untrusted third parties, making secure solutions that prevent deployment-time attacks an imperative. BaxBench validates the functionality of the generated applications with comprehensive test cases, and assesses their security exposure by executing end-to-end exploits. Our experiments reveal key limitations of current LLMs in both functionality and security: (i) even the best model, OpenAI’s o1, achieves a mere 60% on code correctness; (ii) on average, we could successfully execute security exploits on more than half of the correct programs generated by LLMs; and (iii) in less popular backend frameworks, models further struggle to generate correct and secure applications. Progress on BaxBench signifies important steps toward autonomous and secure software development with LLMs.
Paperid:993
Authors:Ming Lin · Lin CHEN
Abstract:
Bloom filters (BF) are space-efficient probabilistic data structures for approximate membership testing. Boosted by the proliferation of machine learning, learned Bloom filters (LBF) were recently proposed by augmenting the canonical BFs with a learned oracle as a pre-filter, the size of which is crucial to the compactness of the overall system. In this paper, inspired by ensemble learning, we depart from the state-of-the-art single-oracle LBF structure by demonstrating that, by leveraging multiple learning oracles of smaller size and carefully optimizing the accompanied backup filters, we can significantly boost the performance of LBF under the same space budget. We then design and optimize ensemble learned Bloom filters for mutually independent and correlated learning oracles respectively. We also empirically demonstrate the performance improvement of our propositions under three practical data analysis tasks.
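For readers unfamiliar with the canonical structure being augmented here, a minimal Bloom filter can be written with the standard library alone. This sketch is illustrative only: the hashing scheme (salted SHA-256) and sizes are assumptions, and the paper's learned-oracle pre-filters and backup-filter optimization are not shown.

```python
import hashlib

class BloomFilter:
    """A canonical Bloom filter: k hash positions set per inserted item.

    Queries can yield false positives but never false negatives.
    """

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive k positions by salting SHA-256 with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))
```

A learned Bloom filter in the sense described above would place a small classifier in front of such a structure, reserving the backup filter for items the classifier rejects.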
Paperid:994
Authors:Hongbi ZHOU · Zhangkai NI
Abstract:
3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis. However, existing methods struggle to adaptively optimize the distribution of Gaussian primitives based on scene characteristics, making it challenging to balance reconstruction quality and efficiency. Inspired by human perception, we propose scene-adaptive perceptual densification for Gaussian Splatting (Perceptual-GS), a novel framework that integrates perceptual sensitivity into the 3DGS training process to address this challenge. We first introduce a perception-aware representation that models human visual sensitivity while constraining the number of Gaussian primitives. Building on this foundation, we develop a perceptual sensitivity-adaptive distribution mechanism to allocate finer Gaussian granularity to visually critical regions, enhancing reconstruction quality and robustness. Extensive evaluations on multiple datasets, including BungeeNeRF for large-scale scenes, demonstrate that Perceptual-GS achieves state-of-the-art performance in reconstruction quality, efficiency, and robustness. The code will be made publicly available.
Paperid:995
Authors:Siqi Huang · Yanchen Xu · Hongyuan Zhang · Xuelong Li
Abstract:
Although graph contrastive learning (GCL) has been widely investigated, it is still a challenge to generate effective and stable graph augmentations. Existing methods often apply heuristic augmentation like random edge dropping, which may disrupt important graph structures and result in unstable GCL performance. In this paper, we propose Positive-incentive Noise driven Graph Data Augmentation (PiNGDA), where positive-incentive noise (pi-noise) provides a principled, information-theoretic analysis of the beneficial effect of noise. To bridge the standard GCL and pi-noise frameworks, we design a Gaussian auxiliary variable to convert the loss function to information entropy. We prove that standard GCL with pre-defined augmentations is equivalent to estimating the beneficial noise via point estimation. Following our analysis, PiNGDA instead learns the beneficial noise on both topology and attributes through a trainable noise generator for graph augmentations. Since the generator learns how to produce beneficial perturbations on graph topology and node attributes, PiNGDA is more reliable than existing methods. Extensive experimental results validate the effectiveness and stability of PiNGDA.
Paperid:996
Authors:Xuekai Zhu · Daixuan Cheng · Hengli Li · Kaiyan Zhang · Ermo Hua · Xingtai Lv · Ning Ding · Zhouhan Lin · Zilong Zheng · Bowen Zhou
Abstract:
Model collapse in synthetic data indicates that iterative training on self-generated data leads to a gradual decline in performance. With the proliferation of AI models, synthetic data will fundamentally reshape the web data ecosystem. Future GPT-$\{n\}$ models will inevitably be trained on a blend of synthetic and human-produced data. In this paper, we focus on two questions: what is the impact of synthetic data on language model training, and how to synthesize data without model collapse? We first pre-train language models across different proportions of synthetic data, revealing a negative correlation between the proportion of synthetic data and model performance. We further conduct statistical analysis on synthetic data to uncover the distributional shift phenomenon and over-concentration of n-gram features. Inspired by the above findings, we propose token editing on human-produced data to obtain semi-synthetic data. As a proof of concept, we theoretically demonstrate that token-level editing can prevent model collapse, as the test error is constrained by a finite upper bound. We conduct extensive experiments on pre-training from scratch, continual pre-training, and supervised fine-tuning. The results validate our theoretical proof that token-level editing improves data quality and enhances model performance.
Paperid:997
Authors:Vikram Kher · Manolis Zampetakis
Abstract:
When can the distributional assumptions of theorems and learning algorithms be trusted? Inspired by this question, Rubinfeld and Vasilyan (2023) initiated the study of testable learning. In this schema, we always learn one of the following two things: either we have achieved the desired accuracy regardless of whether the distributional assumptions are satisfied, or the input distribution does not satisfy the original distributional assumptions. Motivated by the challenge of relying on strong distributional assumptions in many theorems in mechanism design, we develop a testable learning framework for mechanism design. Traditional models in mechanism design assume that value distributions satisfy some notion of regularity. Unfortunately, testing regularity is not possible in the original testable learning framework, as we show. To bypass this impossibility, we propose a regularized version of the testable learning framework. Under this framework, we always learn one of the following two things: either we achieve high revenue compared to the best possible revenue of any regular distribution close to the input distribution, or the input distribution does not satisfy regularity. We then use this framework to provide: 1) a tester-learner pair for revenue optimal mechanisms, 2) a tester for whether the fundamental Bulow-Klemperer Theorem (Bulow and Klemperer 1996) is applicable to a given dataset, and 3) a tester to confirm the existence of an anonymous reserve price that results in the anonymous price auction securing a constant fraction of the optimal revenue.
Paperid:998
Authors:Andrew Williams · Arjun Ashok · Étienne Marcotte · Valentina Zantedeschi · Jithendaraa Subramanian · Roland Riachi · James Requeima · Alexandre Lacoste · Irina Rish · Nicolas Chapados · Alexandre Drouin
Abstract:
Forecasting is a critical task in decision-making across numerous domains. While historical numerical data provide a start, they fail to convey the complete context for reliable and accurate predictions. Human forecasters frequently rely on additional information, such as background knowledge and constraints, which can efficiently be communicated through natural language. However, in spite of recent progress with LLM-based forecasters, their ability to effectively integrate this textual information remains an open question. To address this, we introduce "Context is Key" (CiK), a time-series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context, requiring models to integrate both modalities; crucially, every task in CiK requires understanding textual context to be solved successfully. We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters, and propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark. Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings. This benchmark aims to advance multimodal forecasting by promoting models that are both accurate and accessible to decision-makers with varied technical expertise. The benchmark can be visualized at https://anon-forecast.github.io/benchmarkreportdev/.
Paperid:999
Authors:Sean Young
Abstract:
In recent years, the compression of large language models (LLMs) has emerged as a key problem in facilitating LLM deployment on resource-limited devices, reducing compute costs, and mitigating the environmental footprint due to large-scale AI infrastructure. Here, we lay down the foundations of LLM quantization from a rate–distortion theory perspective and propose a quantization technique based on simple rate–distortion optimization. Our technique scales to models containing hundreds of billions of weight parameters and offers users the flexibility to compress models, post-training, to a model size or accuracy specified by the user.
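As a point of reference for the rate–distortion framing, a uniform scalar quantizer illustrates the basic trade-off: more bits per weight means lower distortion. This is a hedged, stdlib-only sketch and not the paper's technique; the function names and the mean-squared-error distortion measure are assumptions of this illustration.

```python
def quantize(weights, bits):
    """Uniform scalar quantization of a list of weights to 2**bits levels."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1) if levels > 1 else (hi - lo)
    if step == 0:  # all weights identical: nothing to quantize
        return list(weights)
    # Snap each weight to the nearest reconstruction level.
    return [lo + round((w - lo) / step) * step for w in weights]

def distortion(weights, quantized):
    """Mean squared error between original and quantized weights."""
    return sum((w - q) ** 2 for w, q in zip(weights, quantized)) / len(weights)
```

Increasing the bit budget shrinks the quantization step and hence the distortion; a rate–distortion optimizer of the kind described above would choose where to spend bits rather than spreading them uniformly.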
Paperid:1000
Authors:sai surya · Inderjit Dhillon
Abstract:
Transformers have had tremendous impact on several sequence-related tasks, largely due to their ability to retrieve from any part of the sequence via softmax-based dot-product attention. This mechanism plays a crucial role in the Transformer's performance. We analyze the gradients backpropagated through the softmax operation in the attention mechanism and observe that these gradients can often be small. This poor gradient signal backpropagation can lead to inefficient learning of parameters preceding the attention operations. To this end, we introduce a new attention mechanism called LASER, which we analytically show to admit a larger gradient signal. We show that LASER attention can be implemented by making small modifications to existing attention implementations. We conduct experiments on autoregressive large language models (LLMs) with up to 2.2 billion parameters, where we show up to 3.38% and an average of ~1% improvement over standard attention on downstream evaluations. Using LASER gives improvements in generalization performance across a variety of tasks (vision, text, and speech): Vision Transformer (ViT) on ImageNet, Conformer on LibriSpeech speech-to-text, and BERT with 2.2 billion parameters.
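The observation that gradients through softmax can be small is easy to reproduce: the diagonal of the softmax Jacobian, $p_i(1-p_i)$, shrinks as the attention distribution becomes peaked. The following stdlib sketch demonstrates that effect only; it is not the LASER mechanism itself, and the function names are assumptions.

```python
import math

def softmax(xs):
    # Numerically stable softmax.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def softmax_jacobian_diag(xs):
    # Diagonal of the softmax Jacobian: d softmax(x)_i / d x_i = p_i * (1 - p_i).
    p = softmax(xs)
    return [pi * (1.0 - pi) for pi in p]
```

For uniform logits `[0, 0, 0]` each diagonal entry equals 2/9, while for peaked logits `[10, 0, 0]` every entry falls below 1e-4, so very little gradient flows back through the attention scores once one position dominates.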
Paperid:1001
Authors:Niket Patel · Guido Montufar
Abstract:
We define the local complexity of a neural network with continuous piecewise linear activations as a measure of the density of linear regions over an input data distribution. We show theoretically that ReLU networks that learn low-dimensional feature representations have a lower local complexity. This allows us to connect recent empirical observations on feature learning at the level of the weight matrices with concrete properties of the learned functions. In particular, we show that the local complexity serves as an upper bound on the total variation of the function over the input data distribution and thus that feature learning can be related to adversarial robustness. Lastly, we consider how optimization drives ReLU networks towards solutions with lower local complexity. Overall, this work contributes a theoretical framework towards relating geometric properties of ReLU networks to different aspects of learning such as feature learning and representation cost.
Paperid:1002
Authors:Abheek Ghosh · Tzeh Neoh · Nicholas Teh · Giannis Tyrovolas
Abstract:
We study a model of subscription-based platforms where users pay a fixed fee for unlimited access to content, and creators receive a share of the revenue. Existing approaches to detecting fraud predominantly rely on machine learning methods, engaging in an ongoing arms race with bad actors. We explore revenue division mechanisms that inherently disincentivize manipulation. We formalize three types of manipulation-resistance axioms and examine which existing rules satisfy them. We show that a mechanism widely used by streaming platforms not only fails to prevent fraud but also makes detecting manipulation computationally intractable. We also introduce a novel rule, ScaledUserProp, that satisfies all three manipulation-resistance axioms. Finally, experiments with both real-world and synthetic streaming data support ScaledUserProp as a fairer alternative to existing rules.
Paperid:1003
Authors:Menglin Xia · Victor Ruehle · Saravanakumar Rajmohan · Reza Shokri
Abstract:
How effectively can LLM-based AI assistants utilize their memory (context) to perform various tasks? Traditional data benchmarks, which are often manually crafted, suffer from several limitations: they are static, susceptible to overfitting, difficult to interpret, and lack actionable insights--failing to pinpoint the specific capabilities a model lacks when it does not pass a test. In this paper, we present a framework for automatically generating a comprehensive set of tests to evaluate models' abilities to use their memory effectively. Our framework extends the range of capability tests beyond the commonly explored (passkey, key-value, needle in the haystack) search, a dominant focus in the literature. Specifically, we evaluate models on atomic tasks such as searching, recalling, editing, matching, comparing information in context memory, and performing basic operations when inputs are structured into distinct blocks, simulating real-world data. Additionally, we design composite tests to investigate the models' ability to maintain state while operating on memory. Our benchmark enables an interpretable, detailed assessment of memory capabilities of LLMs.
Paperid:1004
Authors:Zijian Liang · Kai Niu · Changshuo Wang · Jin Xu · Ping Zhang
Abstract:
Recent contributions of semantic information theory reveal the set-element relationship between semantic and syntactic information, represented as synonymous relationships. In this paper, we propose a synonymous variational inference (SVI) method based on this synonymity viewpoint to re-analyze the perceptual image compression problem. It takes perceptual similarity as a typical synonymous criterion to build an ideal synonymous set (Synset), and approximates the posterior of its latent synonymous representation with a parametric density by minimizing a partial semantic KL divergence. This analysis theoretically proves that the optimization direction of perceptual image compression follows a triple tradeoff that can cover the existing rate-distortion-perception schemes. Additionally, we introduce synonymous image compression (SIC), a new image compression scheme that corresponds to the analytical process of SVI, and implement a progressive SIC codec to fully leverage the model’s capabilities. Experimental results demonstrate comparable rate-distortion-perception performance using a single progressive SIC codec, thus verifying the effectiveness of our proposed analysis method.
Paperid:1005
Authors:Amirhesam Abedsoltan · Huaqing Zhang · Hongzhou Lin · Kaiyue Wen · Jingzhao Zhang · Misha Belkin
Abstract:
Large language models (LLMs) exhibit remarkable task generalization, solving tasks they were never explicitly trained on with only a few demonstrations. This raises a fundamental question: When can learning from a small set of tasks generalize to a large task family? In this paper, we investigate task generalization through the lens of autoregressive compositional structure, where each task is a composition of $T$ operations, and each operation is among a finite family of $D$ subtasks. This yields a total class of size $D^T$. We first show that generalization to all $D^T$ tasks is theoretically achievable by training on only $\tilde{O}(D)$ tasks. Empirically, we demonstrate that Transformers achieve such exponential task generalization on sparse parity functions via In-context Learning (ICL) and chain-of-thought (CoT) reasoning. We further demonstrate this exponential generalization in arithmetic and language translation, extending beyond parity functions.
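The combinatorics behind the $D^T$ task family can be made concrete with a toy example. In this hypothetical sketch the $D$ subtasks are single-bit toggles on a small state, so every length-$T$ index sequence names a distinct task; the operation family, sizes, and names are illustrative assumptions, not the paper's construction.

```python
import itertools

D, T = 3, 4  # D primitive subtasks, compositions of length T

def make_toggle(i):
    """A hypothetical primitive subtask: flip bit i of the state."""
    def op(state):
        s = list(state)
        s[i] ^= 1
        return tuple(s)
    return op

ops = [make_toggle(i) for i in range(D)]

# Every length-T sequence of subtask indices is a task description,
# so the family has exactly D**T members.
tasks = list(itertools.product(range(D), repeat=T))

def run_task(task, state=(0, 0, 0)):
    """Execute a task: apply its T operations in order."""
    for idx in task:
        state = ops[idx](state)
    return state
```

With $D=3$ and $T=4$ this already yields 81 tasks from only 3 primitives, which is the exponential blow-up that makes training on roughly $\tilde{O}(D)$ tasks, rather than all $D^T$, the interesting regime.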
Paperid:1006
Authors:Yuji Wang · Zehua Chen · Chen Xiaoyu · Yixiang Wei · Jun Zhu · Jianfei Chen
Abstract:
Diffusion models have achieved remarkable progress on image-to-video (I2V) generation, while their noise-to-data generation process is inherently mismatched with this task, which may lead to suboptimal synthesis quality. In this work, we present FrameBridge. By modeling the frame-to-frames generation process with a data-to-data generative process based on a bridge model, we are able to fully exploit the information contained in the given image and improve the consistency between the generation process and the I2V task. Moreover, we propose two novel techniques for the two popular settings of training I2V models. Firstly, we propose SNR-Aligned Fine-tuning (SAF), making the first attempt to fine-tune a diffusion model into a bridge model and, therefore, allowing us to utilize pre-trained diffusion-based text-to-video (T2V) models. Secondly, we propose a neural prior, further improving the synthesis quality of FrameBridge when training from scratch. Experiments conducted on WebVid-2M and UCF-101 demonstrate the superior quality of FrameBridge in comparison with its diffusion counterpart (zero-shot FVD 95 vs. 192 on MSR-VTT and non-zero-shot FVD 122 vs. 171 on UCF-101), and the advantages of our proposed SAF and neural prior for bridge-based I2V models. Demo samples can be visited at: https://framebridge-icml.github.io/
Paperid:1007
Authors:Xiangkun Hu · Lemin Kong · Tong He · David Wipf
Abstract:
The generated responses of large language models (LLMs) are often fine-tuned to human preferences through a process called reinforcement learning from human feedback (RLHF). As RLHF relies on a challenging training sequence, whereby a separate reward model is independently learned and then later applied to LLM policy updates, ongoing research effort has targeted more straightforward alternatives. In this regard, direct preference optimization (DPO) and its many offshoots circumvent the need for a separate reward training step. Instead, through the judicious use of a reparameterization trick that induces an implicit reward, DPO and related methods consolidate learning to the minimization of a single loss function. And yet despite demonstrable success in some real-world settings, we prove that DPO-based objectives are nonetheless subject to sub-optimal regularization and counter-intuitive interpolation behaviors, underappreciated artifacts of the reparameterizations upon which they are based. To this end, we introduce an explicit preference optimization framework termed EXPO that requires no analogous reparameterization to achieve an implicit reward. Quite differently, we merely posit intuitively-appealing regularization factors from scratch that transparently avoid the potential pitfalls of key DPO variants, provably satisfying regularization desiderata that prior methods do not. Empirical results serve to corroborate our analyses and showcase the efficacy of EXPO.
Paperid:1008
Authors:Juha Harviainen · Frank Sommer · Manuel Sorge · Stefan Szeider
Abstract:
We present a comprehensive classical and parameterized complexity analysis of decision tree pruning operations, extending recent research on the complexity of learning small decision trees. Thereby, we offer new insights into the computational challenges of decision tree simplification, a crucial aspect of developing interpretable and efficient machine learning models. We focus on the fundamental pruning operations of subtree replacement and raising, which are used in heuristic approaches. Surprisingly, while optimal pruning can be performed in polynomial time for subtree replacement, the problem is NP-complete for subtree raising. We therefore identify parameters and combinations thereof that lead to fixed-parameter tractability or hardness, establishing a precise borderline between these complexity classes. For example, while subtree raising is hard for small domain size $D$ or number $d$ of features, it can be solved in $D^{2d} \cdot |I|^{O(1)}$ time, where $|I|$ is the input size. We complement our theoretical findings with preliminary experimental results, demonstrating the practical implications of our analysis.
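The subtree-replacement operation discussed above can be illustrated with a short sketch. The `Node` class, `prune_replace` routine, and toy data below are illustrative assumptions, not the paper's implementation: a subtree is replaced bottom-up by a majority-label leaf whenever doing so does not increase training error.

```python
class Node:
    """Binary decision-tree node; a node is a leaf iff `feature` is None."""
    def __init__(self, feature=None, left=None, right=None, label=None):
        self.feature, self.left, self.right, self.label = feature, left, right, label

def errors(node, data):
    """Count misclassifications of `data` (list of (features-dict, label)) under `node`."""
    if node.feature is None:
        return sum(y != node.label for x, y in data)
    left = [(x, y) for x, y in data if x[node.feature] == 0]
    right = [(x, y) for x, y in data if x[node.feature] == 1]
    return errors(node.left, left) + errors(node.right, right)

def majority(data):
    labels = [y for _, y in data]
    return max(set(labels), key=labels.count) if labels else 0

def prune_replace(node, data):
    """Bottom-up subtree replacement: swap a subtree for a majority-label leaf
    whenever that does not increase training error on `data`."""
    if node.feature is None:
        return node
    left = [(x, y) for x, y in data if x[node.feature] == 0]
    right = [(x, y) for x, y in data if x[node.feature] == 1]
    node.left = prune_replace(node.left, left)
    node.right = prune_replace(node.right, right)
    leaf = Node(label=majority(data))
    return leaf if errors(leaf, data) <= errors(node, data) else node

# Toy example: the split on 'b' is redundant, so that subtree collapses to a leaf.
tree = Node('a',
            Node('b', Node(label=0), Node(label=0)),
            Node(label=1))
data = [({'a': 0, 'b': 0}, 0), ({'a': 0, 'b': 1}, 0),
        ({'a': 1, 'b': 0}, 1), ({'a': 1, 'b': 1}, 1)]
pruned = prune_replace(tree, data)
```

Subtree raising, by contrast, lifts a subtree to replace its parent, and the abstract's point is that optimizing over such raisings is NP-complete while optimal replacement is polynomial.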
Paperid:1009
Authors:Adrienne Tuynman · Rémy Degenne
Abstract:
In a fixed-confidence pure exploration problem in stochastic multi-armed bandits, an algorithm iteratively samples arms and should stop as early as possible and return the correct answer to a query about the arms' distributions. We are interested in batched methods, which change their sampling behaviour only a few times, between batches of observations. We give an instance-dependent lower bound on the number of batches used by any sample-efficient algorithm for any pure exploration task. We then give a general batched algorithm and prove upper bounds on its expected sample complexity and batch complexity. We illustrate both lower and upper bounds on best-arm identification and thresholding bandits.
Paperid:1010
Authors:Nikita Zozoulenko · Thomas Cass · Lukas Gonon
Abstract:
We introduce Random Feature Representation Boosting (RFRBoost), a novel method for constructing deep residual random feature neural networks (RFNNs) using boosting theory. RFRBoost uses random features at each layer to learn the functional gradient of the network representation, enhancing performance while preserving the convex optimization benefits of RFNNs. In the case of MSE loss, we obtain closed-form solutions to greedy layer-wise boosting with random features. For general loss functions, we show that fitting random feature residual blocks reduces to solving a quadratically constrained least squares problem. We demonstrate, through numerical experiments on 91 tabular datasets for regression and classification, that RFRBoost significantly outperforms traditional RFNNs and end-to-end trained MLP ResNets, while offering substantial computational advantages and theoretical guarantees stemming from boosting theory.
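For intuition, here is a minimal sketch of the closed-form case under MSE loss: each random-feature block is ridge-fit to the current residual, which is the negative functional gradient of squared error. The feature map, hyperparameters, and greedy loop below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_features(X, W, b):
    """Random ReLU features: phi(X) = max(X @ W + b, 0)."""
    return np.maximum(X @ W + b, 0.0)

def fit_residual_block(X, residual, n_feats=256, reg=1e-3):
    """Closed-form ridge fit of one random-feature block to the residual
    (the negative functional gradient of squared error)."""
    d = X.shape[1]
    W = rng.normal(size=(d, n_feats)) / np.sqrt(d)   # frozen random weights
    b = rng.normal(size=n_feats)
    Phi = random_features(X, W, b)
    A = Phi.T @ Phi + reg * np.eye(n_feats)          # ridge normal equations
    beta = np.linalg.solve(A, Phi.T @ residual)
    return W, b, beta

# Greedy layer-wise boosting on a toy regression problem.
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
pred = np.zeros_like(y)
for _ in range(3):                                   # three boosted blocks
    W, b, beta = fit_residual_block(X, y - pred)
    pred = pred + random_features(X, W, b) @ beta
mse = np.mean((y - pred) ** 2)
```

Because only the output weights `beta` are learned and each fit is a linear solve, the convexity benefits of RFNNs mentioned above are preserved at every layer.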
Paperid:1011
Authors:Kexin Huang · Ying Jin · Ryan Li · Michael Li · Emmanuel J Candes · Jure Leskovec
Abstract:
Hypotheses are central to information acquisition, decision-making, and discovery. However, many real-world hypotheses are abstract, high-level statements that are difficult to validate directly. This challenge is further intensified by the rise of hypothesis generation from Large Language Models (LLMs), which are prone to hallucination and produce hypotheses in volumes that make manual validation impractical. Here we propose Popper, an agentic framework for rigorous automated validation of free-form hypotheses. Guided by Karl Popper's principle of falsification, Popper validates a hypothesis using LLM agents that design and execute falsification experiments targeting its measurable implications. A novel sequential testing framework ensures strict Type-I error control while actively gathering evidence from diverse observations, whether drawn from existing data or newly conducted procedures. We demonstrate Popper on six domains including biology, economics, and sociology. Popper delivers robust error control, high power, and scalability. Furthermore, compared to human scientists, Popper achieved comparable performance in validating complex biological hypotheses while reducing time by 10-fold, providing a scalable, rigorous solution for hypothesis validation.
Paperid:1012
Authors:Dongze Wu · Yao Xie
Abstract:
Sampling from high-dimensional, multi-modal distributions remains a fundamental challenge across domains such as statistical Bayesian inference and physics-based machine learning. In this paper, we propose Annealing Flow (AF), a method built on Continuous Normalizing Flows (CNFs) for sampling from high-dimensional and multi-modal distributions. AF is trained with a dynamic Optimal Transport (OT) objective incorporating Wasserstein regularization, and guided by annealing procedures, facilitating effective exploration of modes in high-dimensional spaces. Compared to recent NF methods, AF significantly improves training efficiency and stability, with minimal reliance on MC assistance. We demonstrate the superior performance of AF compared to state-of-the-art methods through extensive experiments on various challenging distributions and real-world datasets, particularly in high-dimensional and multi-modal settings. We also highlight AF’s potential for sampling the least favorable distributions.
Paperid:1013
Authors:Wei Shen · Ruida Zhou · Jing Yang · Cong Shen
Abstract:
While transformers have demonstrated impressive capacities for in-context learning (ICL) in practice, theoretical understanding of the underlying mechanism that enables transformers to perform ICL is still in its infancy. This work aims to theoretically study the training dynamics of transformers for in-context classification tasks. We demonstrate that, for in-context classification of Gaussian mixtures under certain assumptions, a single-layer transformer trained via gradient descent converges to a globally optimal model at a linear rate. We further quantify the impact of the training and testing prompt lengths on the ICL inference error of the trained transformer. We show that when the lengths of training and testing prompts are sufficiently large, the prediction of the trained transformer approaches the ground-truth distribution of the labels.
Paperid:1014
Authors:TAK HUR · Daniel Kyungdeock Park
Abstract:
Understanding and improving generalization capabilities is crucial for both classical and quantum machine learning (QML). Recent studies have revealed shortcomings in current generalization theories, particularly those relying on uniform bounds, across both classical and quantum settings. In this work, we present a margin-based generalization bound for QML models, providing a more reliable framework for evaluating generalization. Our experimental studies on the quantum phase recognition dataset demonstrate that margin-based metrics are strong predictors of generalization performance, outperforming traditional metrics like parameter count. By connecting this margin-based metric to quantum information theory, we demonstrate how to enhance the generalization performance of QML through a classical-quantum hybrid approach when applied to classical data.
Paperid:1015
Authors:Tao Feng · Wei Li · Didi Zhu · Hangjie Yuan · Wendi Zheng · Dan Zhang · Jie Tang
Abstract:
Backpropagation provides a generalized configuration for overcoming catastrophic forgetting; for example, SGD and Adam are commonly used for weight updates in continual learning and continual pre-training. In practice, however, permission to access gradient information is not always granted (the gradient ban), such as with black-box APIs, hardware limitations, and non-differentiable systems. To bridge this gap, we introduce ZeroFlow, the first benchmark to evaluate gradient-free optimization algorithms for overcoming forgetting. This benchmark examines a suite of forward-pass methods across multiple optimization algorithms, forgetting scenarios, and datasets. We find that forward passes alone are enough to overcome forgetting. Our findings reveal new optimization principles that highlight the potential of forward passes in mitigating forgetting, managing task conflicts, and reducing memory demands, alongside novel enhancements that further mitigate forgetting with just one forward pass. This work provides essential insights and tools for advancing forward-pass methods to overcome forgetting. Code will be available upon publication.
Paperid:1016
Authors:Xuwei Xu · Yang Li · Yudong Chen · Jiajun LIU · Sen Wang
Abstract:
We reveal that feedforward network (FFN) layers, rather than attention layers, are the primary contributors to Vision Transformer (ViT) inference latency, with their impact intensifying as model size increases. This finding highlights a critical opportunity for optimizing the efficiency of large-scale ViTs by focusing on FFN layers. In this work, we propose a novel channel idle mechanism that facilitates post-training structural reparameterization for efficient FFN layers during testing. Specifically, a set of feature channels remains idle and bypasses the nonlinear activation function in each FFN layer, thereby forming a linear pathway that enables structural reparameterization during inference. This mechanism results in a family of Re-Parameterizable Vision Transformers (RePaViTs), which achieve remarkable latency reductions with acceptable sacrifices (sometimes gains) in accuracy across various ViTs. The benefits of our method scale consistently with model sizes, demonstrating greater speed improvements and progressively narrowing accuracy gaps, or even higher accuracies, on larger models. In particular, RePaViT-Large and RePaViT-Huge enjoy 66.8% and 68.7% speed-ups with +1.7% and +1.1% higher top-1 accuracies under the same training strategy, respectively. To the best of our knowledge, RePaViT is the first to employ structural reparameterization on FFN layers to expedite ViTs, and we believe that it represents an auspicious direction for efficient ViTs.
Paperid:1017
Authors:Kaixuan Zhang · Hu Wang · Minxian Li · Mingwu Ren · Mao Ye · Xiatian Zhu
Abstract:
High Dynamic Range Novel View Synthesis (HDR-NVS) aims to establish a 3D scene HDR model from Low Dynamic Range (LDR) imagery. Typically, multiple-exposure LDR images are employed to capture a wider range of brightness levels in a scene, as a single LDR image cannot represent both the brightest and darkest regions simultaneously. While effective, this multiple-exposure HDR-NVS approach has significant limitations, including susceptibility to motion artifacts (e.g., ghosting and blurring) and high capture and storage costs. To overcome these challenges, we introduce, for the first time, the single-exposure HDR-NVS problem, where only single-exposure LDR images are available during training. We further introduce a novel approach, Mono-HDR-3D, featuring two dedicated modules formulated by the LDR image formation principles: one for converting LDR colors to HDR counterparts, and the other for transforming HDR images to LDR format so that unsupervised learning is enabled in a closed loop. Designed as a meta-algorithm, our approach can be seamlessly integrated with existing NVS models. Extensive experiments show that Mono-HDR-3D significantly outperforms previous methods. Source code will be released.
Paperid:1018
Authors:Nitin Bisht · Xiuwen Gong · Guandong Xu
Abstract:
Although Recommender Systems (RS) have been well-developed for various fields of application, they often suffer from a crisis of platform credibility with respect to RS confidence and fairness, which may drive users away, threatening the platform's long-term success. In recent years, some works have tried to solve these issues; however, they lack strong statistical guarantees. Therefore, there is an urgent need to solve both issues within a unifying framework with robust statistical guarantees. In this paper, we propose a novel and reliable framework called Equitable and Statistically Unbiased Recommendation (ENSUR) to dynamically generate prediction sets for users across various groups, which are guaranteed 1) to include ground-truth items with user-predefined high confidence/probability (e.g., 90\%); 2) to ensure user fairness across different groups; and 3) to have minimal average prediction set sizes for efficiency. We further design an efficient algorithm named Guaranteed User Fairness Algorithm (GUFA) to optimize the proposed method and derive upper bounds of risk and fairness metrics to speed up the optimization process. Moreover, we provide rigorous theoretical analysis concerning risk and fairness control and minimum set size. Extensive experiments validate the effectiveness of the proposed framework, which aligns with our theoretical analysis.
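The per-group coverage guarantee described above is in the spirit of split-conformal calibration. The sketch below is an illustration of that general idea only, not the paper's GUFA algorithm: calibrating a score threshold separately per user group yields prediction sets whose coverage meets a user-predefined level in every group.

```python
import numpy as np

def conformal_threshold(scores_cal, alpha):
    """Split-conformal quantile: with n calibration nonconformity scores,
    the ceil((n+1)(1-alpha))-th smallest score guarantees >= 1-alpha coverage."""
    n = len(scores_cal)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(scores_cal)[min(k, n) - 1]

def prediction_set(item_scores, threshold):
    """All candidate items whose nonconformity score is within the threshold."""
    return {i for i, s in enumerate(item_scores) if s <= threshold}

rng = np.random.default_rng(1)
alpha = 0.1
# Synthetic calibration nonconformity scores for two user groups;
# group g2 is "harder" (larger scores), so it needs a larger threshold.
cal = {"g1": rng.exponential(1.0, 500), "g2": rng.exponential(2.0, 500)}
thr = {g: conformal_threshold(s, alpha) for g, s in cal.items()}
# Empirical coverage on fresh draws from each group's score distribution.
cov = {g: float(np.mean(rng.exponential(1.0 if g == "g1" else 2.0, 5000) <= thr[g]))
       for g in thr}
```

Calibrating per group equalizes coverage (a fairness notion) at the cost of larger sets for harder groups, which is the size/coverage trade-off the abstract's third guarantee addresses.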
Paperid:1019
Authors:Minqin Zhu · Zexu Sun · Ruoxuan Xiong · Anpeng Wu · Baohong Li · Caizhi Tang · JUN ZHOU · Fei Wu · Kun Kuang
Abstract:
Uplift modeling is crucial for identifying individuals likely to respond to a treatment in applications like marketing and customer retention, but evaluating these models is challenging due to the inaccessibility of counterfactual outcomes in real-world settings. In this paper, we identify a fundamental limitation in existing evaluation metrics, such as the uplift and Qini curves, which fail to rank individuals with negative outcomes accurately. This can lead to biased evaluations, where biased models receive higher curve values than unbiased ones, resulting in suboptimal model selection. To address this, we propose the Principled Uplift Curve (PUC), a novel evaluation metric that assigns equal curve values to individuals with both positive and negative outcomes, offering a more balanced and unbiased assessment. We then derive the Principled Uplift Loss (PUL) function from the PUC and integrate it into a new uplift model, the Principled Treatment and Outcome Network (PTONet), to reduce bias during uplift model training. Experiments on both simulated and real-world datasets demonstrate that the PUC provides less biased evaluations, while PTONet outperforms existing methods.
Paperid:1020
Authors:YIQING LI · Yewei Xia · Xiaofei Wang · Zhengming Chen · Liuhua Peng · Mingming Gong · Kun Zhang
Abstract:
Discovering dependence patterns between variables from observational data is a fundamental issue in data analysis. However, existing testing methods often fail to detect subtle yet critical patterns that occur within small regions of the data distribution, patterns we term rare dependence. These rare dependencies obscure the true underlying dependence structure between variables, particularly in causal discovery tasks. To address this issue, we propose a novel testing method that combines kernel-based (conditional) independence testing with adaptive sample importance reweighting. By learning and assigning higher importance weights to data points exhibiting significant dependence, our method amplifies the patterns and can detect them successfully. Theoretically, we analyze the asymptotic distributions of the statistics in this method and show the uniform bound of the learning scheme. Furthermore, we integrate our tests into the PC algorithm, a constraint-based approach for causal discovery, equipping it to uncover causal relationships even in the presence of rare dependence. Empirical evaluation on synthetic and real-world datasets comprehensively demonstrates the efficacy of our method.
Paperid:1021
Authors:xihong yang · Siwei Wang · Fangdi Wang · Jiaqi Jin · Suyuan Liu · Yue Liu · En Zhu · Xinwang Liu · Yueming Jin
Abstract:
Leveraging powerful representation learning capabilities, deep multi-view clustering methods have demonstrated reliable performance in recent years by effectively integrating multi-source information from diverse views. Most existing methods rely on the assumption of clean views. However, noise is pervasive in real-world scenarios, leading to a significant degradation in performance. To tackle this problem, we propose a novel multi-view clustering framework for the automatic identification and rectification of noisy data, termed AIRMVC. Specifically, we reformulate noise identification as an anomaly identification problem using GMM. We then design a hybrid rectification strategy to mitigate the adverse effects of noisy data based on the identification results. Furthermore, we introduce a noise-robust contrastive mechanism to generate reliable representations. Additionally, we provide a theoretical proof demonstrating that these representations can discard noisy information, thereby improving the performance of downstream tasks. Extensive experiments on six benchmark datasets demonstrate that AIRMVC outperforms state-of-the-art algorithms in terms of robustness in noisy scenarios.
Paperid:1022
Authors:Jinwu Hu · Zitian Zhang · Guohao Chen · Xutao Wen · Chao Shuai · Wei Luo · Bin Xiao · Yuanqing Li · Mingkui Tan
Abstract:
While Large Language Models (LLMs) have exhibited remarkable emergent capabilities through extensive pretraining, they still face critical limitations in generalizing to specialized domains and handling diverse linguistic variations, known as distribution shifts. In this paper, we propose a Test-Time Learning (TTL) paradigm for LLMs, namely TLM, which dynamically adapts LLMs to target domains using only unlabeled test data during testing. Specifically, we first provide empirical evidence and theoretical insights to reveal that more accurate predictions from LLMs can be achieved by minimizing the input perplexity of the unlabeled test data. Based on this insight, we formulate the Test-Time Learning process of LLMs as input perplexity minimization, enabling self-supervised enhancement of LLM performance. Furthermore, we observe that high-perplexity samples tend to be more informative for model optimization. Accordingly, we introduce a Sample Efficient Learning Strategy that actively selects and emphasizes these high-perplexity samples for test-time updates. Lastly, to mitigate catastrophic forgetting and ensure adaptation stability, we adopt Low-Rank Adaptation (LoRA) instead of full-parameter optimization, which allows lightweight model updates while preserving more original knowledge from the model. We introduce the AdaptEval benchmark for TTL and demonstrate through experiments that TLM improves performance by at least 20% compared to original LLMs on domain knowledge adaptation.
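The high-perplexity sample selection above can be sketched in toy form: compute per-sample perplexity from token log-probabilities and keep the most surprising samples for test-time updates. The log-probabilities below are made up for illustration; in practice they would come from the LLM being adapted, and the selected samples would drive LoRA updates.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean token log-likelihood); higher means more surprising."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def select_high_perplexity(samples, k):
    """Keep the k samples the model finds most surprising; under the TTL view
    these are the most informative for test-time updates."""
    ranked = sorted(samples, key=lambda s: perplexity(s["logprobs"]), reverse=True)
    return ranked[:k]

# Hypothetical per-token log-probabilities for three unlabeled test samples.
samples = [
    {"id": "easy",   "logprobs": [-0.1, -0.2, -0.1]},
    {"id": "medium", "logprobs": [-1.0, -1.2, -0.8]},
    {"id": "hard",   "logprobs": [-3.0, -2.5, -2.8]},
]
chosen = select_high_perplexity(samples, k=2)
```

Minimizing input perplexity on such selected samples is the self-supervised objective the abstract formulates; no labels are needed at test time.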
Paperid:1023
Authors:Anji Liu · Xuejie Liu · Dayuan Zhao · Mathias Niepert · Yitao Liang · Guy Van den Broeck
Abstract:
Non-autoregressive (NAR) generative models are valuable because they can handle diverse conditional generation tasks in a more principled way than their autoregressive (AR) counterparts, which are constrained by sequential dependency requirements. Recent advancements in NAR models, such as diffusion language models, have demonstrated superior performance in unconditional generation compared to AR models (e.g., GPTs) of similar sizes. However, such improvements do not always lead to improved conditional generation performance. We show that a key reason for this gap is the difficulty in generalizing to conditional probability queries unseen during training. As a result, strong unconditional generation performance does not guarantee high-quality conditional generation. This paper proposes Tractable Transformers (Tracformer), a Transformer-based generative model that is more robust to different conditional generation tasks. Unlike existing models that rely solely on global contextual features derived from full inputs, Tracformers incorporate a sparse Transformer encoder to capture both local and global contextual information. This information is routed through a decoder for conditional generation. Empirical results demonstrate that Tracformers achieve state-of-the-art conditional generation performance on text modeling compared to recent diffusion and AR model baselines.
Paperid:1024
Authors:Yu He · Ellen Vitercik
Abstract:
Neural Algorithmic Reasoning (NAR) trains neural networks to simulate classical algorithms, enabling structured and interpretable reasoning over complex data. While prior research has predominantly focused on learning exact algorithms for polynomial-time-solvable problems, extending NAR to harder problems remains an open challenge. In this work, we introduce a general NAR framework grounded in the primal-dual paradigm, a classical method for designing efficient approximation algorithms. By leveraging a bipartite representation between primal and dual variables, we establish an alignment between primal-dual algorithms and Graph Neural Networks. Furthermore, we incorporate optimal solutions from small instances to greatly enhance the model’s reasoning capabilities. Our empirical results demonstrate that our model not only simulates but also outperforms approximation algorithms for multiple tasks, exhibiting robust generalization to larger and out-of-distribution graphs. Moreover, we highlight the framework’s practical utility by integrating it with commercial solvers and applying it to real-world datasets. Our code can be found at \url{https://anonymous.4open.science/r/pdnar}.
Paperid:1025
Authors:Ankit Pratap Singh · Ahmed Abbasi · Namrata Vaswani
Abstract:
This work has two contributions. First, we introduce provably secure (Byzantine-resilient), sample- and communication-efficient alternating gradient descent (GD) and minimization based algorithms for solving the federated low-rank matrix completion (LRMC) problem. This involves learning a low-rank (LR) matrix from a small subset of its entries. Second, we extend our ideas to show how a special case of our algorithms also solves another partly-decoupled vertically federated LR matrix learning problem, that is, LR column-wise sensing (LRCS), also referred to as multi-task linear representation learning in the literature. Finally, we also show how our results can be extended to the LR phase retrieval problem. In all problems, we consider column-wise or vertical federation, i.e., each node observes a small subset of entries of a disjoint column sub-matrix of the entire LR matrix. For the LRMC problem, horizontal federation is equivalent since the problem is symmetric across rows and columns; for the other two it is not. In all problems, the data at different nodes is heterogeneous (not identically distributed), making it harder to obtain provable guarantees.
Paperid:1026
Authors:Annabelle Carrell · Albert Gong · Abhishek Shetty · Raaz Dwivedi · Lester Mackey
Abstract:
The goal in thinning is to summarize a dataset using a small set of representative points. Remarkably, sub-Gaussian thinning algorithms like Kernel Halving and Compress can match the quality of uniform subsampling while substantially reducing the number of summary points. However, existing guarantees cover only a restricted range of distributions and kernel-based quality measures and suffer from pessimistic dimension dependence. To address these deficiencies, we introduce a new low-rank analysis of sub-Gaussian thinning that applies to any distribution and any kernel, guaranteeing high-quality compression whenever the kernel or data matrix is approximately low-rank. To demonstrate the broad applicability of the techniques, we design practical sub-Gaussian thinning approaches that improve upon the best known guarantees for approximating attention in transformers, accelerating stochastic gradient training through reordering, and distinguishing distributions in near-linear time.
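Kernel Halving and Compress themselves are more involved; as a hedged stand-in, the sketch below uses greedy kernel herding to illustrate what thinning optimizes: a small summary whose kernel mean embedding stays close to that of the full dataset (small MMD). All names and parameters here are illustrative assumptions, not the paper's algorithms.

```python
import numpy as np

def gaussian_kernel(X, Y, bw=1.0):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 bw^2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw ** 2))

def herd(X, m, bw=1.0):
    """Greedy kernel herding: pick m points whose empirical kernel mean
    tracks the full-data kernel mean (a simple stand-in for thinning)."""
    K = gaussian_kernel(X, X, bw)
    mean_embed = K.mean(axis=1)            # kernel mean embedding at each point
    chosen, running = [], np.zeros(len(X))
    for t in range(m):
        score = mean_embed - running / (t + 1)
        score[chosen] = -np.inf            # sample without replacement
        i = int(np.argmax(score))
        chosen.append(i)
        running += K[:, i]
    return chosen

def mmd2(X, idx, bw=1.0):
    """Squared MMD between the summary X[idx] and the full set X."""
    K = gaussian_kernel(X, X, bw)
    S = np.array(idx)
    return K.mean() - 2 * K[:, S].mean() + K[np.ix_(S, S)].mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
herd_idx = herd(X, 20)
unif_idx = list(rng.choice(400, 20, replace=False))   # uniform-subsampling baseline
```

A well-chosen summary of this kind typically achieves a much smaller MMD than uniform subsampling at the same size, which is the gap the abstract's guarantees quantify.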
Paperid:1027
Authors:Yang Yang · Haonan Xu
Abstract:
Out-of-distribution (OOD) detection aims to ensure AI system reliability by rejecting inputs outside the training distribution. Recent work shows that memorizing atypical samples during later stages of training can hurt OOD detection, while strategies for forgetting them show promising improvements. However, directly forgetting atypical samples sacrifices ID generalization and limits the model's OOD detection capability. To address this issue, we propose the Progressive Self-Knowledge Distillation (PSKD) framework, which strengthens OOD detection capability by leveraging self-provided uncertainty-embedded targets. Specifically, PSKD adaptively selects a self-teacher model from the training history using pseudo-outliers, facilitating the learning of uncertainty knowledge via multi-level distillation applied to features and responses. As a result, PSKD achieves better ID generalization and uncertainty estimation for OOD detection. Moreover, PSKD is orthogonal to most existing methods and can be integrated as a plug-in to collaborate with them. Extensive experiments verify the effectiveness of PSKD. The code is available at https://anonymous.4open.science/r/PSKD-6AF6.
Paperid:1028
Authors:Changdae Oh · zhen fang · Shawn Im · Xuefeng Du · Sharon Li
Abstract:
Multimodal large language models (MLLMs) have shown promising capabilities but struggle under distribution shifts, where evaluation data differ from instruction tuning distributions. Although previous works have provided empirical evaluations, we argue that establishing a formal framework that can characterize and quantify the risk of MLLMs is necessary to ensure their safe and reliable application in the real world. By taking an information-theoretic perspective, we propose the first theoretical framework that enables the quantification of the maximum risk of MLLMs under distribution shifts. Central to our framework is the introduction of Effective Mutual Information (EMI), a principled metric that quantifies the relevance between input queries and model responses. We derive an upper bound for the EMI difference between in-distribution (ID) and out-of-distribution (OOD) data, connecting it to visual and textual distributional discrepancies. Extensive experiments on real benchmark datasets, spanning 61 shift scenarios, empirically validate our theoretical insights.
Paperid:1029
Authors:Peter Chang · Keyon Vafa · Ashesh Rambachan · Sendhil Mullainathan
Abstract:
Foundation models are premised on the idea that sequence prediction can uncover deeper domain understanding, much like how Kepler's predictions of planetary motion led to the discovery of fundamental physical laws. However, evaluating whether these models truly capture underlying principles remains a challenge. We present a framework for testing foundation models by analyzing how they adapt to synthetic tasks that share mechanisms with their training domain. We introduce metrics that measure a model's inductive bias toward known models of reality. Across multiple domains, we find that models can excel at their training tasks yet fail to develop inductive biases toward the true mechanisms when adapted to new tasks. For instance, models trained on orbital trajectories fail to consistently apply Newtonian mechanics when adapted to new physics tasks. Our analysis reveals that rather than learning parsimonious representations that accord with reality, these models often develop task-specific heuristics that don't generalize.
Paperid:1030
Authors:Shijian Zheng · Fangxiao Jin · Shuhai Zhang · Quan Xue · Mingkui Tan
Abstract:
Electromagnetic structure (EMS) design plays a critical role in developing advanced antennas and materials, but remains challenging due to high-dimensional design spaces and expensive evaluations. While existing methods commonly employ high-quality predictors or generators to alleviate evaluations, they are often data-intensive and struggle with real-world scale and budget constraints. To address this, we propose a novel method called Progressive Quadtree-based Search (PQS). Rather than exhaustively exploring the high-dimensional space, PQS converts the conventional image-like layout into a quadtree-based hierarchical representation, enabling a progressive search from global patterns to local details. Furthermore, to lessen reliance on highly accurate predictors, we introduce a consistency-driven sample selection mechanism. This mechanism quantifies the reliability of predictions, balancing exploitation and exploration when selecting candidate designs. We evaluate PQS on two real-world engineering tasks, i.e., Dual-layer Frequency Selective Surface and High-gain Antenna design. Experimental results show that our method can achieve satisfactory designs under limited computational budgets, outperforming baseline methods. In particular, compared to generative approaches, it cuts evaluation costs by 75∼85%, effectively saving 20∼48 days of the product design cycle.
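The quadtree conversion of an image-like layout can be sketched as follows. This is a generic quadtree over a binary grid, assumed here purely for illustration (PQS's actual representation and search are more elaborate): uniform blocks collapse into single leaves, so coarse global patterns can be refined progressively into local details.

```python
def build_quadtree(grid, r0, c0, size):
    """Quadtree of a binary, power-of-two square layout: a uniform block
    collapses to a single leaf; otherwise recurse into four quadrants."""
    vals = {grid[r][c] for r in range(r0, r0 + size) for c in range(c0, c0 + size)}
    if len(vals) == 1:
        return vals.pop()                          # leaf: 0 or 1
    h = size // 2
    return [build_quadtree(grid, r0, c0, h),           # top-left
            build_quadtree(grid, r0, c0 + h, h),       # top-right
            build_quadtree(grid, r0 + h, c0, h),       # bottom-left
            build_quadtree(grid, r0 + h, c0 + h, h)]   # bottom-right

def count_leaves(node):
    """Number of leaves, i.e., the size of the hierarchical representation."""
    return 1 if not isinstance(node, list) else sum(count_leaves(c) for c in node)

# Toy 4x4 layout: three uniform quadrants collapse, one is refined.
layout = [[0, 0, 1, 1],
          [0, 0, 1, 1],
          [0, 1, 0, 0],
          [1, 0, 0, 0]]
tree = build_quadtree(layout, 0, 0, 4)
```

Here 16 cells compress into 7 leaves; searching over such trees lets a method commit to coarse structure first and spend its evaluation budget only where detail matters.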
Paperid:1031
Authors:Matthew Faw · Constantine Caramanis · Jessica Hoffmann
Abstract:
Unconscious bias has been shown to influence how we assess our peers, with consequences for hiring, promotions, and admissions. In this work, we focus on affinity bias, the component of unconscious bias which leads us to prefer people who are similar to us, despite no deliberate intention of favoritism. In a world where the people hired today become part of the hiring committee of tomorrow, we are particularly interested in understanding (and mitigating) how affinity bias affects this feedback loop. This problem has two distinctive features: 1) we only observe the _biased value_ of a candidate, but we want to optimize with respect to their _real value_; 2) the bias towards a candidate with a specific set of traits depends on the _fraction_ of people in the hiring committee with the same set of traits. We introduce a new bandits variant that exhibits those two features, which we call affinity bandits. Unsurprisingly, classical algorithms such as UCB often fail to identify the best arm in this setting. We prove a new instance-dependent regret lower bound, which is larger than that in the standard bandit setting by a multiplicative function of $K$. Since we treat rewards that are _time-varying_ and _dependent on the policy’s past actions_, deriving this lower bound required developing proof techniques beyond the standard bandit techniques. Finally, we design an elimination-style algorithm which nearly matches this regret, despite never observing the real rewards.
Paperid:1032
Authors:Nona Rajabi · Antonio Ribeiro · Miguel Vasco · Farzaneh Taleb · Mårten Björkman · Danica Kragic
Abstract:
Decoding visual images from brain activity has significant potential for advancing brain-computer interaction and enhancing the understanding of human perception. Recent approaches align the representation spaces of images and brain activity to enable visual decoding. In this paper, we introduce the use of human-aligned image encoders to map brain signals to images. We hypothesize that these models more effectively capture perceptual attributes associated with the rapid visual stimuli presentations commonly used in visual brain data recording experiments. Our empirical results support this hypothesis, demonstrating that this simple modification improves image retrieval accuracy by up to 21\% compared to state-of-the-art methods. Comprehensive experiments confirm consistent performance improvements across diverse EEG architectures, image encoders, alignment methods, participants, and brain imaging modalities.
Paperid:1033
Authors:Yuxin Zhou · Zheng Li · Jun Zhang · Jue Wang · Yiping Wang · Zhongle Xie · Ke Chen · Lidan Shou
Abstract:
With the widespread adoption of Mixture-of-Experts (MoE) models, there is a growing demand for efficient inference on memory-constrained devices. While offloading expert parameters to CPU memory and loading activated experts on demand has emerged as a potential solution, the large size of activated experts overburdens the limited PCIe bandwidth, hindering effectiveness in latency-sensitive scenarios. To mitigate this, we propose FloE, an on-the-fly MoE inference system for memory-constrained GPUs. FloE is built on the insight that there exists substantial untapped redundancy within sparsely activated experts. It employs various compression techniques on the experts' internal parameter matrices to reduce the data movement load, combined with low-cost sparse prediction, achieving perceptible inference acceleration in wall-clock time on resource-constrained devices. Empirically, FloE achieves a 9.3$\times$ compression of parameters per expert in Mixtral-8$\times$7B; enables deployment on a GPU with only 11GB VRAM, reducing the memory footprint by up to 8.5$\times$; and delivers a 48.7$\times$ inference speedup compared to DeepSpeed-MII on a single GeForce RTX 3090—all with only a 4.4\% $\sim$ 7.6\% average performance degradation.
Paperid:1034
Authors:Haitao Wu · Weiwei Li · Xiuyi Jia
Abstract:
Label distribution learning (LDL) is a novel learning paradigm that emulates label polysemy by assigning label distributions over the label space. However, recent LDL work seems to exhibit a notable contradiction: 1) existing LDL methods employ auxiliary tasks to enhance performance, which narrows their focus to specific applications, thereby lacking generalizability; 2) conversely, LDL methods without auxiliary tasks rely on losses tailored solely to the primary task, lacking additional beneficial data to guide the learning process. In this paper, we propose $\mathcal{S}$-LDL, a novel and minimalist solution that generates subtask label distributions, i.e., a form of extra supervised information, to reconcile the above contradiction. $\mathcal{S}$-LDL encompasses two key aspects: 1) an algorithm capable of generating subtasks without any prior/expert knowledge; and 2) a plug-and-play framework seamlessly compatible with existing LDL methods, and even adaptable to derivative tasks of LDL. Our analysis and experiments demonstrate that $\mathcal{S}$-LDL is effective and efficient. To the best of our knowledge, this paper represents the first endeavor to address LDL via subtasks. The code will soon be available on GitHub to facilitate reproducible research.
Paperid:1035
Authors:Chang Liu · Yixin Wang · Moontae Lee
Abstract:
Efficient access to high-quality information is vital for online platforms. To promote more useful information, users not only create new content, but also evaluate existing content, often through helpfulness voting. While aggregated votes help service providers rank their user content, these votes are often biased by disparate accessibility across positions and the cascaded influence of prior votes. For fairer assessment of information quality, we propose the Counterfactual Voting Adjustment (CVA), a causal framework that accounts for the context in which individual votes are cast. Through preliminary and semi-synthetic experiments, we show that CVA effectively models the position and herding biases, accurately recovering the predefined content quality. In a real-data experiment, we demonstrate that reranking content based on the quality learned by CVA exhibits stronger alignment with both user sentiment and quality evaluation assessed by GPT-4o, outperforming system rankings based on aggregated votes and model-based rerankings without causal inference. Beyond individual quality inference, our embeddings offer comparative insights into the behavioral dynamics of expert user groups across 120 major StackExchange communities.
Paperid:1036
Authors:Armin Kekić · Sergio Garrido Mejia · Bernhard Schölkopf
Abstract:
Estimating causal effects of joint interventions on multiple variables is crucial in many domains, but obtaining data from such simultaneous interventions can be challenging. Our study explores how to learn joint interventional effects using only observational data and single-variable interventions. We present an identifiability result for this problem, showing that for a class of nonlinear additive outcome mechanisms, joint effects can be inferred without access to joint interventional data. We propose a practical estimator that decomposes the causal effect into confounded and unconfounded contributions for each intervention variable. Experiments on synthetic data demonstrate that our method achieves performance comparable to models trained directly on joint interventional data, outperforming a purely observational estimator.
Paperid:1037
Authors:Shurui Gui · Xiner Li · Shuiwang Ji
Abstract:
We consider learning underlying laws of dynamical systems governed by ordinary differential equations (ODE). A key challenge is how to discover intrinsic dynamics across multiple environments while circumventing environment-specific mechanisms. Unlike prior work, we tackle more complex environments where changes extend beyond function coefficients to entirely different function forms. For example, we demonstrate the discovery of an ideal pendulum's natural motion $\alpha^2 \sin{\theta_t}$ by observing pendulum dynamics in different environments, such as the damped environment $\alpha^2 \sin(\theta_t) - \rho \omega_t$ and powered environment $\alpha^2 \sin(\theta_t) + \rho \frac{\omega_t}{\left|\omega_t\right|}$. Here, we formulate this problem as an *invariant function learning* task and propose a new method, known as **D**isentanglement of **I**nvariant **F**unctions (DIF), that is grounded in causal analysis. We propose a causal graph and design an encoder-decoder hypernetwork that explicitly disentangles invariant functions from environment-specific dynamics. The discovery of invariant functions is guaranteed by our information-based principle that enforces the independence between extracted invariant functions and environments. Quantitative comparisons with meta-learning and invariant learning baselines on three ODE systems demonstrate the effectiveness and efficiency of our method. Furthermore, symbolic regression explanation results highlight the ability of our framework to uncover intrinsic laws.
Paperid:1038
Authors:Pierre-Louis Cauvin · Davide Legacci · Panayotis Mertikopoulos
Abstract:
In this paper, we investigate how randomness and uncertainty influence learning in games. Specifically, we examine a perturbed variant of the dynamics of "follow-the-regularized-leader" (FTRL), where the players' payoff observations and strategy updates are continually affected by random shocks. Our findings reveal that, in a fairly precise sense, "uncertainty favors extremes": in any game, regardless of the noise level, every player's trajectory of play reaches an arbitrarily small neighborhood of a pure strategy in finite time (which we estimate). Moreover, even if the player does not settle at this strategy, they return arbitrarily close to some (possibly different) pure strategy infinitely often. This prompts the question of which sets of pure strategies emerge as robust predictions of learning in the presence of noise and uncertainty. In this regard, we show that (a) the only possible limits of the FTRL dynamics under uncertainty are pure Nash equilibria; and (b) a span of pure strategies is stable and attracting if and only if it is closed under better replies. Finally, we specialize our analysis to games where the dynamics are recurrent in the deterministic setting, such as zero-sum games with an interior equilibrium. In this case, we show that the stochastic dynamics drift toward the boundary on average, thus disrupting the quasi-periodic behavior observed in the noiseless, deterministic regime.
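The abstract describes the perturbed FTRL dynamics only at a high level; the paper works in continuous time. As a purely illustrative discrete-time sketch (the regularizer choice, step size, and noise level below are hypothetical), a noisy exponential-weights FTRL step might look like this:

```python
import numpy as np

def noisy_ftrl_step(score, payoff, eta, sigma, rng):
    """One FTRL step with entropic regularization (exponential weights),
    where the payoff observation is perturbed by Gaussian noise."""
    noisy_payoff = payoff + sigma * rng.standard_normal(payoff.shape)
    score = score + eta * noisy_payoff       # accumulate (noisy) scores
    z = np.exp(score - score.max())          # numerically stable softmax
    return score, z / z.sum()

rng = np.random.default_rng(0)
score = np.zeros(3)
payoffs = rng.uniform(size=(500, 3))         # toy payoff stream
for t in range(500):
    score, strategy = noisy_ftrl_step(score, payoffs[t], eta=0.1,
                                      sigma=0.5, rng=rng)
```

After each step, `strategy` is a valid mixed strategy on the simplex; the paper's result concerns how such trajectories behave near pure strategies under the noise.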
Paperid:1039
Authors:Jicong Fan
Abstract:
Measuring the distance or similarity between graphs is the foundation of many graph analysis tasks such as graph clustering but remains a challenge on large datasets. In this work, we treat the adjacency matrices of two graphs as two kernel matrices given by some unknown indefinite kernel function performed on two discrete distributions and define the distance between the two distributions as a measure, called MMFD, of the dissimilarity between two graphs. We show that MMFD is a pseudo metric. Although the initial definition of MMFD seems complex, we show that it has a closed-form solution with extremely simple computation. To further improve the efficiency of large-scale clustering, we propose MMFD-KM, which has linear space and time complexity with respect to the number of graphs. We also provide a generalization of MMFD, called MFD, which is more effective in exploiting the information of factors of adjacency matrices. The experiments on simulated graphs intuitively show that our methods are effective in comparing graphs. The experiments on real-world datasets demonstrate that, compared to the competitors, our methods have much better clustering performance in terms of three evaluation metrics and time cost.
Paperid:1040
Authors:Walid Durani · Tobias Nitzl · Claudia Plant · Christian Böhm
Abstract:
Detecting anomalies with limited supervision is challenging due to the scarcity of labeled anomalies, which often fail to capture the diversity of abnormal behaviors. We propose Weakly Supervised Anomaly Detection via Dual-Tailed Kernel (WSAD-DT), a novel framework that learns robust latent representations to distinctly separate anomalies from normal samples under weak supervision. WSAD-DT introduces two centroids, one for normal samples and one for anomalies, and leverages a dual-tailed kernel scheme: a light-tailed kernel to compactly model in-class points and a heavy-tailed kernel to maintain a wider margin against out-of-class instances. To preserve intra-class diversity, WSAD-DT incorporates kernel-based regularization, encouraging richer representations within each class. Furthermore, we devise an ensemble strategy that partitions unlabeled data into diverse subsets, while sharing the limited labeled anomalies among these partitions to maximize their impact. Empirically, WSAD-DT achieves state-of-the-art performance on several challenging anomaly detection benchmarks, outperforming leading ensemble-based methods such as XGBOD.
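The abstract does not spell out the kernel choices; one common reading of a dual-tailed scheme pairs a light-tailed Gaussian kernel with a heavy-tailed Cauchy-style kernel. A toy sketch under that assumption (the centroids, latent point, and scoring rule are all illustrative, not the paper's loss):

```python
import numpy as np

def light_tail(d2, gamma=1.0):
    # Gaussian kernel: decays fast, models in-class points compactly
    return np.exp(-gamma * d2)

def heavy_tail(d2, gamma=1.0):
    # Cauchy-style kernel: decays slowly, keeps a wide out-of-class margin
    return 1.0 / (1.0 + gamma * d2)

z = np.array([0.5, -0.2])                    # a latent representation
c_norm, c_anom = np.zeros(2), np.ones(2)     # normal / anomaly centroids
d_norm = np.sum((z - c_norm) ** 2)
d_anom = np.sum((z - c_anom) ** 2)
# pull toward the own centroid (light tail), push from the other (heavy tail)
score = light_tail(d_norm) - heavy_tail(d_anom)
```

The key property is the tail contrast: at large squared distances the Gaussian kernel vanishes much faster than the Cauchy kernel, so out-of-class points keep exerting a repulsive margin.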
Paperid:1041
Authors:Chengyuan Li · Liangxiao Jiang · Wenjun Zhang · Liangjun Yu · Huan Zhang
Abstract:
Due to its simplicity, effectiveness and robustness, naive Bayes (NB) has continued to be one of the top 10 data mining algorithms. To improve its performance, a large number of improved algorithms have been proposed in the last few decades. However, beyond Gaussian naive Bayes (GNB), there is little work on numerical attributes. At the same time, none of these algorithms takes into account the correlations among instances. To fill this gap, we propose a novel algorithm called instance correlation graph-based naive Bayes (ICGNB). Specifically, it first uses original attributes to construct an instance correlation graph (ICG) to represent the correlations among instances. Then, it employs a variational graph auto-encoder (VGAE) to generate new attributes from the constructed ICG and uses them to augment the original attributes. Finally, it weights each augmented attribute to alleviate attribute redundancy and builds GNB on the weighted attributes. The experimental results on tens of datasets show that ICGNB significantly outperforms its competitors.
Paperid:1042
Authors:Conor Snedeker · Xinyu Zhou · Raef Bassily
Abstract:
We study model personalization under user-level differential privacy (DP) in the shared representation framework. In this problem, there are $n$ users whose data is statistically heterogeneous and their optimal parameters $(w_1^*,\dots, w_n^*)\in \mathbb{R}^{d\times n}$ share an unknown embedding $U^* \in\mathbb{R}^{d\times k}$, where $k\ll d$; i.e., $w_i^* = U^* v_i^*, i \in [n]$, where $v_i^*\in \mathbb{R}^k$ is a low-dimensional representation of $w_i^*$. Our goal is to privately recover the embedding $U^*$ and the local vectors $V^* = [v_1^*, \ldots, v_n^*]^\top$ with small excess risk in the federated setting. We propose a private, efficient federated learning algorithm to learn the shared embedding based on the FedRep algorithm in (Collins et al., 2021). Unlike (Collins et al., 2021), our algorithm satisfies differential privacy and our results hold for the case of noisy labels. In contrast to prior work on private model personalization (Jain et al., 2021), our utility guarantees hold under a larger class of users' distributions (sub-Gaussian instead of Gaussian distributions). Additionally, in the regime where $\max_{i\in [n]} \lVert v_i^* \rVert_2$ and the condition number of $ \frac{1}{\sqrt{n}}V^*$ are $\widetilde{O}(1)$, we improve the privacy error term in (Jain et al., 2021) by a factor of $\widetilde{O}(dk)$. Next, we consider the binary classification setting. We present an information-theoretic construction to privately learn the shared embedding and derive a margin-based accuracy guarantee that is independent of $d$. Our method utilizes the Johnson-Lindenstrauss transform to reduce the effective dimensions of the shared embedding and the users' data. This result shows that dimension-independent risk bounds are possible in this setting under a margin loss.
Paperid:1043
Authors:Jongin Lim · Sucheol Lee · Daeho Um · Sung-Un Park · Jinwoo Shin
Abstract:
Data imbalance is a prominent, yet longstanding challenge in most real-world machine learning scenarios. However, prior studies on data imbalance have primarily focused on classification setups, leaving imbalanced regression underexplored despite its importance in many applications. To bridge this gap, we introduce Proxy-based Representation learning for IMbalanced rEgression (PRIME). The key idea of PRIME is to utilize proxies as global reference points for a balanced and well-ordered feature distribution and to align sample features with these proxies. Unlike previous representation learning methods that rely solely on sample relationships within a batch, PRIME leverages rich sample-proxy relationships, offering holistic guidance for learning the desired representations. Moreover, aligning each feature with its corresponding proxy resembles classification, allowing the seamless application of any loss-based class imbalance techniques to promote balanced feature learning. Extensive experiments validate the effectiveness and broad applicability of PRIME, demonstrating state-of-the-art performance on four real-world regression benchmarks across diverse target domains.
Paperid:1044
Authors:Andong Wang · Yuning Qiu · Zhong Jin · Guoxu Zhou · Qibin Zhao
Abstract:
Tensor regression is a vital tool for analyzing complex multidimensional data in fields like neuroimaging and spatiotemporal analysis. However, its effectiveness is often constrained by limited sample sizes, where the scarcity of observations hinders accurate parameter estimation. To overcome this limitation, we adopt a transfer learning strategy that leverages knowledge from related source tasks to enhance performance in data-scarce target tasks. However, this approach brings additional challenges, such as model shifts, covariate shifts, and decentralized data management, which complicate effective knowledge transfer. We propose the Low-Rank Tensor Transitions (LoRT) framework to systematically address these challenges. LoRT employs a novel fusion regularizer to enforce low-tubal-rank solutions, enabling effective integration of source and target knowledge while adapting to distributional shifts. Its two-step refinement process mitigates model shifts and ensures robust adaptation to target tasks. To support decentralized scenarios, we extend LoRT to D-LoRT, a distributed variant that preserves statistical efficiency while minimizing communication overhead. Theoretical analysis establishes the conditions for effective knowledge transfer in LoRT, and experiments on tensor regression tasks, including compressed sensing and completion, validate its robustness and versatility. These results position LoRT as a robust solution to the challenges of tensor regression with limited data and complex distributions.
Paperid:1045
Authors:Mikel Malagón · Josu Ceberio · Jose A Lozano
Abstract:
Advances in large models, reinforcement learning, and open-endedness have accelerated progress toward autonomous agents that can learn and interact in the real world. To achieve this, flexible tools are needed to create rich, yet computationally efficient, environments. While scalable 2D environments fail to address key real-world challenges like 3D navigation and spatial reasoning, more complex 3D environments are computationally expensive and lack features like customizability and multi-agent support. This paper introduces Craftium, a highly customizable and easy-to-use platform for building rich 3D single- and multi-agent environments. We showcase environments of different complexity and nature: from single- and multi-agent tasks to vast worlds with many creatures and biomes and customizable procedural task generators. Benchmarking shows that Craftium significantly reduces the computational cost of alternatives of similar richness, running more than 2K steps per second faster than Minecraft-based frameworks.
Paperid:1046
Authors:Zhihe Yang · Yunjian Xu · Yang Zhang
Abstract:
Safe offline reinforcement learning (RL), which aims to learn the safety-guaranteed policy without risky online interaction with environments, has attracted growing recent attention for safety-critical scenarios. However, existing approaches encounter out-of-distribution problems during the testing phase, which can result in potentially unsafe outcomes. This issue arises due to the infinite possible combinations of reward-related and cost-related states. In this work, we propose State Decoupling with Q-supervised Contrastive representation (SDQC), a novel framework that decouples the global observations into reward- and cost-related representations for decision-making, thereby improving the generalization capability for unfamiliar global observations. Compared with the classical representation learning methods, which typically require model-based estimation (e.g., bisimulation), we theoretically prove that our Q-supervised method generates a coarser representation while preserving the optimal policy, resulting in improved generalization performance. Experiments on DSRL benchmark problems provide compelling evidence that SDQC surpasses other baseline algorithms, especially for its exceptional ability to achieve almost zero violations in more than half of the tasks, while the state-of-the-art algorithm can only achieve the same level of success in a quarter of the tasks. Further, we demonstrate that SDQC possesses superior generalization ability when confronted with unseen environments.
Paperid:1047
Authors:I-Chun Chen · Hsu-Shen Liu · Wei-Fang Sun · Chen-Hao Chao · Yen-Chang Hsu · Chun-Yi Lee
Abstract:
Sparse Mixture-of-Experts (SMoE) models represent a significant advancement in large language model (LLM) development through their efficient parameter utilization. These models achieve substantial performance improvements at reduced inference costs. However, the deployment of SMoE models faces constraints from extensive memory requirements of expert components in resource-limited environments. To address these limitations, this paper introduces Hierarchical Clustering for Sparsely activated Mixture of Experts (HC-SMoE), a task-agnostic expert merging framework for parameter reduction without retraining. HC-SMoE introduces a novel hierarchical clustering approach based on expert outputs to ensure merging robustness independent of routing decisions. The proposed output-based clustering method enables effective capture of functional relationships between experts for large-scale architectures. We provide theoretical analysis and comprehensive evaluations across multiple zero-shot language tasks to demonstrate HC-SMoE’s effectiveness in state-of-the-art models including Qwen and Mixtral. The experimental results validate HC-SMoE’s superior performance and practical applicability for real-world deployments.
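As a rough illustration of output-based expert clustering (not the paper's actual algorithm; the linkage rule, calibration inputs, and merging-by-averaging below are all assumptions), one could agglomerate experts by the similarity of their outputs rather than their weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d, n_tokens, k = 8, 16, 64, 4
experts = rng.standard_normal((n_experts, d, d))   # toy expert weight matrices
calib = rng.standard_normal((n_tokens, d))         # calibration inputs

# Signature of each expert = its *outputs* on calibration data,
# which is independent of any routing decision.
sig = np.stack([calib @ E.T for E in experts]).reshape(n_experts, -1)

# Naive agglomeration down to k clusters: repeatedly merge the closest pair
clusters = [[i] for i in range(n_experts)]
while len(clusters) > k:
    centers = np.stack([sig[c].mean(axis=0) for c in clusters])
    dist = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    i, j = sorted(np.unravel_index(dist.argmin(), dist.shape))
    clusters[i].extend(clusters.pop(j))

# Each merged expert is the average of its cluster's weights
merged = np.stack([experts[c].mean(axis=0) for c in clusters])
```

The point of clustering on `sig` rather than on the raw weights is that two experts computing similar functions get merged even if their parameters differ.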
Paperid:1048
Authors:Xinyue Zheng · Haowei Lin · Kaichen He · Zihao Wang · Zilong Zheng · Qiang Fu · Haobo Fu · Yitao Liang
Abstract:
Developing AI agents capable of interacting with open-world environments to solve diverse tasks is a compelling challenge. However, evaluating such open-ended agents remains difficult, with current benchmarks facing scalability limitations. To address this, we introduce \textit{Minecraft Universe} (MCU), a comprehensive evaluation framework set within the open-world video game Minecraft. MCU incorporates three key components: (1) an expanding collection of 3,452 composable atomic tasks that encompasses 11 major categories and 41 subcategories of challenges; (2) a task composition mechanism capable of generating infinite diverse tasks with varying difficulty; and (3) a general evaluation framework that achieves 91.5\% alignment with human ratings for open-ended task assessment. Empirical results reveal that even state-of-the-art foundation agents struggle with the increasing diversity and complexity of tasks. These findings highlight the necessity of MCU as a robust benchmark to drive progress in AI agent development within open-ended environments.
Paperid:1049
Authors:Mengmeng Chen · Xiaohu Wu · QIQI LIU · Tiantian He · Yew Soon ONG · Yaochu Jin · Qicheng Lao · Han Yu
Abstract:
Multi-objective optimization (MOO) arises extensively in machine learning and aims to find a set of Pareto-optimal solutions, called the Pareto front; e.g., it is fundamental for multiple avenues of research in federated learning (FL). Pareto-Front Learning (PFL) is a powerful method implemented using Hypernetworks (PHNs) to approximate the Pareto front. This method enables the acquisition of a mapping function from a given preference vector to the solutions on the Pareto front. However, most existing PFL approaches still face two challenges: (a) sampling rays in high-dimensional spaces; (b) failing to cover the entire Pareto front, which has a convex shape. Here, we introduce a novel PFL framework, called PHN-HVVS, which decomposes the design space into Voronoi grids and deploys a genetic algorithm (GA) for Voronoi grid partitioning within high-dimensional space. We put forward a new loss function, which effectively contributes to more extensive coverage of the resultant Pareto front and maximizes the HV Indicator. Experimental results on multiple MOO machine learning tasks demonstrate that PHN-HVVS outperforms the baselines significantly in generating the Pareto front. Also, we illustrate that PHN-HVVS advances the methodologies of several recent problems in the FL field.
Paperid:1050
Authors:Marwa Abdulhai · Isadora White · Charlie Snell · Charles Sun · Joey Hong · Yuexiang Zhai · Kelvin Xu · Sergey Levine
Abstract:
Large language models (LLMs) provide excellent text-generation capabilities, but standard prompting and generation methods generally do not lead to intentional or goal-directed agents and might necessitate considerable prompt tuning. Even the best current LLMs rarely ask clarifying questions, engage in explicit information gathering, or take actions that lead to better decisions after multiple turns. Reinforcement learning has the potential to leverage the powerful modeling capabilities of LLMs, as well as their internal representation of textual interactions, to create capable goal-directed language agents. This can enable intentional and temporally extended interactions, such as with humans, the emergence of complex skills such as persuasion, and long-horizon strategic behavior, such as in the context of games. Enabling this requires the community to develop reliable reinforcement learning algorithms for training LLMs. Developing such algorithms requires tasks that can gauge progress on algorithm design, provide accessible and reproducible evaluations for multi-turn interactions, and cover a range of task properties and challenges in improving reinforcement learning algorithms. Our paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for LLMs, together with an open-source research framework for getting started on multi-turn RL with offline value-based and online policy-based RL methods. Our benchmark consists of 3 Interactive Dialogue tasks and 5 RL Capability tests for a total of 8 tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.
Paperid:1051
Authors:Glory Rongyu CHEN · Li'an Zhuo · Linlin Yang · Qi WANG · Bang Zhang · Angela Yao · Liefeng Bo
Abstract:
Vision Transformers (ViT) are remarkable at 3D pose estimation, yet they still encounter certain challenges. One issue is that the popular ViT architecture for pose estimation is limited to images and lacks temporal information. Another challenge is that the prediction often fails to maintain pixel alignment with the original images. To address these issues, we propose a systematic framework for 3D pose estimation, which we call ExtPose. ExtPose extends image ViT by taking in additional 2D pose evidence and captures temporal information in a full attention-based manner. We use 2D human skeleton images to integrate structured 2D pose information. By sharing parameters and attending across the modality, we enhance the consistency between 3D poses and 2D images without introducing additional parameters. We achieve state-of-the-art (SOTA) performance on multiple human and hand pose estimation benchmarks, with substantial improvements in PA-MPJPE over other ViT-based methods: 34.0mm (-23%) on 3DPW and 4.9mm (-18%) on FreiHAND.
Paperid:1052
Authors:Marco Bagatella · Jonas Hübotter · Georg Martius · Andreas Krause
Abstract:
Pretrained generalist policies are rapidly gaining relevance in robot learning due to their promise of fast adaptation to novel, in-domain tasks. This adaptation often relies on collecting new demonstrations for a specific task of interest and applying imitation learning algorithms, such as behavioral cloning. However, as soon as several tasks need to be learned, we must decide: which tasks should be demonstrated, and how often? We study this multi-task problem and explore an interactive framework in which the agent adaptively selects the tasks to be demonstrated. We propose AMF (Active Multi-task Fine-tuning), an algorithm to maximize multi-task policy performance under a limited demonstration budget by collecting demonstrations yielding the largest information gain on the expert policy. We derive performance guarantees for AMF under regularity assumptions and demonstrate its empirical effectiveness in efficiently fine-tuning neural policies in complex and high-dimensional environments.
Paperid:1053
Authors:Guoxin Chen · Minpeng Liao · Peiying Yu · Dingmin Wang · Zile Qiao · Chao Yang · Xin Zhao · Kai Fan
Abstract:
Retrieval-augmented generation (RAG) systems face a fundamental challenge in aligning independently developed retrievers and large language models (LLMs). Existing approaches typically involve modifying either component or introducing simple intermediate modules, resulting in practical limitations and sub-optimal performance. Inspired by human search behavior, which typically involves a back-and-forth process of proposing search queries and reviewing documents, we propose C-3PO, a proxy-centric framework that facilitates communication between retrievers and LLMs through a lightweight multi-agent system. Our framework implements three specialized agents that collaboratively optimize the entire RAG pipeline without altering the retriever and LLMs. These agents work together to assess the need for retrieval, generate effective queries, and select information suitable for the LLMs. To enable effective multi-agent coordination, we develop a tree-structured rollout approach for reward credit assignment in reinforcement learning. Extensive experiments in both in-domain and out-of-distribution scenarios demonstrate that C-3PO significantly enhances RAG performance while maintaining plug-and-play flexibility and superior generalization capabilities.
Paperid:1054
Authors:Michal Lukasik · Lin Chen · Harikrishna Narasimhan · Aditya Menon · Wittawat Jitkrittum · Felix Xinnan Yu · Sashank J. Reddi · Thomas Fu · MohammadHossein Bateni · Sanjiv Kumar
Abstract:
Bipartite ranking is a fundamental supervised learning problem, with applications ranging from information retrieval to medical diagnosis. Here, the goal is to learn a ranking over instances with maximal AUC against a single binary target label. However, one may often observe multiple binary target labels, e.g., from distinct human annotators. How can one synthesize such labels into a single coherent ranking? In this work, we formally analyze two approaches to this problem --- loss aggregation and label aggregation --- by characterizing their Bayes-optimal solutions. Based on this, we show that while both methods can yield Pareto-optimal solutions, loss aggregation can exhibit label dictatorship: one can inadvertently (and undesirably) favor one label over others. This suggests that label aggregation can be preferable to loss aggregation, which we empirically verify.
Paperid:1055
Authors:Jiujun He · Huazhen Lin
Abstract:
Most existing structured pruning methods for Large Language Models (LLMs) require substantial computational and data resources for retraining to reestablish the corrupted correlations, making them prohibitively expensive. To address this, we propose an efficient pruning framework for LLMs called Orthogonal Neuron Decomposition and Linear Calibration (Olica), which eliminates the need for retraining. A key observation is that the multi-head attention (MHA) layer depends on two types of matrix products (i.e., ${\rm W}_q{\rm W}^{\top}_k$ and ${\rm W}_v{\rm W}^{\top}_o$). By treating these matrix products as unified entities and applying principal component analysis (PCA), we extract the most important information to compress LLMs without sacrificing accuracy or disrupting their original structure. Consequently, retraining becomes unnecessary. Moreover, a fast decomposition method is devised, reducing the complexity of PCA by a factor of the square of the number of attention heads. Additionally, to mitigate the error accumulation problem caused by pruning the feed-forward network (FFN) layer, we introduce a linear calibration method to reconstruct the residual errors of a pruned layer using two low-rank matrices. By leveraging singular value decomposition (SVD) on the solution of the least-squares problem, these matrices are obtained without requiring retraining. Extensive experiments show that the proposed Olica is efficient in terms of data usage, GPU memory, and running time, while delivering superior performance across multiple benchmarks. For example, it requires only hundreds of samples, several minutes of runtime, and just a single 16GB GPU to prune a LLaMA-7B model with outstanding performance compared to existing pruning methods.
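For the MHA step, the abstract says PCA is applied to the unified products such as ${\rm W}_q{\rm W}^{\top}_k$. A minimal sketch of that idea via truncated SVD, the standard route to a best low-rank approximation (the shapes, rank, and square-root factor split below are illustrative, not Olica's exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8                                  # hidden size, retained rank
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))

P = Wq @ Wk.T                                 # unified entity W_q W_k^T
U, s, Vt = np.linalg.svd(P)                   # principal directions of the product
Wq_c = U[:, :r] * np.sqrt(s[:r])              # compressed factors whose product
Wk_c = Vt[:r].T * np.sqrt(s[:r])              # is the best rank-r approximation of P

err = np.linalg.norm(P - Wq_c @ Wk_c.T) / np.linalg.norm(P)
```

Because attention scores depend on $W_q W_k^\top$ only through this product, the compressed factors can replace the originals without changing the layer's interface.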
Paperid:1056
Authors:Deborah Hendrych · Sebastian Pokutta · Mathieu Besançon · David Martinez-Rubio
Abstract:
We present a new stepsize strategy based on the secant method for Frank-Wolfe algorithms. This strategy, which requires mild assumptions about the function under consideration, can be applied to any Frank-Wolfe algorithm. It is as effective as full line search and, in particular, allows for adapting to the local smoothness of the function, such as in (Pedregosa et al., 2020), but comes with a significantly reduced computational cost, leading to higher effective rates of convergence. We provide theoretical guarantees and demonstrate the effectiveness of the strategy through numerical experiments.
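Concretely, the secant strategy can be read as root-finding on the directional derivative $\phi'(\gamma) = \langle \nabla f(x + \gamma d), d \rangle$ along the Frank-Wolfe segment $d = s - x$. A toy sketch on a quadratic over the simplex (the function names, iteration budget, and clamping to $[0,1]$ are illustrative, not the paper's exact procedure):

```python
import numpy as np

def secant_step_size(grad_f, x, d, g0=0.0, g1=1.0, iters=10, tol=1e-10):
    """Approximate argmin over gamma in [0,1] of f(x + gamma*d) by running
    the secant method on phi(gamma) = <grad f(x + gamma*d), d>."""
    phi = lambda g: grad_f(x + g * d) @ d
    p0, p1 = phi(g0), phi(g1)
    for _ in range(iters):
        if abs(p1 - p0) < tol:
            break
        g0, g1, p0 = g1, g1 - p1 * (g1 - g0) / (p1 - p0), p1
        g1 = min(max(g1, 0.0), 1.0)           # stay on the segment
        p1 = phi(g1)
    return g1

# f(x) = ||x - b||^2 over the simplex; one Frank-Wolfe step from a vertex
b = np.array([0.2, 0.7, 0.1])
grad_f = lambda x: 2 * (x - b)
x = np.array([1.0, 0.0, 0.0])                 # current iterate
s = np.zeros(3); s[np.argmin(grad_f(x))] = 1  # linear minimization oracle
gamma = secant_step_size(grad_f, x, s - x)
x_new = x + gamma * (s - x)
```

For a quadratic, $\phi'$ is linear in $\gamma$, so the secant iteration lands on the exact line-search minimizer after a single update; for general smooth $f$ it converges superlinearly near the minimizer, which is why it can match full line search at a fraction of the cost.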
Paperid:1057
Authors:Kaixuan Xu · Jiajun Chai · Sicheng Li · Yuqian Fu · Yuanheng Zhu · Dongbin Zhao
Abstract:
Diplomacy is a complex multiplayer game that requires both cooperation and competition, posing significant challenges for AI systems. Traditional methods rely on equilibrium search to generate extensive game data for training, which demands substantial computational resources. Large Language Models (LLMs) offer a promising alternative, leveraging pre-trained knowledge to achieve strong performance with relatively small-scale fine-tuning. However, applying LLMs to Diplomacy remains challenging due to the exponential growth of possible action combinations and the intricate strategic interactions among players. To address this challenge, we propose DipLLM, a fine-tuned LLM-based agent that learns equilibrium policies for Diplomacy. DipLLM employs an autoregressive factorization framework to simplify the complex task of multi-unit action assignment into a sequence of unit-level decisions. By defining an equilibrium policy within this framework as the learning objective, we fine-tune the model using only 1.5% of the data required by the state-of-the-art Cicero model, surpassing its performance. Our results demonstrate the potential of fine-tuned LLMs for tackling complex strategic decision-making in multiplayer games.
Paperid:1058
Authors:Daniel Shao · Richard Chen · Andrew Song · Joel Runevic · Ming Y. Lu · Tong Ding · Faisal Mahmood
Abstract:
Multiple Instance Learning (MIL) is a cornerstone approach in computational pathology for distilling embeddings from gigapixel tissue images into patient-level representations to predict clinical outcomes. However, MIL is frequently challenged by the constraints of working with small, weakly-supervised clinical datasets. Unlike fields such as natural language processing and computer vision, which effectively use transfer learning to improve model quality in data-scarce environments, the transferability of MIL models remains largely unexplored. We conduct the first comprehensive investigation into the transfer learning capabilities of pretrained MIL models, evaluating 11 MIL models across 19 pretraining tasks spanning tissue subtyping, cancer grading, and molecular subtype prediction. We observe a substantial performance boost from fine-tuning pretrained models over training from randomly initialized weights, even with domain differences between pretraining and target tasks. Pretraining on pan-cancer datasets enables consistent generalization across organs and task types compared to single-disease pretraining. Remarkably, this pan-cancer pretraining leads to better transfer than that of a state-of-the-art slide-level foundation model, while using only 6.5% of the training data. These findings indicate that MIL architectures exhibit robust adaptability, offering insights into the benefits of leveraging pretrained models to enhance performance in computational pathology.
Paperid:1059
Authors:Chen Wang · Siyu Hu · Guangming Tan · Weile Jia
Abstract:
Pre-trained interatomic potentials have become a new paradigm for atomistic materials simulations, enabling accurate and efficient predictions across diverse chemical systems. Despite their promise, fine-tuning is often required for complex tasks to achieve high accuracy. Traditional parameter-efficient fine-tuning approaches are effective in NLP and CV. However, when applied to equivariant pre-trained interatomic potentials, these methods inevitably break equivariance, a critical property for preserving physical symmetries. In this paper, we introduce ELoRA (Equivariant Low-Rank Adaptation), a novel fine-tuning method designed specifically for equivariant Graph Neural Networks (GNNs), the backbones of multiple pre-trained interatomic potentials. ELoRA adopts a path-dependent decomposition for weight updates, which offers two key advantages: (1) it preserves equivariance throughout the fine-tuning process, ensuring physically consistent predictions, and (2) it leverages low-rank adaptations to significantly improve data efficiency. We prove that ELoRA maintains equivariance and demonstrate its effectiveness through comprehensive experiments. On the rMD17 organic dataset, ELoRA achieves a 25.5% improvement in energy prediction accuracy and a 23.7% improvement in force prediction accuracy compared to full-parameter fine-tuning. Similarly, across 10 inorganic datasets, ELoRA achieves average improvements of 12.3% and 14.4% in energy and force predictions, respectively.
Paperid:1060
Authors:Huahui Yi · Wei Xu · Ziyuan Qin · Xi Chen · Xiaohu Wu · Kang Li · Qicheng Lao
Abstract:
Existing prompt-based approaches have demonstrated impressive performance in continual learning, leveraging pre-trained large-scale models for classification tasks. However, the tight coupling between foreground-background information and the coupled attention between prompts and image-text tokens present significant challenges in incremental medical object detection tasks, due to the conceptual gap between medical and natural domains. To overcome these challenges, we introduce the iDPA framework, which comprises two main components: 1) Instance-level Prompt Generation (IPG), which decouples fine-grained instance-level knowledge from images and generates prompts that focus on dense predictions, and 2) Decoupled Prompt Attention (DPA), which decouples the original prompt attention, enabling a more direct and efficient transfer of prompt information while reducing memory usage and mitigating catastrophic forgetting. We collect 13 clinical, cross-modal, multi-organ, and multi-category datasets, referred to as ODinM-13, and experiments demonstrate that iDPA outperforms existing SOTA methods, with FAP improvements of 5.44%, 4.83%, 15.39%, and 4.59% in the full-data, 1-shot, 10-shot, and 50-shot settings, respectively.
Paperid:1061
Authors:Udaya Ghai · Karan Singh
Abstract:
Boosting provides a practical and provably effective framework for constructing accurate learning algorithms from inaccurate rules of thumb. It extends the promise of sample-efficient learning to settings where direct Empirical Risk Minimization (ERM) may not be implementable efficiently. In the realizable setting, boosting is known to offer this computational reprieve without compromising on sample efficiency. However, in the agnostic case, existing boosting algorithms fall short of achieving the optimal sample complexity. This paper highlights an unexpected and previously unexplored avenue of improvement: unlabeled samples. We design a computationally efficient agnostic boosting algorithm that matches the sample complexity of ERM, given polynomially many additional unlabeled samples. In fact, we show that the total number of samples needed (labeled and unlabeled combined) is never more than that for the best known agnostic boosting algorithm -- so this result is never worse -- while only a vanishing fraction of these need to be labeled for the algorithm to succeed. This is particularly fortuitous for learning-theoretic applications of agnostic boosting, which often take place in settings where unlabeled samples are available for free. We detail other applications of this result in reinforcement learning and learning theory.
Paperid:1062
Authors:Wenyin Zhou · Christopher I Sprague · Vsevolod Viliuga · Matteo Tadiello · Arne Elofsson · Hossein Azizpour
Abstract:
Molecular structure generation is a fundamental problem that involves determining the 3D positions of molecules' constituents. It has crucial biological applications, such as molecular docking, protein folding, and molecular design. Recent advances in generative modeling, such as diffusion models and flow matching, have made great progress on these tasks by modeling molecular conformations as a distribution. In this work, we focus on flow matching and adopt an energy-based perspective to improve training and inference of structure generation models. Our view results in a mapping function, represented by a deep network, that is directly learned to \textit{iteratively} map random configurations, i.e. samples from the source distribution, to target structures, i.e. points in the data manifold. This yields a conceptually simple and empirically effective flow matching setup that is theoretically justified and has interesting connections to fundamental properties such as idempotency and stability, as well as to empirically useful techniques such as structure refinement in AlphaFold. Experiments on protein binding, both single- and multi-ligand docking, as well as protein backbone generation consistently demonstrate the method's effectiveness, where it outperforms recent baselines of task-associated flow matching and diffusion models, using a similar computational budget.
Paperid:1063
Authors:Tim Steinert · David Ginsbourger · Ove Christiansen · August Lykke-Møller · Henry Moss
Abstract:
We study the incorporation of equivariances into vector-valued GPs and more general classes of random field models. While kernels guaranteeing such equivariances have been investigated previously, their evaluation is often computationally prohibitive due to required integrations over the involved groups. In this work, we propose the first integration-free construction for equivariant kernels, leveraging the group-theoretic notion of fundamental domains. We establish data-efficient and computationally lightweight GP models for velocity fields and molecular electric dipole moments and demonstrate that our proposed kernel may be leveraged to extract equivariant components from data.
Paperid:1064
Authors:DiJia Su · Hanlin Zhu · Yingchen Xu · Jiantao Jiao · Yuandong Tian · Qinqing Zheng
Abstract:
Large Language Models (LLMs) excel at reasoning and planning when trained on chain-of-thought (CoT) data, where the step-by-step thought process is explicitly outlined by text tokens. However, this results in lengthy inputs where many words support textual coherence rather than core reasoning information, and processing these inputs consumes substantial computational resources. In this work, we propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens generated by a VQ-VAE, significantly reducing the length of reasoning traces. We explore the use of latent trace abstractions in two scenarios: 1) training the model from scratch for the Keys-Finding Maze problem, and 2) fine-tuning LLMs on this hybrid data with an extended vocabulary including unseen latent tokens, for both logical and mathematical reasoning problems. To facilitate effective learning, we introduce a simple training procedure that randomly mixes latent and text tokens, which enables fast adaptation to new latent tokens. Our approach consistently outperforms the baseline methods on various benchmarks, such as Math (+4.2%, Llama-3.2-1B), GSM8K (+4.1%, Llama-3.2-3B), and Fresh-Gaokao-Math-2023 (+13.3%, Llama-3.1-8B), with an average reduction of 17% in reasoning-trace length.
Paperid:1065
Authors:Unique Subedi · Ambuj Tewari
Abstract:
We study active data collection strategies for operator learning when the target operator is linear and the input functions are drawn from a mean-zero stochastic process with continuous covariance kernels. With an active data collection strategy, we establish an error convergence rate in terms of the decay rate of the eigenvalues of the covariance kernel. We can achieve arbitrarily fast error convergence rates with sufficiently rapid eigenvalue decay of the covariance kernels. This contrasts with the passive (i.i.d.) data collection strategies, where the convergence rate is never faster than linear decay ($\sim n^{-1}$). In fact, for our setting, we show a \emph{non-vanishing} lower bound for any passive data collection strategy, regardless of the eigenvalue decay rate of the covariance kernel. Overall, our results show the benefit of active data collection strategies in operator learning over their passive counterparts.
Paperid:1066
Authors:Aviv Bick · Eric Xing · Albert Gu
Abstract:
State-space models (SSMs) have demonstrated impressive efficiency and performance across a range of tasks. Despite these strengths, they face notable limitations when it comes to tasks that require structured reasoning, such as retrieval, copying, and in-context learning. In this work, we investigate the root cause of this limitation and discover that both RNN-based and Attention-based models rely on the same mechanism: Gather & Aggregate, where the network learns to use two heads from different layers to collaboratively implement a retrieval algorithm. Gather Heads partition the input into logical parts, while Aggregation Heads integrate the information from these parts. Our experiments show that, unlike knowledge, which is distributed across all layers and heads, the Gather & Aggregate mechanism is rare and highly concentrated. This highlights the critical role of these specialized heads in retrieval-based tasks. For example, disabling just one Gather Head or Aggregation Head reduces MMLU performance from 66% to 25%, equivalent to random guessing. Through various tasks, we identify a key limitation of SSM-based models: their restricted ability to implement effective Aggregation Heads. This understanding not only informs the design of next-generation models, but also sheds light on existing ones. For instance, we find that the success of Hybrid models lies in the optimization process, which shifts the responsibility for this mechanism to the Attention layers.
Paperid:1067
Authors:Shaoan Xie · Lingjing Kong · Yujia Zheng · Zeyu Tang · Eric Xing · Guangyi Chen · Kun Zhang
Abstract:
Concept learning seeks to extract semantic and interpretable representations of atomic concepts from high-dimensional data such as images and text, which can be instrumental to a variety of downstream tasks (e.g., image generation/editing). Despite its importance, the theoretical foundations for learning atomic concepts and their interactions, especially from multimodal distributions, remain underexplored. In this work, we establish fundamental conditions for learning atomic multimodal concepts and their underlying interactions. We formulate concept learning as a latent variable identification problem, representing atomic concepts and their interactions as latent variables within a graphical model. Our theoretical contribution is providing component-wise identifiability of atomic concepts under flexible, nonparametric conditions that accommodate both continuous and discrete modalities. This advances beyond prior approaches that offer only block-wise identifiability or rely on parametric assumptions.
Paperid:1068
Authors:Huaicheng Zhou · Zifeng Zhuang · Donglin Wang
Abstract:
The integration of Deep Neural Networks (DNNs) in Reinforcement Learning (RL) systems has led to remarkable progress in solving complex tasks but also introduced challenges like primacy bias and dead neurons. Primacy bias skews learning towards early experiences, while dead neurons diminish the network's capacity to acquire new knowledge. Traditional reset mechanisms aimed at addressing these issues often involve maintaining large replay buffers to train new networks or selectively resetting subsets of neurons. However, these approaches either incur substantial computational costs or fail to effectively reset the entire network, resulting in underutilization of network plasticity and reduced learning efficiency. In this work, we introduce the novel concept of neuron regeneration, which combines reset mechanisms with knowledge recovery techniques. We also propose a new framework called Sustainable Backup Propagation (SBP) that effectively maintains plasticity in neural networks through this neuron regeneration process. The SBP framework achieves whole-network neuron regeneration through two key procedures: cycle reset and inner distillation. Cycle reset involves a scheduled renewal of neurons, while inner distillation functions as a knowledge recovery mechanism at the neuron level. To validate our framework, we integrate SBP with Proximal Policy Optimization (PPO) and propose a novel distillation function for inner distillation. This integration results in Plastic PPO (P3O), a new algorithm that enables efficient cyclic regeneration of all neurons in the actor network. Extensive experiments demonstrate that the approach effectively maintains policy plasticity and improves sample efficiency in reinforcement learning tasks.
Paperid:1069
Authors:Thibaud Gloaguen · Nikola Jovanović · Robin Staab · Martin Vechev
Abstract:
LLM watermarks stand out as a promising way to attribute ownership of LLM-generated text. One threat to watermark credibility comes from spoofing attacks, where an unauthorized third party forges the watermark, enabling it to falsely attribute arbitrary texts to a particular LLM. Despite recent work demonstrating that state-of-the-art schemes are, in fact, vulnerable to spoofing, no prior work has focused on post-hoc methods to discover spoofing attempts. In this work, we propose the first reliable statistical method to distinguish spoofed from genuinely watermarked text, suggesting that current spoofing attacks are less effective than previously thought. In particular, we show that regardless of their underlying approach, all current learning-based spoofing methods consistently leave observable artifacts in spoofed texts, indicative of watermark forgery. We build upon these findings to propose rigorous statistical tests that reliably reveal the presence of such artifacts and thus demonstrate that a watermark has been spoofed. Our experimental evaluation shows high test power across all learning-based spoofing methods, providing insights into their fundamental limitations and suggesting a way to mitigate this threat.
Paperid:1070
Authors:Mahir Labib Dihan · Tanvir Hassan · Md Tanvir Parvez · Hasebul Hasan · Almash Alam · Muhammad Aamir Cheema · Mohammed Eunus Ali · Md Rizwan Parvez
Abstract:
Recent advancements in foundation models have improved autonomous tool usage and reasoning, but their capabilities in map-based reasoning remain underexplored. To address this, we introduce MapEval, a benchmark designed to assess foundation models across three distinct tasks (textual, API-based, and visual reasoning) through 700 multiple-choice questions spanning 180 cities and 54 countries, covering spatial relationships, navigation, travel planning, and real-world map interactions. Unlike prior benchmarks that focus on simple location queries, MapEval requires models to handle long-context reasoning, API interactions, and visual map analysis, making it the most comprehensive evaluation framework for geospatial AI. In an evaluation of 28 foundation models, including Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro, none surpasses 67% accuracy, with open-source models performing significantly worse and all models lagging over 20% behind human performance. These results expose critical gaps in spatial inference, as models struggle with distances, directions, route planning, and place-specific reasoning, highlighting the need for better geospatial AI to bridge the gap between foundation models and real-world navigation.
Paperid:1071
Authors:Juyun Wee · Minjae Park · Jaeho Lee
Abstract:
Depth pruning aims to reduce the inference cost of a large language model without any hardware-specific complications, by simply removing several less important transformer blocks. However, our empirical findings suggest that the importance of a transformer block may be highly task-dependent: a block that is crucial for one task can be removed without degrading the accuracy on another task. Based on this observation, we develop a dynamic depth pruning algorithm, coined PuDDing (Prompt-routed Dynamic Depth Pruning), which determines which blocks to omit from the model based on the input prompt. PuDDing operates by training a lightweight router to predict the best omission set among a set of options, where this option set has also been constructed in a data-driven manner. Empirical results on commonsense reasoning benchmarks demonstrate that PuDDing effectively accelerates the inference of language models and achieves better on-task performance than static depth pruning baselines.
Paperid:1072
Authors:Junhong Zhang · Zhihui Lai
Abstract:
Kernel methods are powerful tools for nonlinear learning with well-established theory. Scalability has been their long-standing challenge. Despite existing successes, there are two limitations in large-scale kernel methods: (i) the memory overhead is too high for users to afford; (ii) existing efforts mainly focus on kernel ridge regression (KRR), while other models lack study. In this paper, we propose Joker, a joint optimization framework for diverse kernel models, including KRR, logistic regression, and support vector machines. We design a dual block coordinate descent method with trust region (DBCD-TR) and adopt kernel approximation with randomized features, leading to low memory costs and high efficiency in large-scale learning. Experiments show that Joker saves up to 90% memory while achieving comparable or even better training time and performance than state-of-the-art methods.
Paperid:1073
Authors:Kristina Nikolić · Luze Sun · Jie Zhang · Florian Tramer
Abstract:
Jailbreak attacks bypass the guardrails of large language models to produce harmful outputs. In this paper, we ask whether the model outputs produced by existing jailbreaks are actually useful. For example, when jailbreaking a model to give instructions for building a bomb, does the jailbreak yield good instructions? Since the utility of most unsafe answers (e.g., bomb instructions) is hard to evaluate rigorously, we build new jailbreak evaluation sets with known ground-truth answers, by aligning models to refuse questions related to benign and easy-to-evaluate topics (e.g., biology or math). Our evaluation of eight representative jailbreaks across five utility benchmarks shows vast disparities in the utility of model responses. For example, while all jailbreaks we tested bypass guardrails in models aligned to refuse to answer math questions, this comes at the expense of a drop of up to 97% in accuracy. Overall, our work proposes jailbreak utility as a new important metric in AI safety, and introduces benchmarks to evaluate existing and future jailbreaks.
Paperid:1074
Authors:Loris Gaven · Thomas Carta · Clément Romac · Cédric Colas · sylvain lamprier · Olivier Sigaud · Pierre-Yves Oudeyer
Abstract:
Open-ended learning agents must efficiently prioritize goals in vast possibility spaces, focusing on those that maximize learning progress (LP). When such autotelic exploration is achieved by LLM agents trained with online RL in high-dimensional and evolving goal spaces, a key challenge for LP prediction is modeling one's own competence, a form of metacognitive monitoring. Traditional approaches either require extensive sampling or rely on brittle expert-defined goal groupings. We introduce MAGELLAN, a metacognitive framework that lets LLM agents learn to predict their competence and learning progress online. By capturing semantic relationships between goals, MAGELLAN enables sample-efficient LP estimation and dynamic adaptation to evolving goal spaces through generalization. In an interactive learning environment, we show that MAGELLAN improves LP prediction efficiency and goal prioritization, being the only method that allows the agent to fully master a large and evolving goal space. These results demonstrate how augmenting LLM agents with a metacognitive ability for LP prediction can effectively scale curriculum learning to open-ended goal spaces.
Paperid:1075
Authors:Kiarash Banihashem · Xiang Chen · MohammadTaghi Hajiaghayi · Sungchul Kim · Kanak Mahadik · Ryan A Rossi · Tong Yu
Abstract:
Metric embeddings are fundamental in machine learning, enabling similarity search, dimensionality reduction, and representation learning. They underpin modern architectures like transformers and large language models, facilitating scalable training and improved generalization. Theoretically, the classic problem in embedding design is mapping arbitrary metrics into $\ell_p$ spaces while approximately preserving pairwise distances. We study this problem in a fully dynamic setting, where the underlying metric is a graph metric subject to edge insertions and deletions. Our goal is to maintain an efficient embedding after each update. We present the first fully dynamic algorithm for this problem, achieving $O(\log(n))^{2q} \cdot O(\log(nW))^{q-1}$ expected distortion with $O(m^{1/q + o(1)})$ update time and $O(q \log(n) \log(nW))$ query time, where $q \ge 2$ is an integer parameter.
Paperid:1076
Authors:Zilong Wang · Zhiyao Zhang · Shuai Li
Abstract:
Multiplayer multi-armed bandit (MP-MAB) problems have been widely studied owing to their diverse applications across numerous domains. We consider an MP-MAB problem where $N$ players compete for $K$ arms over $T$ rounds. The reward distributions are heterogeneous, where each player has a different expected reward for the same arm. When multiple players select the same arm, they collide and obtain zero rewards. In this paper, our target is to find the max-min fairness matching that maximizes the reward of the player who receives the lowest reward. This paper improves the existing max-min regret upper bound of $O(\exp(1/\Delta) + K^3 \log T \log\log T)$. More specifically, our decentralized fair elimination algorithm (DFE) deals with heterogeneity and collisions carefully and attains a regret upper bound of $O((N^2+K)\log T / \Delta)$, where $\Delta$ is the minimum reward gap between the max-min value and sub-optimal arms. In addition, this paper also provides an $\Omega(\max\{N^2, K\} \log T / \Delta)$ regret lower bound for this problem, which indicates that our algorithm is optimal with respect to the key parameters $T, N, K$, and $\Delta$. Additional numerical experiments also show the efficiency and improvement of our algorithms.
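To make the max-min objective concrete, here is a toy brute-force computation of the max-min fairness matching. The reward table and function name are illustrative, and this enumeration over permutations (which avoids collisions by construction) is not the DFE algorithm itself, which must work from bandit feedback:

```python
from itertools import permutations

def max_min_matching(rewards):
    """Return (best_min_reward, assignment) maximizing the minimum expected
    reward over players; rewards[i][k] is player i's mean reward on arm k
    (N players, K >= N arms).  Brute force, for illustration only."""
    n, k = len(rewards), len(rewards[0])
    best_val, best_assign = float("-inf"), None
    for assign in permutations(range(k), n):     # collision-free assignments
        val = min(rewards[i][assign[i]] for i in range(n))
        if val > best_val:
            best_val, best_assign = val, assign
    return best_val, best_assign

# Two players, three arms: heterogeneous mean rewards.
rewards = [[0.9, 0.2, 0.5],
           [0.8, 0.7, 0.1]]
val, assign = max_min_matching(rewards)
# best matching gives player 0 arm 0 (0.9) and player 1 arm 1 (0.7),
# so the worst-off player earns 0.7
```

Note how the max-min solution can differ from maximizing the sum of rewards; here it protects player 1 rather than greedily assigning both players their individually best arms.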
Paperid:1077
Authors:Xiandong Zou · Wanyu LIN · Yuchen Li · Pan Zhou
Abstract:
Aligning Large Language Model (LLM) responses with human preferences is vital for building safe and controllable AI systems. While preference optimization methods based on Plackett-Luce (PL) and Bradley-Terry (BT) models have shown promise, they face challenges such as poor handling of harmful content, inefficient use of dispreferred responses, and, specifically for PL, high computational costs. To address these issues, we propose Hard Preference Sampling (HPS), a novel framework for robust and efficient human preference alignment. HPS introduces a training loss that prioritizes the most preferred response while rejecting all dispreferred and harmful ones. It emphasizes "hard" dispreferred responses, those closely resembling preferred ones, to enhance the model's rejection capabilities. By leveraging a single-sample Monte Carlo sampling strategy, HPS reduces computational overhead while maintaining alignment quality. Theoretically, HPS improves sample efficiency over existing PL methods and maximizes the reward margin between preferred and dispreferred responses, ensuring clearer distinctions. Experiments on the HH-RLHF and PKU-Safety datasets validate HPS's effectiveness, achieving comparable BLEU and reward scores while greatly improving reward margins and thus reducing harmful content generation.
Paperid:1078
Authors:Zhihao Li · Xue JIANG · liyuan liu · xuelin zhang · Hong Chen · Feng Zheng
Abstract:
Large language models (LLMs) have demonstrated remarkable potential in handling natural language processing (NLP) tasks and beyond. LLMs can usually be categorized as transformer decoder-only models (DOMs), utilizing Next-Token-Prediction (NTP) as their pre-training methodology. Despite their tremendous empirical successes, the theoretical understanding of how NTP pre-training affects the model's generalization behavior is lacking. To fill this gap, we establish a fine-grained generalization analysis for NTP pre-training based on Rademacher complexity, where the dependence between tokens is also addressed. Technically, a novel decomposition of Rademacher complexity is developed to study DOMs from the perspectives of the representation learner and the token predictor, respectively. Furthermore, upper bounds on the covering number are established for multi-layer and multi-head transformer-decoder models under the Frobenius norm, which theoretically pioneers the incorporation of the mask matrix within the self-attention mechanism. Our results reveal that the generalization ability of NTP pre-training is affected quantitatively by the number of token sequences $N$, the maximum length of sequence $m$, and the count of parameters in the transformer model $\Theta$. Additionally, experiments on public datasets verify our theoretical findings.
Paperid:1079
Authors:Ananth Balashankar · Ziteng Sun · Jonathan Berant · Jacob Eisenstein · Michael Collins · Adrian Hutter · Jong Lee · Chirag Nagpal · Flavien Prost · Aradhana Sinha · Ananda Suresh · Ahmad Beirami
Abstract:
Language model alignment is a critical step in training modern generative language models. Alignment aims to improve the win rate of a sample from the aligned model against the base model. Today, we are increasingly using inference-time algorithms (e.g., Best-of-$N$, controlled decoding, tree search) to decode from language models rather than standard sampling. We show that this train/test mismatch makes the standard RLHF framework sub-optimal in view of such inference-time methods. To this end, we propose a framework for inference-aware alignment (InfAlign), which aims to optimize the *inference-time win rate* of the aligned policy against the base model. We prove that for any inference-time decoding procedure, the optimal aligned policy is the solution to the standard RLHF problem with a *transformation* of the reward. This motivates us to provide the calibrate-and-transform RL (InfAlign-CTRL) algorithm to solve this problem, which involves a reward calibration step and a KL-regularized reward maximization step with a transformation of the calibrated reward. For best-of-$N$ sampling and best-of-$N$ jailbreaking, we propose specific transformations offering up to 3-8% improvement on inference-time win rates. Finally, we also show that our proposed reward calibration method is a strong baseline for optimizing standard win rate.
Paperid:1080
Authors:Feifei Kou · Jiahao Wang · Lei Shi · Yuhan Yao · Yawen Li · Suguo Zhu · Zhongbao Zhang · Junping Du
Abstract:
Long-term time series forecasting has been widely studied, yet two aspects remain insufficiently explored: the interaction learning between different frequency components and the exploitation of periodic characteristics inherent in timestamps. To address the above issues, we propose CFPT, a novel method that empowers time series forecasting through Cross-Frequency Interaction (CFI) and Periodic-Aware Timestamp Modeling (PTM). To learn cross-frequency interactions, we propose the CFI branch, which processes signals in the frequency domain and captures their interactions through a feature fusion mechanism. Furthermore, to enhance prediction performance by leveraging timestamp periodicity, we design the PTM branch, which transforms timestamp sequences into 2D periodic tensors and utilizes 2D convolution to capture both intra-period dependencies and inter-period correlations of time series based on timestamp patterns. Extensive experiments on multiple real-world benchmarks demonstrate that CFPT achieves state-of-the-art performance in long-term forecasting tasks. Our code will soon be available.
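The "2D periodic tensor" idea can be sketched generically: fold a 1-D sequence of length L with period p into an (L // p) x p grid, so that rows expose intra-period structure and columns expose inter-period structure for a 2-D convolution. This is a hypothetical minimal sketch of the folding step only, not CFPT's actual pipeline; the function name is mine:

```python
def fold_periodic(seq, period):
    """Reshape a 1-D sequence into one row per period, truncating any
    incomplete final period.  Rows capture intra-period dependencies;
    columns (same offset across periods) capture inter-period correlations."""
    rows = len(seq) // period
    return [seq[r * period:(r + 1) * period] for r in range(rows)]

hours = list(range(48))            # two days of hourly timestamps
grid = fold_periodic(hours, 24)    # 2 x 24 grid: one row per day
```

A 2-D kernel sliding over `grid` then simultaneously sees neighboring hours (within a row) and the same hour on neighboring days (within a column).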
Paperid:1081
Authors:Wenxin Tai · Ting Zhong · Goce Trajcevski · Fan Zhou
Abstract:
This work presents the first systematic investigation into the trustworthiness of explanations generated by self-interpretable graph neural networks (GNNs), revealing why models trained with different random seeds yield inconsistent explanations. We identify redundancy -- resulting from weak conciseness constraints -- as the root cause of both explanation inconsistency and its associated inaccuracy, ultimately hindering user trust and limiting GNN deployment in high-stakes applications. Our analysis demonstrates that redundancy is inherently difficult to eliminate; however, a simple ensemble strategy can mitigate its detrimental effects. We validate our findings through extensive experiments across diverse datasets, model architectures, and self-interpretable GNN frameworks, providing a benchmark to guide future research on addressing redundancy and advancing GNN deployment in critical domains.
Paperid:1082
Authors:Haichen Hu · Rui Ai · Stephen Bates · David Simchi-Levi
Abstract:
Contextual sequential decision-making problems play a crucial role in machine learning, encompassing a wide range of downstream applications such as bandits, sequential hypothesis testing, and online risk control. These applications often require different statistical measures, including expectation, variance, and quantiles. In this paper, we provide a universal admissible algorithm framework for dealing with all kinds of contextual online decision-making problems that directly learns the whole underlying unknown distribution instead of focusing on individual statistics. This is much more difficult because the dimension of the regression is uncountably infinite, and any existing linear contextual bandits algorithm will result in infinite regret. To overcome this issue, we propose an efficient infinite-dimensional functional regression oracle for contextual cumulative distribution functions (CDFs), where each data point is modeled as a combination of context-dependent CDF basis functions. Our analysis reveals that the decay rate of the eigenvalue sequence of the design integral operator governs the regression error rate and, consequently, the utility regret rate. Specifically, when the eigenvalue sequence exhibits a polynomial decay of order $\frac{1}{\gamma}\ge 1$, the utility regret is bounded by $\tilde{\mathcal{O}}\big(T^{\frac{3\gamma+2}{2(\gamma+2)}}\big)$. By setting $\gamma=0$, this recovers the existing optimal regret rate for contextual bandits with finite-dimensional regression and is optimal under a stronger exponential decay assumption. Additionally, we provide a numerical method to compute the eigenvalue sequence of the integral operator, enabling the practical implementation of our framework.
Paperid:1083
Authors:Jeremy Bernstein · Laker Newhouse
Abstract:
An old idea in optimization theory says that since the gradient is a dual vector it may not be subtracted from the weights without first being mapped to the primal space where the weights reside. We take this idea seriously in this paper and construct such a duality map for general neural networks. Our map, which we call modular dualization, forms a unifying theoretical basis for training algorithms that are a) fast and b) scalable. Modular dualization involves first assigning operator norms to layers based on the semantics of each layer, and then using these layerwise norms to recursively induce a duality map on the weight space of the full neural architecture. We derive GPU-friendly algorithms for dualizing Embed, Linear and Conv2D layers—the latter two methods are based on a Newton-Schulz iteration. We conclude with small experiments demonstrating the speed, scalability and novel numerical properties of duality-based optimizers. Our methods were used in the Muon optimizer, which recently set speed records for training NanoGPT and was scaled up to a 1.5 billion parameter transformer.
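As a rough, self-contained illustration of the kind of Newton-Schulz iteration mentioned above (this is the classic cubic variant for approximating a matrix's nearest orthogonal factor, not necessarily the exact polynomial used in the paper or in Muon):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=12):
    """Approximate the polar factor U V^T of G (the nearest
    orthogonal matrix, where G = U S V^T is the SVD) with the
    cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X.

    Dividing by the Frobenius norm first guarantees the spectral
    norm is at most 1, which the iteration needs to converge.
    """
    X = G / np.linalg.norm(G)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```

Each step uses only matrix multiplications, which is what makes this style of dualization GPU-friendly.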
Paperid:1084
Authors:Xichen Guo · Feng Xie · Yan Zeng · Hao Zhang · zhi geng
Abstract:
We consider the problem of selecting instrumental variables from observational data, a fundamental challenge in causal inference. Existing methods mostly focus on additive linear, constant effects models, limiting their applicability in complex real-world scenarios. In this paper, we tackle a more general and challenging setting: the additive non-linear, constant effects model. We first propose a novel testable condition, termed the Cross Auxiliary-based independent Test (CAT) condition, for selecting the valid IV set. We show that this condition is both necessary and sufficient for identifying valid instrumental variable sets within such a model under milder assumptions. Building on this condition, we develop a practical algorithm for selecting the set of valid instrumental variables. Extensive experiments on synthetic data and two real-world datasets demonstrate the effectiveness and robustness of our proposed approach, highlighting its potential for broader applications in causal analysis.
Paperid:1085
Authors:Noa Rubin · Kirsten Fischer · Javed Lindner · Inbar Seroussi · Zohar Ringel · Michael Krämer · Moritz Helias
Abstract:
Theoretically describing feature learning in neural networks is crucial for understanding their expressive power and inductive biases, motivating various approaches. Some approaches describe network behavior after training through a simple change in kernel scale from initialization, resulting in a generalization power comparable to a Gaussian process. Conversely, in other approaches training results in the adaptation of the kernel to the data, involving complex directional changes to the kernel. While these approaches capture different facets of network behavior, their relationship and respective strengths across scaling regimes remain an open question. This work presents a theoretical framework of multi-scale adaptive feature learning bridging these approaches. Using methods from statistical mechanics, we derive analytical expressions for network output statistics which are valid across scaling regimes and in the continuum between them. A systematic expansion of the network's probability distribution reveals that mean-field scaling requires only a saddle-point approximation, while standard scaling necessitates additional correction terms. Remarkably, we find across regimes that kernel adaptation can be reduced to an effective kernel rescaling when predicting the mean network output of a linear network. However, even in this case, the multi-scale adaptive approach captures directional feature learning effects, providing richer insights than what could be recovered from a rescaling of the kernel alone.
Paperid:1086
Authors:Ping Yu · Weizhe Yuan · Olga Golovneva · Tianhao Wu · Sainbayar Sukhbaatar · JASON WESTON · Jing Xu
Abstract:
Training data quality is one of the most important drivers of final model quality. In this work, we introduce a method for evaluating data integrity based on the assumption that low-quality input prompts result in high-variance, low-quality responses. This is achieved by measuring the rejected response quality and the reward gap between the chosen and rejected preference pair. Our method, Rejecting Instruction Preferences (RIP), can be used to filter prompts from existing training sets, or to make high quality synthetic datasets, yielding large performance gains across various benchmarks compared to unfiltered data. Using Llama 3.1-8B-Instruct, RIP improves AlpacaEval2 LC Win Rate by 9.4%, Arena-Hard by 8.7%, and WildBench by 9.9%. Using Llama 3.3-70B-Instruct, RIP improves Arena-Hard from 67.5 to 82.9, from 18th place to 6th overall on the leaderboard.
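A minimal sketch of a filtering rule in the spirit described above. The thresholds, field names, and the exact keep/drop criterion are assumptions for illustration (the abstract suggests that low-quality prompts yield low-quality rejected responses and high-variance reward gaps), not the paper's actual procedure:

```python
def rip_filter(pairs, min_rejected_reward=0.3, max_reward_gap=0.5):
    """Keep prompts whose preference pair looks trustworthy:
    the rejected response should not score too poorly, and the
    chosen-rejected reward gap should not be extreme, since both
    symptoms suggest a noisy or underspecified prompt.
    (Thresholds and dict keys here are hypothetical.)"""
    kept = []
    for p in pairs:
        gap = p["chosen_reward"] - p["rejected_reward"]
        if p["rejected_reward"] >= min_rejected_reward and gap <= max_reward_gap:
            kept.append(p)
    return kept
```

In practice the rewards would come from a reward model scoring both responses of each preference pair.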
Paperid:1087
Authors:Xiaoyu Ma · Hao Chen · Yongjian Deng
Abstract:
Different modalities exhibit considerable gaps in optimization trajectories, including speeds and paths, which lead to *modality laziness* and *modality clash* when jointly training multimodal models, resulting in insufficient and imbalanced multimodal learning. Existing methods focus on enforcing the weak modality by adding modality-specific optimization objectives, aligning their optimization speeds, or decomposing multimodal learning to enhance unimodal learning. These methods fail to achieve both unimodal sufficiency and multimodal balance. In this paper, we, for the first time, address both concerns by proposing multimodal Data Remixing, including decoupling multimodal data and filtering hard samples for each modality to mitigate modality imbalance; and then batch-level reassembling to align the gradient directions and avoid cross-modal interference, thus enhancing unimodal learning sufficiency. Experimental results demonstrate that our method can be seamlessly integrated with existing approaches, improving accuracy by approximately **6.50\%$\uparrow$** on CREMA-D and **3.41\%$\uparrow$** on Kinetics-Sounds, without training set expansion or additional computational overhead during inference. The source code will be released soon.
Paperid:1088
Authors:Gabriel Thompson · Kai Yue · Chau-Wai Wong · Huaiyu Dai
Abstract:
Decentralized federated learning (DFL) is a collaborative machine learning framework for training a model across participants without a central server or raw data exchange. DFL faces challenges due to statistical heterogeneity, as participants often possess different data distributions that reflect local environments and user behaviors. Recent work has shown that the neural tangent kernel (NTK) approach, when applied to federated learning in a centralized framework, can lead to improved performance. We propose an approach leveraging the NTK to train client models in the decentralized setting, while introducing a synergy between NTK-based evolution and model averaging. This synergy exploits inter-model variance and improves both accuracy and convergence in heterogeneous settings. Empirical results demonstrate that our approach consistently achieves higher accuracy than baselines in highly heterogeneous settings, where other approaches often underperform. Additionally, it reaches target performance in 4.6 times fewer communication rounds. We evaluate our approach across multiple datasets, network topologies, and heterogeneity settings to ensure robustness and generalizability.
Paperid:1089
Authors:Xinhao Zheng · Huiqi Deng · Quanshi Zhang
Abstract:
This paper focuses on the fundamental challenge of partitioning input variables in attribution methods for Explainable AI, particularly in Shapley value-based approaches. Previous methods always compute attributions given a predefined partition but lack theoretical guidance on how to form meaningful variable partitions. We identify that attribution conflicts arise when the attribution of a coalition differs from the sum of its individual variables' attributions. To address this, we analyze the numerical effects of AND-OR interactions in AI models and extend the Shapley value to a new attribution metric for variable coalitions. Our theoretical findings reveal that specific interactions cause attribution conflicts, and we propose three metrics to evaluate coalition faithfulness. Experiments on synthetic data, NLP, image classification, and the game of Go validate our approach, demonstrating consistency with human intuition and practical applicability.
Paperid:1090
Authors:Sahil Goyal · Debapriya Tula · Gagan Jain · Pradeep Shenoy · Prateek Jain · Sujoy Paul
Abstract:
Recent advances in visual generation have made significant strides in producing content of exceptional quality. However, most methods suffer from a fundamental problem: a bottleneck in inference computational efficiency. Most of these algorithms involve multiple passes over a transformer model to generate tokens or denoise inputs. However, the model size is kept consistent throughout all iterations, which makes it computationally expensive. In this work, we aim to address this issue primarily through two key ideas - (a) not all parts of the generation process need equal compute, and we design a decode-time model scaling schedule to utilize compute effectively, and (b) we can cache and reuse some of the computation. Combining these two ideas leads to using smaller models to process more tokens while large models process fewer tokens. These different-sized models do not increase the parameter size, as they share parameters. We rigorously experiment with ImageNet 256$\times$256, UCF101, and Kinetics-600 to showcase the efficacy of the proposed method for image/video generation and frame prediction. Our experiments show that with almost $3\times$ less compute than baseline, our model obtains competitive performance.
Paperid:1091
Authors:Yannis Montreuil · Axel Carlier · Lai Xing Ng · Wei Tsang Ooi
Abstract:
Learning-to-Defer (L2D) facilitates optimal task allocation between AI systems and decision-makers. Despite its potential, we show that current two-stage L2D frameworks are highly vulnerable to adversarial attacks, which can misdirect queries or overwhelm decision agents, significantly degrading system performance. This paper conducts the first comprehensive analysis of adversarial robustness in _two-stage_ L2D frameworks. We introduce two novel attack strategies—_untargeted_ and _targeted_—that exploit inherent structural vulnerabilities in these systems. To mitigate these threats, we propose SARD, a robust, convex deferral algorithm rooted in Bayes and $(\mathcal{R},\mathcal{G})$-consistency. Our approach guarantees optimal task allocation under adversarial perturbations for all surrogates in the cross-entropy family. Extensive experiments on classification, regression, and multi-task benchmarks validate the robustness of SARD.
Paperid:1092
Authors:Zelin Xu · Yupu Zhang · Tingsong Xiao · Maitane Lizaso · Jose Gonzalez-Ondina · Zibo Liu · Shigang Chen · Zhe Jiang
Abstract:
Over 40\% of the global population lives within 100 kilometers of the coast, which contributes more than 8 trillion dollars annually to the global economy. Unfortunately, coastal ecosystems are increasingly vulnerable to more frequent and intense extreme weather events and rising sea levels due to climate change. Coastal scientists develop numerical models to simulate complex physical processes for forecasting water quality and storm surges. Such numerical models are slow and expensive. In recent years, deep learning has become a promising alternative to reduce the cost of physical models. However, progress has been hindered by the lack of a large-scale, high-resolution coastal simulation dataset to train and evaluate deep learning models. Existing studies often focus on relatively small datasets and simple processes. To fill this gap, we introduce a decade-long, high-resolution ($<$100m) coastal circulation modeling dataset on a real-world 3D mesh with around 6 million cells. The dataset contains key oceanography variables (e.g., current velocities, free surface level, temperature, salinity) alongside external atmospheric and river forcings. We evaluated a customized ViT model that takes initial and boundary conditions and external forcings and predicts ocean variables at varying lead times. The dataset provides an opportunity to benchmark novel deep learning models for large-scale, high-resolution regional simulations in coastal sciences.
Paperid:1093
Authors:Quan Wei · Chung-Yiu Yau · Hoi To Wai · Yang Zhao · Dongyeop Kang · Youngsuk Park · Mingyi Hong
Abstract:
Supervised fine-tuning is a standard method for adapting pre-trained large language models (LLMs) to downstream tasks. Quantization has been recently studied as a post-training technique for efficient LLM deployment. To obtain quantized fine-tuned LLMs, conventional pipelines would first fine-tune the pre-trained models, followed by post-training quantization. This often yields suboptimal performance as it fails to leverage the synergy between fine-tuning and quantization. To effectively realize low-bit quantization of weights, activations, and KV caches in LLMs, we propose an algorithm named Rotated Straight-Through-Estimator (RoSTE), which combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy that identifies an effective rotation configuration to reduce activation outliers. We provide theoretical insights on RoSTE by analyzing its prediction error when applied to an overparameterized least square quantized training problem. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration. Experiments on Pythia and Llama models of different sizes demonstrate the effectiveness of RoSTE. Compared to existing post-SFT quantization baselines, our method consistently achieves superior performance across various tasks and different LLM architectures.
Paperid:1094
Authors:Yupeng Hou · Jianmo Ni · Zhankui He · Noveen Sachdeva · Wang-Cheng Kang · Ed Chi · Julian McAuley · Derek Cheng
Abstract:
Generative recommendation (GR) is an emerging paradigm where user actions are tokenized into discrete token patterns and autoregressively generated as predictions. However, existing GR models tokenize each action independently, assigning the same fixed tokens to identical actions across all sequences without considering contextual relationships. This lack of context-awareness can lead to suboptimal performance, as the same action may hold different meanings depending on its surrounding context. To address this issue, we propose ActionPiece to explicitly incorporate context when tokenizing action sequences. In ActionPiece, each action is represented as a *set* of item features, which serve as the initial tokens. Given the action sequence corpora, we construct the vocabulary by merging feature patterns as new tokens, based on their co-occurrence frequency both within individual sets and across adjacent sets. Considering the unordered nature of feature sets, we further introduce set permutation regularization, which produces multiple segmentations of action sequences with the same semantics. Experiments on public datasets demonstrate that ActionPiece consistently outperforms existing action tokenization methods, improving NDCG@$10$ by $6.00$% to $12.82$%.
Paperid:1095
Authors:Tosca Lechner · Alex Bie · Gautam Kamath
Abstract:
We consider the question of learnability of distribution classes in the presence of adaptive adversaries, that is, adversaries capable of intercepting the samples requested by a learner and applying manipulations with full knowledge of the samples before passing them on to the learner. This stands in contrast to oblivious adversaries, who can only modify the underlying distribution the samples come from but not their i.i.d.\ nature. We formulate a general notion of learnability with respect to adaptive adversaries, taking into account the budget of the adversary. We show that learnability with respect to additive adaptive adversaries is a strictly stronger condition than learnability with respect to additive oblivious adversaries.
Paperid:1096
Authors:Yaxin Hou · Yuheng Jia
Abstract:
This paper studies long-tailed semi-supervised learning (LTSSL) with distribution mismatch, where the class distribution of the labeled training data follows a long-tailed distribution and mismatches with that of the unlabeled training data. Most existing methods introduce auxiliary classifiers (experts) to model various unlabeled data distributions and produce pseudo-labels, but the expertise of the various experts is not fully utilized. We observe that different experts are good at predicting different intervals of samples, e.g., the long-tailed expert is skilled in samples located in the head interval and the uniform expert excels in samples located in the medium interval. Therefore, we propose a dynamic expert assignment module that can estimate the class membership (i.e., head, medium, or tail class) of samples, and dynamically assigns the suitable expert to each sample based on the estimated membership to produce high-quality pseudo-labels in the training phase and predictions in the testing phase. We also theoretically reveal that integrating different experts' strengths will lead to a smaller generalization error bound. Moreover, we find that the deeper features are more biased toward the head class but with more discriminative ability, while the shallower features are less biased but also with less discriminative ability. We, therefore, propose a multi-depth feature fusion module to utilize different depth features to mitigate the model bias. Our method demonstrates its effectiveness through comprehensive experiments on the CIFAR-10-LT, STL-10-LT, and SVHN-LT datasets across various settings. The code has been submitted to the Supplementary Material.
Paperid:1097
Authors:Xinghe Fu · Zhiyuan Yan · Zheng Yang · Taiping Yao · Yandan Zhao · Shouhong Ding · Xi Li
Abstract:
Fake images, created by recently advanced generative models, have become increasingly indistinguishable from real ones, making their detection crucial, urgent, and challenging. This paper introduces PiD (Pixelwise Decomposition Residuals), a novel detection method that focuses on residual signals within images. Generative models are designed to optimize high-level semantic content (principal components), often overlooking low-level signals (residual components). PiD leverages this observation by disentangling residual components from raw images, encouraging the model to uncover more underlying and general forgery clues independent of semantic content. Compared to prior approaches that rely on reconstruction techniques or high-frequency information, PiD is computationally efficient and does not rely on any generative models for reconstruction. Specifically, PiD operates at the pixel level, mapping the pixel vector to another color space (e.g., YUV) and then quantizing the vector. The pixel vector is mapped back to the RGB space and the quantization loss is taken as the residual for AIGC detection. Our experimental results are striking and highly surprising: PiD achieves 98% accuracy on the widely used GenImage benchmark, highlighting its effectiveness and generalization performance.
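A minimal sketch of the pixel-level residual computation described above, assuming a standard BT.601 RGB-to-YUV transform and a uniform quantizer with a hypothetical step size; the paper's exact color space and quantizer may differ:

```python
import numpy as np

# BT.601 RGB -> YUV transform; its inverse maps YUV back to RGB.
RGB2YUV = np.array([[0.299, 0.587, 0.114],
                    [-0.14713, -0.28886, 0.436],
                    [0.615, -0.51499, -0.10001]])

def pixel_quantization_residual(img, step=8.0):
    """Map each RGB pixel to YUV, quantize on a uniform grid,
    map back to RGB, and return the quantization loss. This
    residual isolates low-level signal that generative models,
    which optimize high-level semantics, tend to reproduce poorly.
    The step size here is an illustrative choice."""
    yuv = img @ RGB2YUV.T                     # (..., 3) RGB -> YUV
    yuv_q = np.round(yuv / step) * step       # uniform quantization
    recon = yuv_q @ np.linalg.inv(RGB2YUV).T  # YUV -> RGB
    return img - recon
```

The residual is bounded by the quantization step (scaled by the inverse transform), so it is a small, largely semantics-free signal suitable as detector input.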
Paperid:1098
Authors:Shilong Tao · Zhe Feng · Haonan Sun · Zhanxing Zhu · Yunhuai Liu
Abstract:
Multi-solid systems are foundational to a wide range of real-world applications, yet modeling their complex interactions remains challenging. Existing deep learning methods predominantly rely on implicit modeling, where the factors influencing solid deformation are not explicitly represented but are instead indirectly learned. However, as the number of solids increases, these methods struggle to accurately capture intricate physical interactions. In this paper, we introduce a novel explicit modeling paradigm that incorporates factors influencing solid deformation through structured modules. Specifically, we present Unisoma, a unified and flexible Transformer-based model capable of handling variable numbers of solids. Unisoma directly captures physical interactions using contact modules and an adaptive interaction allocation mechanism, and learns the deformation through a triplet relationship. Compared to implicit modeling techniques, explicit modeling is better suited for multi-solid systems with diverse coupling patterns, as it enables detailed treatment of each solid while preventing information blending and confusion. Experimentally, Unisoma achieves consistent state-of-the-art performance across seven well-established datasets and two complex multi-solid tasks.
Paperid:1099
Authors:Tianyu Liu · kai sun · Fuchun Sun · Yu Luo · Yuanlong Zhang
Abstract:
Temporal sequences, even after stationarization, often exhibit \textbf{leptokurtic distributions} with fat tails and persistent distribution shifts. These properties destabilize feature dynamics, amplify model variance, and hinder convergence in time series forecasting. To address this, we propose MorphingFlow (MoF), a framework that combines a spline-based transform layer (Flow) and a test-time-trained method (Morph), which adaptively normalizes non-stationary, fat-tailed distributions while preserving critical extremal features. MoF ensures that inputs remain within a network’s effective activation space—a structured, normal-like distribution—even under distributional drift. Experiments across eight datasets show that MoF achieves state-of-the-art performance: with a simple linear backbone, it matches the performance of SOTA models on datasets such as Electricity and ETTh2. When paired with a patch-based Mamba architecture, MoF outperforms the closest competitor by 6.3\% on average and reduces forecasting errors in fat-tail datasets such as Exchange by 21.7\%. Notably, MoF acts as a plug-and-play module, boosting performance in existing models without architectural changes.
Paperid:1100
Authors:Aayushya Agarwal · Lawrence Pileggi · Gauri Joshi
Abstract:
Federated learning harnesses the power of distributed optimization to train a unified machine learning model across separate clients. However, heterogeneous data distributions and computational workloads can lead to inconsistent updates and limit model performance. This work tackles these challenges by proposing FedECADO, a new algorithm inspired by a dynamical system representation of the federated learning process. FedECADO addresses non-IID data distribution through an aggregate sensitivity model that reflects the amount of data processed by each client. To tackle heterogeneous computing, we design a multi-rate integration method with adaptive step-size selections that synchronizes active client updates in continuous time. Compared to prominent techniques, including FedProx, FedExp, and FedNova, FedECADO achieves higher classification accuracies in numerous heterogeneous scenarios.
Paperid:1101
Authors:Zhixin Li · Yuheng Jia · Hui LIU · Junhui Hou
Abstract:
Although deep clustering is an unsupervised approach without relying on labels, it still requires artificially designed supervision signals to train models suitable for clustering. Previous deep clustering methods have investigated various supervision signals like simple similarity and pseudo labels, but most of them lack an exploration into the training progress of each sample. In this work, we observe that the sample stability in unsupervised training progress is closely associated with clustering prediction and network memorization at the level of individual samples. Specifically, we discover that samples with unstable representations across different epochs are more likely to be mispredicted. Meanwhile, these samples are more difficult to memorize than regular samples and appear to be atypical. Based on these observations, we leverage sample stability to construct supervision signals for the first time. We propose a novel deep clustering method that learns from sample stability, termed LFSS. Experiments on various widely used benchmark datasets demonstrate that our method outperforms multiple state-of-the-art methods, verifying the effectiveness of our approach. Additionally, LFSS can be used as a plug-in to improve the performance of various deep clustering methods. The code is included in the Supplementary Material.
Paperid:1102
Authors:Kaijie Zhu · Xianjun Yang · Jindong Wang · Wenbo Guo · William Wang
Abstract:
Recent research has shown that LLM agents are vulnerable to indirect prompt injection (IPI) attacks, where malicious tasks embedded in tool-retrieved information can redirect the agent to take unauthorized actions. Existing defenses against IPI have significant limitations: they either require substantial model training resources, lack effectiveness against sophisticated attacks, or harm normal utility. We present MELON (Masked re-Execution and TooL comparisON), a novel IPI defense. Our approach builds on the observation that under a successful attack, the agent’s next action becomes less dependent on user tasks and more on malicious tasks. Following this, we design MELON to detect attacks by re-executing the agent’s trajectory with a masked user prompt modified through a masking function. We identify an attack if the actions generated in the original and masked executions are similar. We also include three key designs to reduce potential false positives and false negatives. Extensive evaluation on the IPI benchmark AgentDojo demonstrates that MELON outperforms SOTA defenses in both attack prevention and utility preservation. Moreover, we show that combining MELON with a SOTA prompt augmentation defense (denoted MELON-Aug) further improves its performance. We also conduct a detailed ablation study to validate our key designs.
Paperid:1103
Authors:Corinna Cortes · Anqi Mao · Mehryar Mohri · Yutao Zhong
Abstract:
Class imbalance remains a major challenge in machine learning, especially in multi-class problems with long-tailed distributions. Existing methods, such as data resampling, cost-sensitive techniques, and logistic loss modifications, though popular and often effective, lack solid theoretical foundations. As an example, we demonstrate that cost-sensitive methods are not Bayes consistent. This paper introduces a novel theoretical framework for analyzing generalization in imbalanced classification. We propose a new class-imbalanced margin loss function for both binary and multi-class settings, prove its strong $H$-consistency, and derive corresponding learning guarantees based on empirical loss and a new notion of class-sensitive Rademacher complexity. Leveraging these theoretical results, we devise novel and general learning algorithms, IMMAX (*Imbalanced Margin Maximization*), which incorporate confidence margins and are applicable to various hypothesis sets. While our focus is theoretical, we also present extensive empirical results demonstrating the effectiveness of our algorithms compared to existing baselines.
Paperid:1104
Authors:Letian Chen · Nina Moorman · Matthew Gombolay
Abstract:
Reinforcement learning (RL) has demonstrated compelling performance in robotic tasks, but its success often hinges on the design of complex, ad hoc reward functions. Researchers have explored how Large Language Models (LLMs) could enable non-expert users to specify reward functions more easily. However, LLMs struggle to balance the importance of different features, generalize poorly to out-of-distribution robotic tasks, and cannot represent the problem properly with only text-based descriptions. To address these challenges, we propose ELEMENTAL (intEractive LEarning froM dEmoNstraTion And Language), a novel framework that combines natural language guidance with visual user demonstrations to better align robot behavior with user intentions. By incorporating visual inputs, ELEMENTAL overcomes the limitations of text-only task specifications, while leveraging inverse reinforcement learning (IRL) to balance feature weights and match the demonstrated behaviors optimally. ELEMENTAL also introduces an iterative feedback loop through self-reflection to improve feature, reward, and policy learning. Our experimental results demonstrate that ELEMENTAL outperforms prior work by 42.3% on task success, and achieves 41.3% better generalization in out-of-distribution tasks, highlighting its robustness in LfD.
Paperid:1105
Authors:Gabriel Tseng · Anthony Fuller · Marlena Reil · Henry Herzog · Patrick Beukema · Favyen Bastani · James Green · Evan Shelhamer · Hannah Kerner · David Rolnick
Abstract:
From crop mapping to flood detection, machine learning in remote sensing has a wide range of societally beneficial applications. The commonalities between remote sensing data in these applications present an opportunity for pretrained machine learning models tailored to remote sensing to reduce the labeled data and effort required to solve individual tasks. However, such models must be: (i) flexible enough to ingest input data of varying sensor modalities and shapes (i.e., of varying spatial and temporal dimensions), and (ii) able to model Earth surface phenomena of varying scales and types. To fill this gap, we present Galileo, a family of pretrained remote sensing models designed to flexibly process multimodal remote sensing data. We also introduce a novel and highly effective self-supervised learning approach to learn both large- and small-scale features, a challenge not addressed by previous models. Our Galileo models obtain state-of-the-art results across diverse remote sensing tasks.
Paperid:1106
Authors:Yinxuan Huang · KE LIANG · Jingao Xu · Zhuofan Dong · Xiaodong Qu · Yue Han · Wang Tianxiang · Ye Wang · Bin Zhou
Abstract:
Graph-based social recommendation (SR) models suffer from various noises in social graphs, which hinder their recommendation performance. Both graph-level redundancy and graph-level missing edges alter the social graph structure, which in turn affects the message propagation of graph neural networks (GNNs). Generative models, especially diffusion-based models, are commonly used to reconstruct higher-quality data from noisy originals. Motivated by this, a few works have applied them to social recommendation. However, they can only handle isotropic Gaussian noise and fail to leverage anisotropic noise. Since anisotropic relational structures are common in social graphs, existing models cannot sufficiently utilize the graph structure, which constrains their capacity for noise removal and their recommendation performance. Compared to the diffusion strategy, the flow matching strategy shows better ability to handle data with anisotropic noise, since it better preserves data structure during learning. Inspired by this, we propose RecFlow, the first flow-matching-based SR model. Concretely, RecFlow performs flow matching on the structure representations of social graphs. A conditional learning procedure is then designed for optimization. Extensive experiments demonstrate the promising performance of RecFlow from six aspects: superiority, effectiveness, robustness, sensitivity, convergence, and visualization.
Paperid:1107
Authors:Matteo Saponati · Pascal J. Sager · Pau Vilimelis Aceituno · Thilo Stadelmann · Benjamin F. Grewe
Abstract:
Self-attention is essential to Transformer architectures, yet how information is embedded in the self-attention matrices and how different objective functions impact this process remains unclear. We present a mathematical framework to analyze self-attention matrices by deriving the structures governing their weight updates. Using this framework, we demonstrate that bidirectional training induces symmetry in the weight matrices, while autoregressive training results in directionality and column dominance. Our theoretical findings are validated across multiple Transformer models — including ModernBERT, GPT, LLaMA3, and Mistral — and input modalities like text, vision, and audio. Finally, we apply these insights by showing that symmetric initialization improves the performance of encoder-only models on language tasks. This mathematical analysis offers a novel theoretical perspective on how information is embedded through self-attention, thereby improving the interpretability of Transformer models.
Paperid:1108
Authors:Michal Wilinski · Mononito Goswami · Nina Żukowska · Willa Potosnak · Artur Dubrawski
Abstract:
Time series foundation models (TSFMs) promise to be powerful tools for a wide range of applications. However, their internal representations and learned concepts are still not well understood. In this study, we investigate the structure and redundancy of representations across various TSFMs, examining the self-similarity of model layers within and across different model sizes. This analysis reveals block-like redundancy in the representations, which can be utilized for informed pruning to improve inference speed and efficiency. Additionally, we explore the concepts learned by these models—such as periodicity and trends—and how these can be manipulated through latent space steering to influence model behavior. Our experiments show that steering interventions can introduce new features, e.g., adding periodicity or trends to signals that initially lacked them. These findings underscore the value of representational analysis for optimizing models and demonstrate how conceptual steering offers new possibilities for more controlled and efficient time series analysis with TSFMs.
Paperid:1109
Authors:Franck TALLA · Edouard Grave · Herve Jegou
Abstract:
We address the problem of extending a pretrained large language model to a new domain that was not seen during training. Standard techniques, such as fine-tuning or low-rank adaptation (LoRA), are successful at domain adaptation, but do not formally add capacity to the model. This often leads to a trade-off between performing well on the new domain and degrading performance on the original domain. Here, we propose to revisit and improve adapters to extend LLMs. Our paper analyzes this extension problem from three angles: data, architecture and training procedure, which are advantageously considered jointly. The resulting method, called neutral residues, modifies adapters in a way that leads each new residual block to output near-zero values on the original domain. This solution leads to strong results when adapting a state-of-the-art model originally trained on English to a new language. Neutral residues significantly outperform competing approaches such as fine-tuning, LoRA or vanilla adapters in terms of the trade-off between learning the new language and not forgetting English.
Paperid:1110
Authors:Kentaro Kanamori · Ken Kobayashi · Satoshi Hara · Takuya Takagi
Abstract:
Algorithmic recourse aims to provide a recourse action for altering an unfavorable prediction given by a model into a favorable one (e.g., loan approval). In practice, it is also desirable to ensure that an action makes the real-world outcome better (e.g., loan repayment). We call this requirement improvement. Unfortunately, existing methods cannot ensure improvement unless we know the true oracle. To address this issue, we propose a framework for suggesting improvement-oriented actions from the long-term perspective. Specifically, we introduce a new online learning task of assigning actions to a given sequence of instances. We assume that we can observe delayed feedback on whether the past suggested action achieved improvement. Using the feedback, we estimate an action that can achieve improvement for each instance. To solve this task, we propose two approaches based on contextual linear bandit and contextual Bayesian optimization. Experimental results demonstrated that our approaches could assign improvement-oriented actions to more instances than the existing methods.
Paperid:1111
Authors:Seungbeom Lee · Munsun Jo · Jungseul Ok · Dongwoo Kim
Abstract:
Deep learning-based structure-based drug design aims to generate ligand molecules with desirable properties for protein targets. While existing models have demonstrated competitive performance in generating ligand molecules, they primarily focus on learning the chemical distribution of training datasets, often lacking effective steerability to ensure the desired chemical quality of generated molecules. To address this issue, we propose a multi-reward optimization framework that fine-tunes generative models for attributes, such as binding affinity, validity, and drug-likeness, together. Specifically, we derive direct preference optimization for a Bayesian flow network, used as a backbone for molecule generation, and integrate a reward normalization scheme to adopt multiple objectives. Experimental results show that our method generates more realistic ligands than baseline models while achieving higher binding affinity, expanding the Pareto front empirically observed in previous studies.
Paperid:1112
Authors:Zhaolu Tim Liu · Robert Peach · Mauricio Barahona
Abstract:
Kernel-based hypothesis tests offer a flexible, non-parametric tool to detect high-order interactions in multivariate data, beyond pairwise relationships. Yet, their scalability is limited due to the computationally demanding permutation schemes used to generate null approximations. Here we introduce a family of permutation-free high-order tests for joint independence and partial factorisations of $d$ variables. Our tests eliminate the need for permutation-based approximations by leveraging V-statistics and a novel cross-centring technique to yield test statistics with a standard normal limiting distribution under the null. We present implementations of the tests and showcase their efficacy and scalability through synthetic datasets. We also show applications inspired by causal discovery and feature selection, which highlight both the importance of high-order interactions in data and the need for efficient computational methods.
Paperid:1113
Authors:Damian Smith · Harvey Mannering · Antonia Marcu
Abstract:
A common belief in the representational comparison literature is that if two representations can be functionally aligned, they must capture similar information. In this paper we focus on model stitching and show that models can be functionally aligned, but represent very different information. Firstly, we show that discriminative models with very different biases can be stitched together. We then show that models trained to solve entirely different tasks on different data modalities, and even clustered random noise, can be successfully stitched into MNIST- or ImageNet-trained models. We end with a discussion of the wider impact of our results on the community's current beliefs. Overall, our paper draws attention to the need to correctly interpret the results of such functional similarity measures and highlights the need for approaches that capture informational similarity.
Paperid:1114
Authors:Andrei Polubarov · Nikita Lyubaykin · Alexander Derevyagin · Ilya Zisman · Denis Tarasov · Alexander Nikulin · Vladislav Kurenkov
Abstract:
In-Context Reinforcement Learning (ICRL) represents a promising paradigm for developing generalist agents that learn at inference time through trial-and-error interactions, analogous to how large language models adapt contextually, but with a focus on reward maximization. However, the scalability of ICRL beyond toy tasks and single-domain settings remains an open challenge. In this work, we present the first steps toward scaling ICRL by introducing a fixed, cross-domain model capable of learning behaviors through in-context reinforcement learning. Our results demonstrate that Algorithm Distillation, a framework designed to facilitate ICRL, offers a compelling and competitive alternative to expert distillation to construct versatile action models. These findings highlight the potential of ICRL as a scalable approach for generalist decision-making systems.
Paperid:1115
Authors:Yunlong Hou · Fengzhuo Zhang · Cunxiao Du · Xuan Zhang · Jiachun Pan · Tianyu Pang · Chao Du · Vincent Tan · Zhuoran Yang
Abstract:
Speculative decoding has emerged as a popular method to accelerate the inference of Large Language Models (LLMs) while retaining their superior text generation performance. Previous methods either adopt a fixed speculative decoding configuration regardless of the prefix tokens, or train draft models in an offline or online manner to align them with the context. This paper proposes a training-free online learning framework to adaptively choose the configuration of the hyperparameters for speculative decoding as text is being generated. We first formulate this hyperparameter selection problem as a Multi-Armed Bandit problem and provide a general speculative decoding framework BanditSpec. Furthermore, two bandit-based hyperparameter selection algorithms, UCBSpec and EXP3Spec, are designed and analyzed in terms of a novel quantity, the stopping time regret. We upper bound this regret under both stochastic and adversarial reward settings. By deriving an information-theoretic impossibility result, it is shown that the regret performance of UCBSpec is optimal up to universal constants. Finally, extensive empirical experiments with LLaMA3 and Qwen2 demonstrate that our algorithms are effective compared to existing methods, and the throughput is close to the oracle best hyperparameter in simulated real-life LLM serving scenarios with diverse input prompts.
Paperid:1116
Authors:Xin Wang · Shengfei Lyu · Luo Chi · Xiren Zhou · Huanhuan Chen
Abstract:
A key challenge in personalized healthcare is identifying optimal intervention sequences to guide temporal systems toward target outcomes, a novel problem we formalize as counterfactual target achievement. In addressing this problem, directly adopting counterfactual estimation methods faces compounding errors due to the unobservability of counterfactuals. To overcome this, we propose Variational Counterfactual Intervention Planning (VCIP), which reformulates the problem by modeling the conditional likelihood of achieving target outcomes, implemented through variational inference. By leveraging the g-formula to bridge the gap between interventional and observational log-likelihoods, VCIP enables reliable training from observational data. Experiments on both synthetic and real-world datasets show that VCIP significantly outperforms existing methods in target achievement accuracy.
Paperid:1117
Authors:Gengshuo Chang · Wei Zhang · Lehan Zhang
Abstract:
Matrix completion aims to recover missing entries in a data matrix using a subset of observed entries. Previous studies show that side information can greatly enhance completion accuracy, but most assume perfect side information, which is rarely available in practice. In this paper, we propose an orthogonal complement matrix completion (OCMC) model to address the challenge of matrix completion with incomplete side information. The model leverages the orthogonal complement projection derived from the available side information, generalizing traditional perfect-side-information matrix completion to scenarios with incomplete side information. Moreover, using probably approximately correct (PAC) learning theory, we show that the sample complexity of the OCMC model decreases quadratically with the completeness level. To efficiently solve the OCMC model, a linearized Lagrangian algorithm is developed with convergence guarantees. Experimental results show that the proposed OCMC model outperforms state-of-the-art methods on both synthetic data and real-world applications.
Paperid:1118
Authors:Ruocheng Wang · Zhuo Xia · Ge Yan · Junchi Yan
Abstract:
Differential equations are essential and popular in science and engineering. Recently, learning-based methods, including neural operators, have emerged as a promising paradigm. Along this direction, we explore its quantum counterpart and propose QuanONet, a quantum neural operator. Such quantum methods have not been well studied in the literature compared with their counterparts in other machine learning areas, and we try to develop an in-depth understanding of their pros and cons. Specifically, we design the architecture as a hardware-efficient and light-weight ansatz without impractical assumptions. Its circuit is purely quantum and thus more readily runs on quantum computers than classic-quantum hybrid circuits. By grounding itself in the operator approximation theorem for its quantum counterpart, as proved in this paper, QuanONet can fit differential equation operators to achieve general utility. In particular, we propose its improved version TF-QuanONet, which overcomes the difficulty of setting coefficients and improves robustness. Empirical results on various problems, including antiderivative operators and diffusion-reaction systems, demonstrate that QuanONet not only has strong approximation capabilities but also offers significant improvements in generalization ability compared with other quantum methods.
Paperid:1119
Authors:Genghan Zhang · Weixin Liang · Olivia Hsu · Kunle Olukotun
Abstract:
ML libraries, often written in architecture-specific programming languages (ASPLs) that target domain-specific architectures, are key to efficient ML systems. However, writing these high-performance ML libraries is challenging because it requires expert knowledge of ML algorithms and the ASPL. Large language models (LLMs), on the other hand, have shown general coding capabilities. However, challenges remain when using LLMs for generating ML libraries using ASPLs because 1) this task is complicated even for experienced human programmers and 2) there are limited code examples because of the esoteric and evolving nature of ASPLs. Therefore, LLMs need complex reasoning with limited data in order to complete this task. To address these challenges, we introduce an adaptive self-improvement agentic system. In order to evaluate the effectiveness of our system, we construct a benchmark of a typical ML library and generate ASPL code with both open and closed-source LLMs on this benchmark. Our results show improvements of up to $3.9\times$ over a baseline single LLM.
Paperid:1120
Authors:Zhaohong Huang · Yuxin Zhang · JingJing Xie · Fei Chao · Rongrong Ji
Abstract:
Recent advances in test-time adaptation (TTA) for Vision-Language Models (VLMs) have garnered increasing attention, particularly through the use of multiple augmented views of a single image to boost zero-shot generalization. Unfortunately, existing methods fail to strike a satisfactory balance between performance and efficiency, either due to excessive overhead of tuning text prompts or unstable benefits from handcrafted, training-free visual feature enhancement. In this paper, we present Global-Spatial Bias Learner (GS-Bias), an efficient and effective TTA paradigm that incorporates two learnable biases during TTA, unfolded as the global bias and spatial bias. Particularly, the global bias captures the global semantic features of a test image by learning consistency across augmented views, while spatial bias learns the semantic coherence between regions in the image’s spatial visual representation. It is worth highlighting that these two sets of biases are directly added to the logits output by the pretrained VLMs, which circumvents the full backpropagation through the VLM that hinders the efficiency of existing TTA methods. This endows GS-Bias with extremely high efficiency while achieving state-of-the-art performance on 15 benchmark datasets. For example, it achieves a 2.23% improvement over TPT in cross-dataset generalization and a 2.72% improvement in domain generalization, while requiring only 6.5% of TPT's memory usage on ImageNet.
Paperid:1121
Authors:Georgy Noarov · Riccardo Fogliato · Martin A Bertran · Aaron Roth
Abstract:
We study the design of adaptive, sequential experiments for unbiased average treatment effect (ATE) estimation in the design-based potential outcomes setting. Our goal is to develop adaptive designs offering *sublinear Neyman regret*, meaning their efficiency must approach that of the hindsight-optimal nonadaptive design. Recent work [Dai et al, 2023] introduced ClipOGD, the first method achieving $\widetilde{O}(\sqrt{T})$ expected Neyman regret under mild conditions. In this work, we propose adaptive designs with substantially stronger Neyman regret guarantees. In particular, we modify ClipOGD to obtain anytime $\widetilde{O}(\log T)$ Neyman regret under natural boundedness assumptions. Further, in the setting where experimental units have pre-treatment covariates, we introduce and study a class of contextual ``multigroup'' Neyman regret guarantees: Given a set of possibly overlapping groups based on the covariates, the adaptive design outperforms each group's best non-adaptive design. In particular, we develop a contextual adaptive design with $\widetilde{O}(\sqrt{T})$ anytime multigroup Neyman regret. We empirically validate the proposed designs through an array of experiments.
Paperid:1122
Authors:Hongzhi Huang · Defa Zhu · Banggu Wu · Zeng · Ya Wang · Qiyang Min · zhou Xun
Abstract:
Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance remains underexplored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance. Specifically, our approach expands input vocabularies to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Our findings establish tokenization as a critical factor in scaling laws and offer new insights into tokenizer design, paving the way for more efficient and powerful LLMs.
Paperid:1123
Authors:Yunshu Dai · Jianwei Fei · Fangjun Huang · Chip Hong Chang
Abstract:
As AI generative models evolve, face swap technology has become increasingly accessible, raising concerns over potential misuse. Celebrities may be manipulated without consent, and ordinary individuals may fall victim to identity fraud. To address these threats, we propose Secure Swap, a method that protects persons of interest (POI) from face-swapping abuse and embeds a unique, invisible watermark into non-POI swapped images for traceability. By introducing an ID Passport layer, Secure Swap redacts POI faces and generates watermarked outputs for non-POI. A detachable watermark encoder and decoder are trained with the model to ensure provenance tracing. Experimental results demonstrate that Secure Swap not only preserves face swap functionality but also effectively prevents unauthorized swaps of POI and detects different embedded models' watermarks with high accuracy. Specifically, our method achieves a 100% success rate in protecting POI and over 99% watermark extraction accuracy for non-POI. Besides fidelity and effectiveness, the robustness of protected models against image-level and model-level attacks in both online and offline application scenarios is also experimentally demonstrated.
Paperid:1124
Authors:Anh Nhu · Sanghyun Son · Ming Lin
Abstract:
In this work, we introduce Time-Aware World Model, a model-based approach designed to explicitly incorporate the temporal dynamics of environments. By conditioning on the time step size, $\Delta t$, and training over a diverse range of $\Delta t$ values - rather than relying on a fixed time step size - our model enables learning of both high- and low-frequency task dynamics in real-world control problems. Inspired by the information-theoretic principle that the optimal sampling rate varies depending on the underlying dynamics of different physical systems, our time-aware model enhances both performance and learning efficiency. Empirical evaluations demonstrate that our model consistently outperforms baseline approaches across different observation rates in various control tasks, using the same number of training samples and iterations. We will release our source code on GitHub upon the acceptance of the paper.
Paperid:1125
Authors:Jun Chen · Hong Chen · Yonghua Yu · Yiming Ying
Abstract:
In recent years, contrastive learning has achieved state-of-the-art performance in the territory of self-supervised representation learning. Many previous works have attempted to provide the theoretical understanding underlying the success of contrastive learning. Almost all of them rely on a default assumption, i.e., the label consistency assumption, which may not hold in practice (the probability of failure is called labeling error) due to the strength and randomness of common augmentation strategies, such as random resized crop (RRC). This paper investigates the theoretical impact of labeling error on the downstream classification performance of contrastive learning. We first reveal several significant negative impacts of labeling error on downstream classification risk. To mitigate these impacts, a data dimensionality reduction method (e.g., singular value decomposition, SVD) is applied to the original data to reduce false positive samples, and we establish both theoretical and empirical evaluations. Moreover, it is also found that SVD acts as a double-edged sword, which may lead to the deterioration of downstream classification accuracy due to the reduced connectivity of the augmentation graph. Based on the above observations, we suggest using a moderate embedding dimension (such as $512$ or $1024$ in our experiments), data inflation, weak augmentation, and SVD to ensure large graph connectivity and small labeling error, thereby improving model performance.
Paperid:1126
Authors:Marta Gentiloni Silveri · Antonio Ocello
Abstract:
Score-based Generative Models (SGMs) aim to sample from a target distribution by learning score functions using samples perturbed by Gaussian noise. Existing convergence bounds for SGMs in the $\mathcal{W}_2$-distance rely on stringent assumptions about the data distribution. In this work, we present a novel framework for analyzing $\mathcal{W}_2$-convergence in SGMs, significantly relaxing traditional assumptions such as log-concavity and score regularity. Leveraging the regularization properties of the Ornstein-Uhlenbeck (OU) process, we show that weak log-concavity of the data distribution evolves into log-concavity over time. This transition is rigorously quantified through a PDE-based analysis of the Hamilton-Jacobi-Bellman equation governing the log-density of the forward process. Moreover, we establish that the drift of the time-reversed OU process alternates between contractive and non-contractive regimes, reflecting the dynamics of concavity. Our approach circumvents the need for stringent regularity conditions on the score function and its estimators, relying instead on milder, more practical assumptions. We demonstrate the wide applicability of this framework through explicit computations on Gaussian mixture models, illustrating its versatility and potential for broader classes of data distributions.
Paperid:1127
Authors:Johnny Xi · Hugh Dance · Peter Orbanz · Benjamin Bloem-Reddy
Abstract:
Bivariate structural causal models (SCM) are often used to infer causal direction by examining their goodness-of-fit under restricted model classes. In this paper, we describe a parametrization of bivariate SCMs in terms of a causal velocity, by viewing the cause variable as time in a dynamical system. The velocity implicitly defines counterfactual curves via the solution of initial value problems where the observation specifies the initial condition. Using tools from measure transport, we obtain a unique correspondence between SCMs and the score function of the generated distribution via its causal velocity. Based on this, we derive an objective function that directly regresses the velocity against the score function, the latter of which can be estimated non-parametrically from observational data. We use this to develop a method for bivariate causal discovery that extends beyond known model classes such as additive or location-scale noise, and that requires no assumptions on the noise distributions. When the score is estimated well, the objective is also useful for detecting model non-identifiability and misspecification. We present positive results in simulation and benchmark experiments where many existing methods fail, and perform ablation studies to examine the method's sensitivity to accurate score estimation.
Paperid:1128
Authors:Chenchen Gu · Xiang Li · Rohith Kuditipudi · Percy Liang · Tatsunori Hashimoto
Abstract:
Prompt caching in large language models (LLMs) results in data-dependent timing variations: cached prompts are processed faster than non-cached prompts. These timing differences introduce the risk of side-channel timing attacks. For example, if the cache is shared across users, an attacker could identify cached prompts from fast API response times to learn information about other users' prompts. Because prompt caching may cause privacy leakage, transparency around the caching policies of API providers is important. To this end, we develop and conduct statistical audits to detect prompt caching in real-world LLM API providers. We detect global cache sharing across users in seven API providers, including OpenAI, resulting in potential privacy leakage about users' prompts. Timing variations due to prompt caching can also result in leakage of information about model architecture. Namely, we find evidence that OpenAI's embedding model is a decoder-only Transformer, which was previously not publicly known.
Paperid:1129
Authors:Mei Li · Yuxiang Lu · Qinyan Dai · Suizhi Huang · Yue Ding · Hongtao Lu
Abstract:
Continual Learning (CL) strives to learn incrementally across tasks while mitigating catastrophic forgetting. A key challenge in CL is balancing stability (retaining prior knowledge) and plasticity (learning new tasks). While representative gradient projection methods ensure stability, they often limit plasticity. Model merging techniques offer promising solutions, but prior methods typically rely on empirical assumptions and carefully selected hyperparameters. In this paper, we explore the potential of model merging to enhance the stability-plasticity trade-off, providing theoretical insights that underscore its benefits. Specifically, we reformulate the merging mechanism using Bayesian continual learning principles and derive a closed-form solution for the optimal merging coefficient that adapts to the diverse characteristics of tasks. To validate our approach, we introduce a two-stage framework named BECAME, which synergizes the expertise of gradient projection and adaptive merging. Extensive experiments show that our approach outperforms state-of-the-art CL methods and existing merging strategies.
Paperid:1130
Authors:Zhenqiao Song · Tianxiao Li · Lei Li · Martin Min
Abstract:
Designing protein-binding proteins with high affinity is critical in biomedical research and biotechnology. Despite recent advancements targeting specific proteins, the ability to create high-affinity binders for arbitrary protein targets on demand, without extensive rounds of wet-lab testing, remains a significant challenge. Here, we introduce PPDiff, a diffusion model to jointly design the sequence and structure of binders for arbitrary protein targets in a non-autoregressive manner. PPDiff builds upon our developed Sequence Structure Interleaving Network with Causal attention layers (SSINC), which integrates interleaved self-attention layers to capture global amino acid correlations, $k$-nearest neighbor ($k$NN) equivariant graph convolutional layers to model local interactions in three-dimensional (3D) space, and causal attention layers to simplify the intricate interdependencies within the protein sequence. To assess PPDiff, we curate PPBench, a general protein-protein complex dataset comprising 706,360 complexes from the Protein Data Bank (PDB). The model is pretrained on PPBench and finetuned on two real-world applications: target-protein mini-binder complex design and antigen-antibody complex design. PPDiff consistently surpasses baseline methods, achieving success rates of 50.00\%, 23.16\%, and 16.89\% for the pretraining task and the two downstream applications, respectively.
Paperid:1131
Authors:Yizhou Zhang · Yian Ma · Eric Mazumdar
Abstract:
We consider the problem of learning to exploit learning algorithms through repeated interactions in games. Specifically, we focus on the case of repeated two player, finite-action games, in which an optimizer aims to steer a no-regret learner to a Stackelberg equilibrium without knowledge of its payoffs. We first show that this is impossible if the optimizer only knows that the learner is using an algorithm from the general class of no-regret algorithms. This suggests that the optimizer requires more information about the learner's objectives or algorithm to successfully exploit them. Building on this intuition, we reduce the problem for the optimizer to that of recovering the learner's payoff structure. We demonstrate the effectiveness of this approach if the learner’s algorithm is drawn from a smaller class by analyzing two examples: one where the learner uses an ascent algorithm, and another where the learner uses stochastic mirror ascent with known regularizer and step sizes.
Paperid:1132
Authors:Qinggang Zhang · Hao Chen · Junnan Dong · Shengyuan Chen · Feiran Huang · Xiao Huang
Abstract:
Recent advancements in large language models (LLMs) have shown promise in bridging the gap between natural language queries and database management systems, enabling users to interact with databases without a background in SQL. However, LLMs often struggle to fully exploit and comprehend the user intention and complex structures of databases. Decomposition-based methods have been proposed to enhance the performance of LLMs on complex tasks, but decomposing SQL generation into subtasks is non-trivial due to the declarative structure of SQL syntax and the intricate connections between query concepts and database elements. In this paper, we propose a novel Structure GUided text-to-SQL framework (SGU-SQL) that incorporates syntax-based prompting to enhance the SQL generation capabilities of LLMs. Specifically, SGU-SQL establishes structure-aware links between user queries and database schema and recursively decomposes the complex generation task using syntax-based prompting to guide LLMs in incrementally constructing target SQLs. Extensive experiments on two benchmark datasets demonstrate that SGU-SQL consistently outperforms state-of-the-art text-to-SQL baselines.
Paperid:1133
Authors:Yik Siu Chan · Narutatsu Ri · Yuxin Xiao · Marzyeh Ghassemi
Abstract:
Despite extensive safety alignment efforts, large language models (LLMs) remain vulnerable to jailbreak attacks that elicit harmful behavior. While existing studies predominantly focus on attack methods that require technical expertise, two critical questions remain underexplored: (1) Are jailbroken responses truly useful in enabling average users to carry out harmful actions? (2) Do safety vulnerabilities exist in more common, simple human-LLM interactions? In this paper, we demonstrate that LLM responses most effectively facilitate harmful actions when they are both *actionable* and *informative*---two attributes easily elicited in multi-step, multilingual interactions. Using this insight, we propose HarmScore, a jailbreak metric that measures how effectively an LLM response enables harmful actions, and Speak Easy, a simple multi-step, multilingual attack framework. Notably, by incorporating Speak Easy into direct request and jailbreak baselines, we see an average absolute increase of $0.319$ in Attack Success Rate and $0.426$ in HarmScore in both open-source and proprietary LLMs across four safety benchmarks. Our work reveals a critical yet often overlooked vulnerability: Malicious users can easily exploit common interaction patterns for harmful intentions.
Paperid:1134
Authors:Huayu Chen · Kai Jiang · Kaiwen Zheng · Jianfei Chen · Hang Su · Jun Zhu
Abstract:
Classifier-Free Guidance (CFG) has been a default technique in various visual generative models, yet it requires inference from both conditional and unconditional models during sampling. We propose to build visual models that are free from guided sampling. The resulting algorithm, Guidance-Free Training (GFT), matches the performance of CFG while reducing sampling to a single model, halving the computational cost. Unlike previous distillation-based approaches that rely on pretrained CFG networks, GFT enables training directly from scratch. GFT is simple to implement. It retains the same maximum likelihood objective as CFG and differs mainly in the parameterization of conditional models. Implementing GFT requires only minimal modifications to existing codebases, as most design choices and hyperparameters are directly inherited from CFG. Our extensive experiments across five distinct visual models demonstrate the effectiveness and versatility of GFT. Across domains of diffusion, autoregressive, and masked-prediction modeling, GFT consistently achieves comparable or even lower FID scores, with similar diversity-fidelity trade-offs compared with CFG baselines, all while being guidance-free.
Paperid:1135
Authors:Chi-Ning Chou · Hang Le · Yichen Wang · SueYeon Chung
Abstract:
The ability to integrate task-relevant information into neural representations is a fundamental aspect of both human and machine intelligence. Recent theoretical studies have focused on investigating whether a network has learned task-relevant features (i.e., the rich regime) or not (i.e., the lazy regime, where the trained network is equivalent to a random feature model). However, the simple lazy-versus-rich dichotomy overlooks the potential for various subtypes of feature learning driven by variations in learning algorithms, network architectures, and data properties. In this work, we present a framework to study feature learning through the representational geometry of neural manifolds. Instead of directly analyzing the learned features, we focus on characterizing how task-relevant manifolds evolve during feature learning. In particular, we find in both theoretical and empirical settings that task-relevant manifolds untangle when a network learns features, and the changes in manifold geometry uncover distinct learning stages and strategies beyond the lazy-versus-rich dichotomy. Our approach provides novel insights into feature learning across neuroscience and machine learning, including structural inductive biases in neural circuits and out-of-distribution generalization.
Paperid:1136
Authors:Rachith Aiyappa · Xin Wang · Munjung Kim · Ozgur Can Seckin · Yong-Yeol Ahn · Sadamori Kojaku
Abstract:
Link prediction---a task of distinguishing actual hidden edges from random unconnected node pairs---is one of the quintessential tasks in graph machine learning. Despite being widely accepted as a universal benchmark and a downstream task for representation learning, the link prediction benchmark's validity has rarely been questioned. Here, we show that the common edge sampling procedure in the link prediction task has an implicit bias toward high-degree nodes. This produces a highly skewed evaluation that favors methods overly dependent on node degree. In fact, a ``null'' link prediction method based solely on node degree can yield nearly optimal performance in this setting. We propose a degree-corrected link prediction benchmark that offers a more reasonable assessment and better aligns with the performance on the recommendation task. Finally, we demonstrate that the degree-corrected benchmark can more effectively train graph machine-learning models by reducing overfitting to node degrees and facilitating the learning of relevant structures in graphs.
Paperid:1137
Authors:Xinxu Wei · Kanhao Zhao · Yong Jiao · Hua Xie · Lifang He · Yu Zhang
Abstract:
Effectively utilizing extensive unlabeled high-density EEG data to improve performance in scenarios with limited labeled low-density EEG data presents a significant challenge. In this paper, we address this challenge by formulating it as a graph transfer learning and knowledge distillation problem. We propose a Unified Pre-trained Graph Contrastive Masked Autoencoder Distiller, named EEG-DisGCMAE, to bridge the gap between unlabeled and labeled as well as high- and low-density EEG data. Our approach introduces a novel unified graph self-supervised pre-training paradigm, which seamlessly integrates the graph contrastive pre-training with the graph masked autoencoder pre-training. Furthermore, we propose a graph topology distillation loss function, allowing a lightweight student model trained on low-density data to learn from a teacher model trained on high-density data during pre-training and fine-tuning. This method effectively handles missing electrodes through contrastive distillation. We validate the effectiveness of EEG-DisGCMAE across four classification tasks using two clinical EEG datasets with abundant data.
Paperid:1138
Authors:Philipp Höllmer · Thomas Egg · Maya Martirossyan · Eric Fuemmeler · Amit Gupta · Zeren Shui · Pawan Prakash · Adrian Roitberg · Mingjie Liu · George Karypis · Mark Transtrum · Richard Hennig · Ellad Tadmor · Stefano Martiniani
Abstract:
The discovery of new materials is essential for enabling technological advancements. Computational approaches for predicting novel materials must effectively learn the manifold of stable crystal structures within an infinite design space. We introduce Open Materials Generation (OMG), a state-of-the-art framework for the generative design and discovery of inorganic crystalline materials. OMG employs stochastic interpolants (SI) to bridge an arbitrary base distribution to the target distribution of inorganic crystals via a broad class of tunable stochastic processes, encompassing both diffusion models and flow matching as special cases. In this work, we adapt the SI framework by integrating an equivariant graph representation of crystal structures and extending it to account for periodic boundary conditions in unit cell representations. Additionally, we couple the SI flow over spatial coordinates and lattice vectors with discrete flow matching for atomic species. We benchmark OMG's performance on two tasks: Crystal Structure Prediction (CSP) for specified compositions and ‘de novo’ generation (DNG) aimed at discovering stable, novel, and unique structures. We validate our generated structures by performing structural relaxation with machine-learned force fields. In our ground-up implementation of OMG, we refine and extend both CSP and DNG metrics compared to previous works. OMG establishes a new state-of-the-art in generative modeling for materials discovery, outperforming purely flow-based and diffusion-based implementations. These results underscore the importance of designing flexible deep learning frameworks to accelerate progress in materials science.
Paperid:1139
Authors:Puning Zhao · Rongfei Fan · Shaowei Wang · Li Shen · Qixin Zhang · Zong Ke · Tianhang Zheng
Abstract:
Nonparametric contextual bandit is an important model of sequential decision making problems. Under the $\alpha$-Tsybakov margin condition, existing research has established a regret bound of $\tilde{O}\left(T^{1-\frac{\alpha+1}{d+2}}\right)$ for bounded supports. However, the optimal regret with unbounded contexts has not been analyzed. The challenge of solving contextual bandit problems with unbounded support is to achieve both exploration-exploitation tradeoff and bias-variance tradeoff simultaneously. In this paper, we solve the nonparametric contextual bandit problem with unbounded contexts. We propose two nearest neighbor methods combined with UCB exploration. The first method uses a fixed $k$. Our analysis shows that this method achieves minimax optimal regret under a weak margin condition and relatively light-tailed context distributions. The second method uses adaptive $k$. By a proper data-driven selection of $k$, this method achieves an expected regret of $\tilde{O}\left(T^{1-\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta}}+T^{1-\beta}\right)$, in which $\beta$ is a parameter describing the tail strength. This bound matches the minimax lower bound up to logarithm factors, indicating that the second method is approximately optimal.
Paperid:1140
Authors:Thomas Lee · William Toner · Rajkarn Singh · Artjom Joosen · Martin Asenov
Abstract:
Foundation models (FMs) have emerged as a promising approach for time series forecasting. While effective, FMs typically remain fixed during deployment due to the high computational costs of learning them online. Consequently, deployed FMs fail to adapt their forecasts to current data characteristics, despite the availability of online feedback from newly arriving data. This raises the question of whether FM performance can be enhanced by the efficient usage of this feedback. We propose AdapTS to answer this question. AdapTS is a lightweight mechanism for the online adaptation of FM forecasts in response to online feedback. AdapTS consists of two parts: a) the AdapTS-Forecaster, which is used to learn the current data distribution; and b) the AdapTS-Weighter, which is used to combine the forecasts of the FM and the AdapTS-Forecaster. We evaluate the performance of AdapTS in conjunction with several recent FMs across a suite of standard time series datasets. In all of our experiments we find that using AdapTS improves performance. This work demonstrates how efficient usage of online feedback can be used to improve FM forecasts.
Paperid:1141
Authors:Aoran Wang · Xinnan Dai · Jun Pang
Abstract:
Existing methods for inferring latent relational structures struggle to integrate partial prior knowledge, such as known edges, node-degree constraints, and global sparsity, without destabilizing training or conflicting with probabilistic assumptions. We propose Soft-Gated Structural Inference (SGSI), a VAE framework that seamlessly incorporates domain constraints via (1) soft gating with learnable edge masks to preserve gradients, (2) cloning-clamping of deterministic edges to avoid distributional conflicts, and (3) adaptive regularization to balance data-driven learning with domain constraints. By excluding known edges from stochastic inference, SGSI reallocates capacity to uncertain interactions, optimizing the information bottleneck trade-off. Experiments on 16 datasets show SGSI improves edge recovery by up to $9$\% AUROC over baselines, scales to larger graphs ($94.2$\% AUROC), and maintains stable training. SGSI bridges domain expertise with data-driven learning, enabling interpretable and robust structural discovery in dynamical systems.
Paperid:1142
Authors:Dejia Xu · Yifan Jiang · Chen Huang · Liangchen Song · Thorsten Gernoth · Liangliang Cao · Zhangyang “Atlas” Wang · Hao Tang
Abstract:
In recent years there have been remarkable breakthroughs in image-to-video generation. However, the 3D consistency and camera controllability of generated frames have remained unsolved. Recent studies have attempted to incorporate camera control into the generation process, but their results are often limited to simple trajectories or lack the ability to generate consistent videos from multiple distinct camera paths for the same scene. To address these limitations, we introduce Cavia, a novel framework for camera-controllable, multi-view video generation, capable of converting an input image into multiple spatiotemporally consistent videos. Our framework extends the spatial and temporal attention modules into view-integrated attention modules, improving both viewpoint and temporal consistency. This flexible design allows for joint training with diverse curated data sources, including scene-level static videos, object-level synthetic multi-view dynamic videos, and real-world monocular dynamic videos. To the best of our knowledge, Cavia is the first framework that enables users to generate multiple videos of the same scene with precise control over camera motion, while simultaneously preserving object motion. Extensive experiments demonstrate that Cavia surpasses state-of-the-art methods in terms of geometric consistency and perceptual quality.
Paperid:1143
Authors:Shuang Zeng · Yunwen Lei
Abstract:
Decentralized SGD (D-SGD) is a popular optimization method to train large-scale machine learning models. In this paper, we study the generalization behavior of D-SGD for both smooth and nonsmooth problems by leveraging the algorithm stability. For convex and smooth problems, we develop stability bounds involving the training errors to show the benefit of optimization in generalization. This improves the existing results by removing the Lipschitzness assumption and implying fast rates in a low-noise condition. We also develop the first optimal stability-based generalization bounds for D-SGD applied to nonsmooth problems. We further develop optimization error bounds which imply minimax optimal excess risk rates. Our novelty in the analysis consists of an error decomposition to use the co-coercivity of functions as well as the control of a neighboring-consensus error.
Paperid:1144
Authors:HamidReza Imani · Jiaxin Peng · Peiman Mohseni · Abdolah Amirany · Tarek El-Ghazawi
Abstract:
The deployment of mixture-of-experts (MoE) large language models (LLMs) presents significant challenges due to their high memory demands and computational complexity. These challenges are further exacerbated in multi-tenant environments, where shared resources must accommodate multiple models, limiting the effectiveness of conventional virtualization techniques. This paper addresses the problem of efficiently serving multiple fine-tuned MoE-LLMs on a single GPU. We propose a novel serving system that employs \textit{similarity-based expert consolidation} to reduce memory overhead by sharing similar experts across models. To ensure output quality, we introduce \textit{runtime partial reconfiguration}, dynamically replacing non-expert layers when processing requests from different models. As a result, our approach achieves a competitive output quality while maintaining throughput comparable to serving a single model while incurring only a negligible increase in time-to-first-token (TTFT). Experiments on a server with a single NVIDIA A100 GPU (80GB) using Mixtral-8x7B models demonstrate an 85\% average reduction in turnaround time compared to NVIDIA’s multi-instance GPU (MIG), highlighting the efficiency of our approach.
Paperid:1145
Authors:Tian Bai · Yue Zhao · Xiang Yu · Archer Yang
Abstract:
Selecting high-quality candidates from large datasets is critical in applications such as drug discovery, precision medicine, and alignment of large language models (LLMs). While Conformal Selection (CS) provides rigorous uncertainty quantification, it is limited to univariate responses and scalar criteria. To address this, we propose Multivariate Conformal Selection (mCS), a generalization of CS designed for multivariate response settings. Our method introduces regional monotonicity and employs multivariate nonconformity scores to construct conformal $p$-values, enabling finite-sample False Discovery Rate (FDR) control. We present two variants: $\texttt{mCS-dist}$, using distance-based scores, and $\texttt{mCS-learn}$, which learns optimal scores via differentiable optimization. Experiments on simulated and real-world datasets demonstrate that mCS significantly improves selection power while maintaining FDR control, establishing it as a robust framework for multivariate selection tasks.
Paperid:1146
Authors:Yusong Wang · Shiyin Tan · Jialun Shen · Yicheng Xu · Haobo Song · Qi Xu · Prayag Tiwari · Mingkun Xu
Abstract:
Graph Contrastive Learning (GCL) improves Graph Neural Network (GNN)-based protein representation learning by enhancing its generalization and robustness. Existing GCL approaches for protein representation learning rely on 2D topology, where graph augmentation is solely based on topological features, ignoring the intrinsic biological properties of proteins. Besides, 3D structure-based protein graph augmentation remains unexplored, despite proteins inherently exhibiting 3D structures. To bridge this gap, we propose novel biology-aware graph augmentation strategies from the perspective of invariance and integrate them into the protein GCL framework. Specifically, we introduce Functional Community Invariance (FCI)-based graph augmentation, which employs spectral constraints to preserve topology-driven community structures while incorporating residue-level chemical similarity as edge weights to guide edge sampling and maintain functional communities. Furthermore, we propose 3D Protein Structure Invariance (3-PSI)-based graph augmentation, leveraging dihedral angle perturbations and secondary structure rotations to retain critical 3D structural information of proteins while diversifying graph views. Extensive experiments on four different protein-related tasks demonstrate the superiority of our proposed GCL protein representation learning framework.
Paperid:1147
Authors:Benjamin Feuer · Chinmay Hegde
Abstract:
Language model (LLM) post-training can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data generating models and LLM judges. To close this gap, we introduce WildChat-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating Re-Wild, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples.
Paperid:1148
Authors:Sebastian Farquhar · Vikrant Varma · David Lindner · David Elson · Caleb Biddulph · Ian Goodfellow · Rohin Shah
Abstract:
Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate. We propose a training method which avoids agents learning undesired multi-step plans that receive high reward (multi-step "reward hacks") even if humans are not able to detect that the behavior is undesired. The method, Myopic Optimization with Non-myopic Approval (MONA), works by combining short-sighted optimization with far-sighted reward. We demonstrate that MONA can prevent multi-step reward hacking that ordinary RL causes, even without being able to detect the reward hacking and without any extra information that ordinary RL does not get access to. We study MONA empirically in three settings which model different misalignment failure modes, including 2-step environments with LLMs representing delegated oversight and encoded reasoning, and longer-horizon gridworld environments representing sensor tampering.
Paperid:1149
Authors:Taiyu Ban · Changxin Rong · Xiangyu Wang · Lyuzhou Chen · Xin Wang · Derui Lyu · Qinrui Zhu · Huanhuan Chen
Abstract:
Differentiable structure learning of causal directed acyclic graphs (DAGs) is an emerging field in causal discovery, leveraging powerful neural learners. However, the incorporation of ancestral constraints, essential for representing abstract prior causal knowledge, remains an open research challenge. This paper addresses this gap by introducing a generalized framework for integrating ancestral constraints. Specifically, we identify two key issues: the non-equivalence of relaxed characterizations for representing path existence and order violations among paths during optimization. In response, we propose a binary-masked characterization method and an order-guided optimization strategy, tailored to address these challenges. We provide theoretical justification for the correctness of our approach, complemented by experimental evaluations on both synthetic and real-world datasets.
Paperid:1150
Authors:Rui Yao · Constantinos Daskalakis · Vardis Kandiros
Abstract:
We study the problem of learning the topology of a directed Gaussian Graphical Model under the equal-variance assumption, where the graph has $n$ nodes and maximum in-degree $d$. Prior work has established that $O(d \log n)$ samples are sufficient for this task. However, an important factor that is often overlooked in these analyses is the dependence on the condition number of the covariance matrix of the model. Indeed, all algorithms from prior work require a number of samples that grows polynomially with this condition number. In many cases this is unsatisfactory, since the condition number could grow polynomially with $n$, rendering these prior approaches impractical in high-dimensional settings. In this work, we provide an algorithm that recovers the underlying graph and prove that the number of samples required is independent of the condition number. Furthermore, we establish lower bounds that nearly match the upper bound up to a $d$-factor, thus providing an almost tight characterization of the true sample complexity of the problem. Moreover, under a further assumption that all the variances of the variables are bounded, we design a polynomial-time algorithm that recovers the underlying graph, at the cost of an additional polynomial dependence of the sample complexity on $d$. We complement our theoretical findings with simulations on synthetic datasets that confirm our predictions.
Paperid:1151
Authors:Mykhailo Uss · Ruslan Yermolenko · Oleksii Shashko · Olena Kolodiazhna · Ivan Safonov · Volodymyr Savin · Yoonjae Yeo · Seowon Ji · Jaeyun Jeong
Abstract:
Dense depth prediction deep neural networks (DNNs) have achieved impressive results for both monocular and binocular data, but they are limited by high computational complexity, restricting their use on low-end devices. For better on-device efficiency and hardware utilization, weights and activations of the DNN should be converted to low-bit precision. However, this precision is not sufficient for representing high dynamic range depth. In this paper, we aim to overcome this limitation and restore high-precision depth from low-bit precision predictions. To achieve this, we propose to represent high dynamic range depth as two low dynamic range components of a Hilbert curve, and to train the full precision DNN to directly predict the latter. For on-device deployment, we use standard quantization methods and add a post-processing step that reconstructs depth from the Hilbert curve components predicted in low-bit precision. Extensive experiments demonstrate that our method increases bit precision of predicted depth by up to three bits with little computational overhead. We also observe a positive side effect of quantization error reduction by up to five times. Our method enables effective and accurate depth prediction with DNN weights and activations quantized to eight-bit precision.
Paperid:1152
Authors:Yeyun Chen · Jiangming Shi
Abstract:
The progress in materials science and drug discovery is impeded by the limited availability of labeled data and the high costs of manual annotation, driving the need for efficient strategies to capture molecular representations and enable accurate predictions. Pretrained Graph Neural Networks have shown promise in capturing universal molecular representations, but adapting them to task-specific applications remains challenging. In this paper, we propose Multi-level Informed Prompt-Tuning (MIPT), a novel framework for effectively tailoring pretrained models to molecule-related tasks. MIPT utilizes a lightweight, multi-level prompt learning module to capture node-level and graph-level task-specific knowledge, ensuring adaptable and efficient tuning. Additionally, a noise penalty mechanism is introduced to address mismatches between pretrained representations and downstream tasks, reducing irrelevant or noisy information. Experimental results show that MIPT surpasses all baselines, aligning graph space and task space while achieving significant improvements in molecule-related tasks, demonstrating its scalability and versatility for molecular tasks.
Paperid:1153
Authors:Yunli Wang · Zhen Zhang · Zhiqiang Wang · Zixuan Yang · Yu Li · Jian Yang · Shiyang Wen · Peng Jiang · Kun Gai
Abstract:
Cascade Ranking is a prevalent architecture in large-scale top-k selection systems like recommendation and advertising platforms. Traditional training methods focus on single-stage optimization, neglecting interactions between stages. Recent advances such as RankFlow and FS-LTR have introduced interaction-aware training paradigms but still struggle to align training objectives across the entire cascade and ensure robust performance for models of varying capacities. To address these challenges, we propose LCRON, which introduces a novel surrogate loss function derived from the lower bound probability that ground truth items are selected by cascade ranking, ensuring alignment with the overall objective of the system. According to the properties of the derived bound, we further design an auxiliary loss for each stage to drive the reduction of this bound, leading to a more robust and effective top-k selection. LCRON enables end-to-end training of the entire cascade ranking system as a unified network. Experimental results demonstrate that LCRON achieves significant improvement over existing methods on public benchmarks and industrial applications, addressing key limitations in cascade ranking training and significantly enhancing system performance.
Paperid:1154
Authors:Mingquan Feng · Weixin Liao · Yixin Huang · Yifan Fu · Qifu Zheng · Junchi Yan
Abstract:
Neural operators have emerged as a promising approach for solving high-dimensional partial differential equations (PDEs). However, existing neural operators often have difficulty in dealing with constrained PDEs, where the solution must satisfy additional equality or inequality constraints beyond the governing equations. To close this gap, we propose a novel neural operator, Hyper Extended Adaptive PDHG (HEAP) for constrained high-dim PDEs, where the learned operator evolves in the parameter space of PDEs. We first show that the evolution operator learning can be formulated as a quadratic programming (QP) problem, then unroll the adaptive primal-dual hybrid gradient (APDHG) algorithm as the QP-solver into the neural operator architecture. This allows us to improve efficiency while retaining theoretical guarantees of the constrained optimization. Empirical results on a variety of high-dim PDEs show that HEAP outperforms the state-of-the-art neural operator model.
Paperid:1155
Authors:Ben Lonnqvist · Elsa Scialom · Abdulkadir Gokce · Zehra Merchant · Michael Herzog · Martin Schrimpf
Abstract:
Despite the tremendous success of deep learning in computer vision, models still fall behind humans in generalizing to new input distributions. Existing benchmarks do not investigate the specific failure points of models by analyzing performance under many controlled conditions. Our study systematically dissects where and why models struggle with contour integration -- a hallmark of human vision -- by designing an experiment that tests object recognition under various levels of object fragmentation. Humans (n=50) perform at high accuracy, even with few object contours present. This is in contrast to models, which exhibit substantially lower sensitivity to increasing object contours, with most of the over 1,000 models we tested barely performing above chance. Only at very large scales ($\sim5B$ training dataset size) do models begin to approach human performance. Importantly, humans exhibit an integration bias -- a preference towards recognizing objects made up of directional fragments over directionless fragments. We find that not only do models that share this property perform better at our task, but that this bias also increases with model training dataset size, and training models to exhibit contour integration leads to high shape bias. Taken together, our results suggest that contour integration is a hallmark of object vision that underlies object recognition performance, and may be a mechanism learned from data at scale.
Paperid:1156
Authors:Ajay Jaiswal · Yifan Wang · Lu Yin · Shiwei Liu · Runjin Chen · Jiawei Zhao · Ananth Grama · Yuandong Tian · Zhangyang “Atlas” Wang
Abstract:
The weight matrices of Large Language Models (LLMs) can often be expressed in low-rank format, with the potential to relax memory and compute resource requirements. Unlike previous works which pivot around developing novel matrix decomposition algorithms, in this work we focus on studying the emerging non-uniform low-rank properties across weight matrices in LLMs through the lens of stabilizing gradient subspace. \textit{Firstly,} we provide a theoretical framework to understand the stabilization of gradient subspaces through Hessian analysis. \textit{Secondly,} we empirically establish a consequential relationship between the gradient dynamics and low-rank expressiveness of weight matrices. Our findings reveal that different LLM components exhibit varying levels of converged low-rank structure, necessitating a non-uniform rank reduction across them to minimize performance drop due to compression. In view of that, we present \textit{Weight Low-Rank Projection} \textbf{(WeLore)} that unifies weight compression and memory-efficient fine-tuning as ONE, in a data-agnostic and one-shot way. Going beyond only as a compression technique, WeLore categorizes weight matrices into Low-rank Components (LRCs) and Non-Low-rank Components (N-LRCs) based on their ability to express themselves as low-rank. Our gradient dynamics perspective illustrates that \textit{LRCs tend to have better finetuning capabilities} and their standalone finetuning can closely mimic (sometimes outperform) the training loss trajectory and performance of full-finetuning with notable memory and compute footprint reduction. All codes and checkpoints will be released.
Paperid:1157
Authors:Mislav Balunovic · Jasper Dekoninck · Nikola Jovanović · Ivo Petrov · Martin Vechev
Abstract:
While Large Language Models (LLMs) demonstrate impressive performance in mathematics, existing math benchmarks come with significant limitations. Many focus on problems with fixed ground-truth answers, and are often saturated due to problem simplicity or the viability of guessing or memorization. Crucially, they capture only a narrow subset of relevant math problems. To address this research gap, we introduce MathConstruct, a new benchmark of 127 challenging problems sourced from various math competitions, which targets constructive proofs, a widely encountered problem type requiring the construction of mathematical objects with specific properties. These proofs are particularly suitable for LLM evaluation, as solution correctness can be easily verified. Our automated verifiers also enable MathConstruct to generate problem variations, used to evaluate robustness. State-of-the-art LLMs solve only 41\% of MathConstruct problems, highlighting its complexity and importance for LLM evaluation.
Paperid:1158
Authors:Zicheng Lin · Tian Liang · Jiahao Xu · Qiuzhi Liu · Xing Wang · Ruilin Luo · Chufan Shi · Siheng Li · Yujiu Yang · Zhaopeng Tu
Abstract:
Mathematical reasoning tasks pose significant challenges for large language models (LLMs) because they require precise logical deduction and sequence analysis. In this work, we introduce the concept of critical tokens: elements within reasoning trajectories that significantly influence incorrect outcomes. We present a novel framework for identifying these tokens through rollout sampling and demonstrate their substantial divergence from traditional error tokens. Through extensive experiments on datasets such as GSM8K and MATH500, we show that identifying and replacing critical tokens significantly improves model accuracy. We propose an efficient methodology for pinpointing these tokens in large-scale datasets using contrastive estimation and extend this framework to enhance model training processes with direct preference optimization (DPO). Experimental results on GSM8K and MATH500 benchmarks with the widely used models Llama-3 (8B and 70B) and Deepseek-math (7B) demonstrate the effectiveness of the proposed approach, cDPO. Our results underscore the potential of leveraging critical tokens to reduce errors in reasoning tasks, advancing the development of AI systems capable of robust logical deduction.
Paperid:1159
Authors:Haozhao Wang · Shengyu Wang · Jiaming Li · Hao Ren · Xingshuo Han · Wenchao Xu · Shangwei Guo · Tianwei Zhang · Ruixuan Li
Abstract:
Semi-supervised Federated Learning (SSFL) is a promising approach that allows clients to collaboratively train a global model in the absence of local data labels. The key step of SSFL is re-labeling, where each client adopts two types of available models, namely global and local models, to re-label the local data. While various techniques, such as using the global model or the average of the two models, have been proposed to conduct the re-labeling step, little literature delves deeply into the performance dominance and limitations of the two models. In this paper, we first theoretically and empirically demonstrate that the local model achieves higher re-labeling accuracy over local data while the global model can progressively improve the re-labeling performance by introducing the extra data knowledge of other clients. Based on these findings, we propose BSemiFL, which re-labels the local data through collaboration between the local and global models in a Bayesian manner. Specifically, to re-label any given local sample, BSemiFL first uses Bayesian inference to assess the closeness of the local/global model to the sample. Then, it applies a weighted combination of their pseudo labels, using the closeness as the weights. Theoretical analysis shows that the labeling error of our method is smaller than that of simply using the global model, the local model, or their simple average. Experimental results show that BSemiFL improves performance by up to $9.8\%$ compared to state-of-the-art methods.
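The closeness-weighted combination step described above can be sketched generically. The closeness scores and their normalization below are illustrative assumptions, not BSemiFL's exact Bayesian inference:

```python
import numpy as np

def combine_pseudo_labels(p_local, p_global, closeness_local, closeness_global):
    """Convex combination of local/global pseudo-label distributions.

    closeness_* are nonnegative scores for how "close" each model is
    judged to be to the sample; here they are simply normalized into
    convex weights (an assumption for illustration).
    """
    w_local = closeness_local / (closeness_local + closeness_global)
    w_global = 1.0 - w_local
    return w_local * np.asarray(p_local) + w_global * np.asarray(p_global)

# Toy example: the local model is judged closer for this sample,
# so its pseudo label dominates the combination.
p = combine_pseudo_labels([0.9, 0.1], [0.4, 0.6],
                          closeness_local=3.0, closeness_global=1.0)
```

The combined vector remains a valid probability distribution because the weights are convex.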
Paperid:1160
Authors:Euijin You · Hyang-Won Lee
Abstract:
Fast adversarial training (FAT) aims to enhance the robustness of models against adversarial attacks with reduced training time. However, FAT often suffers from compromised robustness due to insufficient exploration of the adversarial space. In this paper, we develop a loss function to mitigate the problem of degraded robustness under FAT. Specifically, we derive a quadratic upper bound (QUB) on the adversarial training (AT) loss function and propose to utilize the bound with existing FAT methods. Our experimental results show that applying the QUB loss to the existing methods yields significant improvement of robustness. Furthermore, using various metrics, we demonstrate that this improvement is likely to result from the smoothed loss landscape of the resulting model.
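A quadratic bound of this shape can in general be obtained from smoothness alone; the display below is an illustrative second-order surrogate under an assumed $K$-smoothness of the loss in the input, not necessarily the paper's exact QUB:

```latex
% If \ell(\cdot) is K-smooth in its input, then for a perturbation \delta:
\ell(x+\delta) \;\le\; \ell(x) + \nabla_x \ell(x)^\top \delta + \tfrac{K}{2}\,\lVert \delta \rVert_2^2
% Minimizing the right-hand side upper-bounds the adversarial loss
% without extra inner-maximization steps.
```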
Paperid:1161
Authors:Jinyang Li · Nan Huo · Yan Gao · Jiayi Shi · Yingxiu Zhao · Qu Ge · Bowen Qin · Yurong Wu · Xiaodong Li · Chenhao Ma · Jian-Guang Lou · Reynold Cheng
Abstract:
Conversational Tabular Data Analysis, a collaboration between humans and machines, enables real-time data exploration for informed decision-making. The challenges and costs of collecting realistic conversational logs for tabular data analysis hinder comprehensive quantitative evaluation of Large Language Models (LLMs) in this task. To mitigate this issue, we introduce CoTA, a new benchmark to evaluate LLMs on conversational tabular data analysis. CoTA contains 1013 conversations, covering 4 practical scenarios: Normal, Action, Private, and Private Action. Notably, CoTA is constructed by an economical multi-agent environment, Decision Company, with little human effort. This environment ensures the efficiency and scalability of generating new conversational data. Our comprehensive study, conducted by data analysis experts, demonstrates that Decision Company is capable of producing diverse and high-quality data, laying the groundwork for efficient data annotation. We evaluate popular and advanced LLMs on CoTA, which highlights the challenges of conversational tabular data analysis. Furthermore, we propose Adaptive Conversation Reflection (ACR), a self-generated reflection strategy that guides LLMs to learn from successful histories. Experiments demonstrate that ACR can evolve LLMs into effective conversational data analysis agents, achieving a relative performance improvement of up to 35.14%.
Paperid:1162
Authors:Chris Hays · Manish Raghavan
Abstract:
Researchers and practitioners often wish to conduct causal inference in settings where units interfere via markets and recommendation systems. In these settings, units are affected by certain shared state variables, like prices, algorithmic recommendations or social signals. We formalize this structure, calling it shared-state interference, and argue that our formulation captures many relevant applied settings. Our key modeling assumption is that individuals' potential outcomes are independent conditional on the shared state. We then prove extensions of double machine learning (DML) theorems providing conditions for achieving efficient inference under shared-state interference. We also instantiate our general theorem in several models of interest where it is possible to efficiently estimate the average direct effect (ADE) or global average treatment effect (GATE).
Paperid:1163
Authors:Fei Long · Xiaoou Li · jiaming Lv · Yang Haoyuan · Xianjun Cheng · Peihua Li
Abstract:
Bridging contrastive language-image pre-training (CLIP) to video action recognition has recently attracted significant interest. Human actions are inherently rich in spatial and temporal contexts, involving dynamic interactions among people, objects, and the environment. Accurately recognizing actions requires effectively capturing these fine-grained elements in videos and modeling their relationships with language. However, most existing methods rely on cosine similarity between global tokens for video-language alignment, which has limited ability to model complex relations while overlooking local tokens that encode fine-grained spatio-temporal cues. To address these downsides, we propose a novel framework called BDC-CLIP, whose key idea is Brownian Distance Covariance (BDC) for aligning video and language. BDC-CLIP leverages all visual and textual tokens for modeling both linear and non-linear relations in multimodal embedding space, thereby capturing rich contexts in space, time and language. Our BDC-CLIP achieves state-of-the-art performance across zero-shot, few-shot, base-to-novel, and fully-supervised action recognition, demonstrating its effectiveness and versatility.
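The empirical (squared) distance covariance underlying BDC has a standard closed form. The sketch below computes it for paired sample embeddings and is a generic illustration of the statistic, not the BDC-CLIP alignment head itself:

```python
import numpy as np

def bdc(x, y):
    """Empirical Brownian distance covariance between paired samples.

    x: (n, d1), y: (n, d2). Double-centers the pairwise Euclidean
    distance matrices and averages their elementwise product. The
    statistic captures non-linear as well as linear dependence and is
    zero (in the population limit) iff x and y are independent.
    """
    def centered(z):
        z = np.asarray(z, dtype=float)
        d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
        # Subtract row means, column means; add back the grand mean.
        return d - d.mean(axis=0) - d.mean(axis=1, keepdims=True) + d.mean()
    return (centered(x) * centered(y)).mean()
```

For example, `bdc(v, v)` is strictly positive for any non-degenerate sample, while a constant second argument yields exactly zero (its distance matrix is all zeros).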
Paperid:1164
Authors:Haojun Chen · Minghao Liu · Xiaojian Ma · Zailin Ma · Huimin Wu · Chengdong Ma · Yuanpei Chen · Yifan Zhong · Mingzhi Wang · Qing Li · Yaodong Yang
Abstract:
Diffusion policies are widely adopted in complex visuomotor tasks for their ability to capture multimodal action distributions. However, the multiple sampling steps required for action generation significantly harm real-time inference efficiency, which limits their applicability in long-horizon tasks and real-time decision-making scenarios. Existing acceleration techniques reduce sampling steps by approximating the original denoising process but inevitably introduce unacceptable performance loss. Here we propose Falcon, which mitigates this trade-off and achieves further acceleration. The core insight is that visuomotor tasks exhibit sequential dependencies between actions at consecutive time steps. Falcon leverages this property to avoid denoising from a standard normal distribution at each decision step. Instead, it starts denoising from partially denoised actions derived from a history buffer, significantly reducing the number of denoising steps, while incorporating current observations to achieve performance-preserving acceleration of action generation. Importantly, Falcon is a training-free algorithm that can be applied as a plug-in to further improve decision efficiency on top of existing acceleration techniques. We validate Falcon in 46 simulated environments, demonstrating a 2-7x speedup with negligible performance degradation, offering a promising direction for efficient visuomotor policy design.
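The buffer-initialized start of the reverse process can be sketched in DDPM-style notation. The schedule handling and names below are illustrative assumptions, not Falcon's implementation:

```python
import numpy as np

def partial_denoise_start(prev_action, t_start, alphas_cumprod, rng):
    """Re-noise the previous action to diffusion step t_start, so the
    reverse process starts there rather than from N(0, I) at the final
    step. With t_start < T, fewer denoising steps run per decision.
    """
    a_bar = alphas_cumprod[t_start]  # cumulative product of alphas at t_start
    noise = rng.standard_normal(np.shape(prev_action))
    # Standard forward-process re-noising of a clean sample.
    return np.sqrt(a_bar) * np.asarray(prev_action) + np.sqrt(1.0 - a_bar) * noise
```

At `a_bar = 1` this returns the previous action unchanged; at `a_bar = 0` it degenerates to pure noise, recovering the usual initialization.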
Paperid:1165
Authors:Milad Khademi Nori · Il-Min Kim · Guanghui Wang
Abstract:
In class-incremental learning (CIL), effective incremental learning strategies are essential to mitigate task confusion and catastrophic forgetting, especially as the number of tasks $t$ increases. Current exemplar replay strategies impose $\mathcal{O}(t)$ memory/compute complexities. We propose an autoencoder-based hybrid replay (AHR) strategy that leverages our new hybrid autoencoder (HAE) to function as a compressor to alleviate the requirement for large memory, achieving $\mathcal{O}(0.1 t)$ in the worst case with a computing complexity of $\mathcal{O}(t)$ while accomplishing state-of-the-art performance. The decoder later recovers the exemplar data stored in the latent space, rather than in raw format. Additionally, HAE is designed for both discriminative and generative modeling, enabling classification and replay capabilities, respectively. HAE adopts the charged particle system energy minimization equations and repulsive force algorithm for the incremental embedding and distribution of new class centroids in its latent space. Our results demonstrate that AHR consistently outperforms recent baselines across multiple benchmarks while operating with the same memory/compute budgets. The source code is included in the supplementary material and will be open-sourced upon publication.
Paperid:1166
Authors:Luis Scoccola · Uzu Lim · Heather Harrington
Abstract:
Classical unsupervised learning methods like clustering and linear dimensionality reduction parametrize large-scale geometry when it is discrete or linear, while more modern methods from manifold learning find low-dimensional representations or infer local geometry by constructing a graph on the input data. More recently, topological data analysis popularized the use of simplicial complexes to represent data topology with two main methodologies: topological inference with geometric complexes and large-scale topology representation with Mapper graphs -- central to these is the nerve construction from topology, which builds a simplicial complex given any cover of a space by subsets. While successful, these have limitations: geometric complexes scale poorly with data size, and Mapper graphs can be hard to tune and only contain low dimensional information. In this paper, we propose to study the problem of learning covers in its own right, and from the perspective of optimization. We describe a method to learn topologically-faithful covers of geometric datasets, and show that the simplicial complexes thus obtained can outperform standard topological inference approaches in terms of size, and Mapper-type algorithms in terms of representation of large-scale topology.
Paperid:1167
Authors:Tom Labiausse · Laurent Mazaré · Edouard Grave · Alexandre Défossez · Neil Zeghidour
Abstract:
We introduce Hibiki, a decoder-only model for simultaneous speech translation. Hibiki leverages a multistream language model to synchronously process source and target speech, and jointly produces text and audio tokens to perform speech-to-text and speech-to-speech translation. We furthermore address the fundamental challenge of simultaneous interpretation, which unlike its consecutive counterpart --where one waits for the end of the source utterance to start translating-- adapts its flow to accumulate just enough context to produce a correct translation in real-time, chunk by chunk. To do so, we introduce a weakly-supervised method that leverages the perplexity of an off-the-shelf text translation system to identify optimal delays on a per-word basis and create aligned synthetic data. After supervised training, Hibiki performs adaptive, simultaneous speech translation with vanilla temperature sampling. On a French-English simultaneous speech translation task, Hibiki demonstrates state-of-the-art performance in translation quality, speaker fidelity and naturalness. Moreover, the simplicity of its inference process makes it compatible with batched translation and even real-time on-device deployment. We provide examples on https://hibiki-s2st.github.io.
Paperid:1168
Authors:Claas Voelcker · Anastasiia Pedan · Arash Ahmadian · Romina Abachi · Igor Gilitschenski · Amir-massoud Farahmand
Abstract:
The idea of value-aware model learning, that models should produce accurate value estimates, has gained prominence in model-based reinforcement learning. The MuZero loss, which penalizes a model's value function prediction compared to the ground-truth value function, has been utilized in several prominent empirical works in the literature. However, theoretical investigation into its strengths and weaknesses is limited. In this paper, we analyze the family of value-aware model learning losses, which includes the popular MuZero loss. We show that these losses, as normally used, are uncalibrated surrogate losses, which means that they do not always recover the correct model and value function. Building on this insight, we propose corrections to solve this issue. Furthermore, we investigate the interplay between loss calibration, latent model architectures, and auxiliary losses that are commonly employed when training MuZero-style agents. We show that while deterministic models can be sufficient to predict accurate values, learning calibrated stochastic models is still advantageous.
Paperid:1169
Authors:Junlong Li · Daya Guo · Dejian Yang · Runxin Xu · Yu Wu · Junxian He
Abstract:
Reasoning is a fundamental capability of Large Language Models. While prior research predominantly focuses on enhancing narrow skills like math or code generation, improving performance on many other reasoning tasks remains challenging due to sparse and fragmented training data. To address this issue, we propose CodeI/O, a novel approach that systematically condenses diverse reasoning patterns inherently embedded in contextually grounded code, through transforming the original code into a code input-output prediction format. By training models to predict inputs/outputs given code and test cases entirely in natural language as Chain-of-Thought (CoT) rationales, we expose them to universal reasoning primitives—like logic flow planning, state-space searching, decision tree traversal, and modular decomposition—while decoupling structured reasoning from code-specific syntax and preserving procedural rigor. Experimental results demonstrate CodeI/O leads to consistent improvements across symbolic, scientific, logic, math & numerical, and commonsense reasoning tasks. By matching the existing ground-truth outputs or re-executing the code with predicted inputs, we can verify each prediction and further enhance the CoTs through multi-turn revision, resulting in CodeI/O++ and achieving higher performance. Our data and models will be publicly available.
Paperid:1170
Authors:Yuguang Yan · Zongyu Li · Haolin Yang · Zeqin Yang · Hao Zhou · Ruichu Cai · Zhifeng Hao
Abstract:
Causal inference seeks to estimate the effect given a treatment such as a medicine or the dosage of a medication. To reduce the confounding bias caused by the non-randomized treatment assignment, most existing methods reduce the shift between subpopulations receiving different treatments. However, these methods split limited training samples into smaller groups, which cuts down the number of samples in each group, while precise distribution estimation and alignment highly rely on a sufficient number of training samples. In this paper, we propose a distribution alignment paradigm without data splitting, which can be naturally applied in the settings of binary and continuous treatments. To this end, we characterize the confounding bias by considering different probability measures of the same set including all the training samples, and exploit the optimal transport theory to analyze the confounding bias and outcome estimation error. Based on this, we propose to learn balanced representations by reducing the bias between the marginal distribution and the conditional distribution of a treatment. As a result, data reduction caused by splitting is avoided, and the outcome prediction model trained on one treatment group can be generalized to the entire population. The experiments on both binary and continuous treatment settings demonstrate the effectiveness of our method.
Paperid:1171
Authors:Minghao Xu · Jiaze Song · Wu Keming · Xiangxin Zhou · Bin Cui · Wentao Zhang
Abstract:
Understanding the various properties of glycans with machine learning has shown some preliminary promise. However, previous methods mainly focused on modeling the backbone structure of glycans as graphs of monosaccharides (i.e., sugar units), while neglecting the atomic structures underlying each monosaccharide, which are actually important indicators of glycan properties. In this work, we fill this gap by introducing the GlycanAA model for All-Atom-wise Glycan modeling. GlycanAA models a glycan as a heterogeneous graph with monosaccharide nodes representing its global backbone structure and atom nodes representing its local atomic-level structures. Based on such a graph, GlycanAA performs hierarchical message passing to capture interactions hierarchically, from local atomic-level interactions to global monosaccharide-level interactions. To further enhance the model capability, we pre-train GlycanAA on a high-quality unlabeled glycan dataset in a self-supervised way, deriving the PreGlycanAA model. Specifically, we design a multi-scale mask prediction algorithm to endow the model with knowledge about different levels of dependencies in a glycan. Extensive benchmark results show the superiority of GlycanAA over existing glycan encoders and verify the further improvements achieved by PreGlycanAA.
Paperid:1172
Authors:Yang Xu · Wenbin Lu · Rui Song
Abstract:
Interference, a key concept in causal inference, extends the reward modeling process by accounting for the impact of one unit's actions on the rewards of others. In contextual bandit (CB) settings where multiple units are present in the same round, interference can significantly affect the estimation of expected rewards for different arms, thereby influencing the decision-making process. Although some prior work has explored multi-agent and adversarial bandits in interference-aware settings, how to model interference in CB remains significantly underexplored. In this paper, we introduce a systematic framework to address interference in Linear CB (LinCB), bridging the gap between causal inference and online decision-making. We propose a series of algorithms that explicitly quantify the interference effect in the reward modeling process and provide comprehensive theoretical guarantees, including sublinear regret bounds, finite sample upper bounds, and asymptotic properties. The effectiveness of our approach is demonstrated through simulations and synthetic data generated based on MovieLens data.
Paperid:1173
Authors:Prakash Palanivelu Rajmohan · Fred Roosta
Abstract:
While norm-based and leverage-score-based methods have been extensively studied for identifying "important" data points in linear models, analogous tools for nonlinear models remain significantly underdeveloped. By introducing the concept of the adjoint operator of a nonlinear map, we address this gap and generalize norm-based and leverage-score-based importance sampling to nonlinear settings. We demonstrate that sampling based on these generalized notions of norm and leverage scores provides approximation guarantees for the underlying nonlinear mapping, similar to linear subspace embeddings. As direct applications, these nonlinear scores not only reduce the computational complexity of training nonlinear models by enabling efficient sampling over large datasets but also offer a novel mechanism for model explainability and outlier detection. Our contributions are supported by both theoretical analyses and experimental results across a variety of supervised learning scenarios.
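For reference, the classical linear-model leverage scores that this line of work generalizes are the diagonal of the hat matrix; a standard, stable way to compute them uses a thin QR factorization:

```python
import numpy as np

def leverage_scores(X):
    """Statistical leverage scores of the rows of X (full column rank):
    diag(X (X^T X)^{-1} X^T), computed as squared row norms of the
    orthonormal factor Q from a thin QR of X.
    """
    Q, _ = np.linalg.qr(np.asarray(X, dtype=float))
    return (Q ** 2).sum(axis=1)
```

Each score lies in [0, 1] and the scores sum to the rank of X, which is what makes them usable (after normalization) as importance-sampling probabilities.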
Paperid:1174
Authors:Dianwen Ng · Kun Zhou · Yi-Wen Chao · Zhiwei Xiong · Bin Ma · EngSiong Chng
Abstract:
Achieving high-fidelity audio compression while preserving perceptual quality across diverse audio types remains a significant challenge in Neural Audio Coding (NAC). This paper introduces MUFFIN, a fully convolutional NAC framework that leverages psychoacoustically guided multi-band frequency reconstruction. Central to MUFFIN is the Multi-Band Spectral Residual Vector Quantization (MBS-RVQ) mechanism, which quantizes latent speech across different frequency bands. This approach optimizes bitrate allocation and enhances fidelity based on psychoacoustic studies, achieving efficient compression with unique perceptual features that separate content from speaker attributes through distinct codebooks. MUFFIN integrates a transformer-inspired convolutional architecture with proposed modified snake activation functions to capture fine frequency details with greater precision. Extensive evaluations on diverse datasets (LibriTTS, IEMOCAP, GTZAN, BBC) demonstrate MUFFIN’s ability to consistently surpass existing performance in audio reconstruction across various domains. Notably, a high-compression variant achieves an impressive SOTA 12.5 kHz rate while preserving reconstruction quality. Furthermore, MUFFIN excels in downstream generative tasks, demonstrating its potential as a robust token representation for integration with large language models. These results establish MUFFIN as a groundbreaking advancement in NAC and as the first neural psychoacoustic coding system. Speech demos are available and codes \textbf{will be released}.\footnote{\url{https://demos46.github.io/muffin/}}.
Paperid:1175
Authors:Eric Elmoznino · Thomas Jiralerspong · Yoshua Bengio · Guillaume Lajoie
Abstract:
Compositionality is believed to be fundamental to intelligence. In humans, it underlies the structure of thought and language. In AI, it enables a powerful form of out-of-distribution generalization, in which a model systematically adapts to novel combinations of known concepts. However, while we have strong intuitions about what compositionality is, there is currently no formal definition for it. Here, we propose such a definition called representational compositionality that is conceptually simple, quantitative, and grounded in algorithmic information theory. Intuitively, representational compositionality states that a compositional representation is both expressive and describable as a simple function of parts. We validate our definition on both real and synthetic data, and show how it unifies disparate intuitions from across the literature in both AI and cognitive science. Our definition has the potential to inspire the design of novel, theoretically-driven models that better capture the mechanisms of compositional thought.
Paperid:1176
Authors:Mohammadsadegh Khorasani · Saber Salehkaleybar · Negar Kiyavash · Matthias Grossglauser
Abstract:
Hierarchical reinforcement learning (HRL) improves the efficiency of long-horizon reinforcement-learning tasks with sparse rewards by decomposing the task into a hierarchy of subgoals. The main challenge of HRL is efficient discovery of the hierarchical structure among subgoals and utilizing this structure to achieve the final goal. We address this challenge by modeling the subgoal structure as a causal graph and propose a causal discovery algorithm to learn it. Additionally, rather than intervening on the subgoals at random during exploration, we harness the discovered causal model to prioritize subgoal interventions based on their importance in attaining the final goal. These targeted interventions result in a significantly more efficient policy in terms of the training cost. Unlike previous work on causal HRL, which lacked theoretical analysis, we provide a formal analysis of the problem. Specifically, for tree structures and for a variant of Erdős-Rényi random graphs, our approach results in remarkable improvements. Our experimental results on HRL tasks also illustrate that our proposed framework outperforms existing work in terms of training cost.
Paperid:1177
Authors:Hanzhang Wang · Qingyuan Ma
Abstract:
Typographic attacks are often attributed to multimodal pretrained visual models' ability to fuse textual semantics into visual representations, but how and where such interference occurs remains unclear. We investigate whether these models encode textual semantics through representation complexity, and identify the mechanisms by which text disrupts visual understanding. To decouple orthography from semantics, we introduce the ToT dataset, containing minimal pairs of words that either share semantics with distinct visual forms (synonyms) or match visual forms with conflicting semantics (paronyms). By analyzing layer-wise Intrinsic Dimension (ID), we reveal that early layers exhibit competing dynamics between orthographic features and semantics, while later layers improve semantic accuracy primarily through orthographic disambiguation. Crucially, semantically driven representations emerge only in the final block, challenging the assumption of progressive semantic understanding. These findings reveal how multimodal models simulate semantic representations through visual patterns, prompting a deeper reconsideration of the gap between visual perception and semantic understanding.
Paperid:1178
Authors:Fuying Wang · Jiacheng Xu · Lequan Yu
Abstract:
Electrocardiograms (ECGs) play a vital role in monitoring cardiac health and diagnosing heart diseases. However, traditional deep learning approaches for ECG analysis rely heavily on large-scale manual annotations, which are both time-consuming and resource-intensive to obtain. To overcome this limitation, self-supervised learning (SSL) has emerged as a promising alternative, enabling the extraction of robust ECG representations that can be efficiently transferred to various downstream tasks. While previous studies have explored SSL for ECG pretraining and multi-modal ECG-language alignment, they often fail to capture the multi-scale nature of ECG signals. As a result, these methods struggle to learn generalized representations due to their inability to model the hierarchical structure of ECG data. To address this gap, we introduce MELP, a novel Multi-scale ECG-Language Pretraining model that fully leverages hierarchical supervision from ECG-text pairs. MELP first pretrains a cardiology-specific language model to enhance its understanding of clinical text. It then applies three levels of cross-modal supervision—at the token, beat, and rhythm levels—to align ECG signals with textual reports, capturing structured information across different time scales. We evaluate MELP on three public ECG datasets across multiple tasks, including zero-shot ECG classification, linear probing, and transfer learning. Experimental results demonstrate that MELP outperforms existing SSL methods, underscoring its effectiveness and adaptability across diverse clinical applications. Our code is anonymously available at https://anonymous.4open.science/r/MM-ECG-FM-0A9B.
Paperid:1179
Authors:Anna Hedström · Salim I. Amoukou · Tom Bewley · Saumitra Mishra · Manuela Veloso
Abstract:
We propose mechanistic error reduction with abstention (MERA), a principled steering framework for error mitigation in language models (LMs) that performs selective and adaptive interventions guided by error estimators. Existing steering methods construct steering vectors heuristically, such as by contrasting activations from examples with and without a target property (e.g., toxic vs. non-toxic outputs). Adding these vectors to the model’s activations has shown promise for amplifying or suppressing specific behaviours. However, they typically rely on fixed intervention strengths, often leading to understeering or oversteering, without principles to balance safety and effective model corrections. MERA addresses these limitations by (i) optimising the intervention direction and (ii) calibrating when and how much to steer using user-defined constraints. Experiments across diverse datasets and LM families demonstrate notable improvements in task accuracy, establishing MERA as a general-purpose, safe and efficient method for mechanistic steering.
Paperid:1180
Authors:Haoran Zhang · Aparna Balagopalan · Nassim Oufattole · Hyewon Jeong · Yan Wu · Jiacheng Zhu · Marzyeh Ghassemi
Abstract:
Large repositories of image-caption pairs are essential for the development of vision-language models. However, these datasets are often extracted from noisy data scraped from the web, and contain many mislabeled instances. In order to improve the reliability of downstream models, it is important to identify and filter images with incorrect captions. However, beyond filtering based on image-caption embedding similarity, no prior works have proposed other methods to filter noisy multimodal data, or concretely assessed the impact of noisy captioning data on downstream training. In this work, we propose, theoretically justify, and empirically validate LEMoN, a method to automatically identify label errors in image-caption datasets. Our method leverages the multimodal neighborhood of image-caption pairs in the latent space of contrastively pretrained multimodal models to automatically identify label errors. Through empirical evaluations across eight datasets and ten baselines, we find that LEMoN outperforms the baselines by over 3% in label error detection, and that training on datasets filtered using our method improves downstream captioning performance by 2 BLEU points.
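A minimal neighborhood-based mismatch score in this spirit (an illustrative heuristic, not LEMoN's exact scoring rule) combines direct image-caption similarity with agreement against the captions of each image's nearest image-neighbors:

```python
import numpy as np

def label_error_scores(img_emb, txt_emb, k=5):
    """Score image-caption pairs; lower scores suggest a mismatched
    caption. Combines (i) image<->own-caption cosine similarity with
    (ii) mean similarity between the caption and captions of the
    image's k nearest image-neighbors.
    """
    def normalize(z):
        z = np.asarray(z, dtype=float)
        return z / np.linalg.norm(z, axis=1, keepdims=True)
    I, T = normalize(img_emb), normalize(txt_emb)
    direct = (I * T).sum(axis=1)                 # image<->own caption
    img_sim = I @ I.T
    np.fill_diagonal(img_sim, -np.inf)           # exclude self-matches
    nbrs = np.argsort(-img_sim, axis=1)[:, :k]   # k nearest images
    nbr_agree = np.take_along_axis(T @ T.T, nbrs, axis=1).mean(axis=1)
    return direct + nbr_agree
```

On a toy set where one image carries the wrong caption, that pair receives the lowest score, since its caption disagrees both with its own image and with its neighbors' captions.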
Paperid:1181
Authors:Yingzhen Yang
Abstract:
Sharp generalization bounds for neural networks trained by gradient descent (GD) are of central interest in statistical learning theory and deep learning. In this paper, we consider nonparametric regression by an over-parameterized two-layer NN trained by GD. We show that, if the neural network is trained by GD with early stopping, then the trained network renders a sharp rate of the nonparametric regression risk of $\mathcal{O}(\epsilon_n^2)$, which is the same rate as that for classical kernel regression trained by GD with early stopping, where $\epsilon_n$ is the critical population rate of the Neural Tangent Kernel (NTK) associated with the network and $n$ is the size of the training data. It is remarked that our result does not require distributional assumptions on the covariate as long as the covariate lies on the unit sphere, in strong contrast with many existing results which rely on specific distributions such as the spherical uniform data distribution or distributions satisfying certain restrictive conditions. As a special case of our general result, when the eigenvalues of the associated NTK decay at a rate of $\lambda_j \asymp j^{-\frac{d}{d-1}}$ for $j \ge 1$, which happens under certain distributional assumptions such as the training features following the spherical uniform distribution, we immediately obtain the minimax optimal rate of $\mathcal{O}(n^{-\frac{d}{2d-1}})$, which is the major result of several existing works in this direction. The neural network width in our general result is lower bounded by a function of only $d$ and $\epsilon_n$, and such width does not depend on the minimum eigenvalue of the empirical NTK matrix, whose lower bound usually requires additional assumptions on the training data. Our results are built upon two significant technical results which are of independent interest. First, uniform convergence to the NTK is established during the training process by GD, so that we can have a nice decomposition of the neural network function at any step of GD into a function in the Reproducing Kernel Hilbert Space associated with the NTK and an error function with a small $L^{\infty}$-norm. Second, local Rademacher complexity is employed to tightly bound the Rademacher complexity of the function class comprising all the possible neural network functions obtained by GD. Our result formally fills the gap between training a classical kernel regression model and training an over-parameterized but finite-width neural network by GD for nonparametric regression without distributional assumptions on the spherical covariate.
Paperid:1182
Authors:Haidong Kang
Abstract:
Neural Architecture Search (NAS) has recently outperformed hand-designed networks in various artificial intelligence areas. However, previous works only target a pre-defined task. For a new task in few-shot learning (FSL) scenarios, the architecture is either searched from scratch, which is neither efficient nor flexible, or borrowed from architectures obtained on other tasks, which may lead to sub-optimal results. Can we select the best neural architectures without involving any training and eliminate a significant portion of the search cost for new tasks in FSL? In this work, we provide an affirmative answer by proposing a novel information bottleneck (IB) theory driven \textit{Few-shot Neural Architecture Search} (dubbed IBFS) framework to address this issue. We first derive that the global convergence of Model-Agnostic Meta-Learning (MAML) can be guaranteed by only considering the first-order loss landscape. Moreover, motivated by the observation that IB provides a unified view toward understanding machine learning models, we propose a novel Zero-Cost method tailored for FSL to rank and select architectures based on their \textit{expressivity} obtained by IB mechanisms. Extensive experiments show that IBFS achieves state-of-the-art performance in FSL without training, which demonstrates the effectiveness of IBFS.
Paperid:1183
Authors:Mingzhen He · Ruikai Yang · Hanling Tian · Youmei Qiu · Xiaolin Huang
Abstract:
Graph Transformers (GTs) have emerged as a promising approach for graph representation learning. Despite their successes, the quadratic complexity of GTs limits scalability on large graphs due to their pairwise computations. To fundamentally reduce the computational burden of GTs, we propose a primal-dual framework that interprets the self-attention mechanism on graphs as a dual representation. Based on this framework, we develop Primphormer, an efficient GT that leverages a primal representation with linear complexity. Theoretical analysis reveals that Primphormer serves as a universal approximator for functions on both sequences and graphs, while also retaining its expressive power for distinguishing non-isomorphic graphs. Extensive experiments on various graph benchmarks demonstrate that Primphormer achieves competitive empirical results while maintaining more favorable memory and computational costs.
Paperid:1184
Authors:Jesse He · Helen Jenne · Herman Chau · Davis Brown · Mark Raugas · Sara Billey · Henry Kvinge
Abstract:
Machine learning is becoming an increasingly valuable tool in mathematics, enabling one to identify subtle patterns across collections of examples so vast that they would be impossible for a single researcher to feasibly review and analyze. In this work, we use graph neural networks to investigate quiver mutation---an operation that transforms one quiver (or directed multigraph) into another---which is central to the theory of cluster algebras, with deep connections to geometry, topology, and physics. In the study of cluster algebras, the question of mutation equivalence is of fundamental concern: given two quivers, can one efficiently determine if one quiver can be transformed into the other through a sequence of mutations? In this paper, we use graph neural networks and AI explainability techniques to independently discover mutation equivalence criteria for quivers of type $\tilde{D}$. Along the way, we also show that even without explicit training to do so, our model captures structure within its hidden representation that allows us to reconstruct known criteria from type $D$, adding to the growing evidence that modern machine learning models are capable of learning abstract and general rules from mathematical data.
Paperid:1185
Authors:Lea Bogensperger · Dominik Narnhofer · Ahmed Allam · Konrad Schindler · Michael Krauthammer
Abstract:
The goal of protein fitness optimization is to discover new protein variants with enhanced fitness for a given use. The vast search space and the sparsely populated fitness landscape, along with the discrete nature of protein sequences, pose significant challenges when trying to determine the gradient towards configurations with higher fitness. We introduce Variational Latent Generative Protein Optimization (VLGPO), a variational perspective on fitness optimization. Our method embeds protein sequences in a continuous latent space to enable efficient sampling from the fitness distribution, and combines a (learned) flow matching prior over sequence mutations with a fitness predictor to guide optimization towards sequences with high fitness. VLGPO achieves state-of-the-art results on two different protein benchmarks of varying complexity. Moreover, the variational design with explicit prior and likelihood functions offers a flexible plug-and-play framework that can be easily customized to suit various protein design tasks.
Paperid:1186
Authors:Shanda Li · Shinjae Yoo · Yiming Yang
Abstract:
Fourier Neural Operators (FNOs) offer a principled approach for solving complex partial differential equations (PDEs). However, scaling them to handle more intricate PDEs requires increasing the number of Fourier modes, which significantly expands the number of model parameters and makes hyperparameter tuning computationally impractical. To address this, we introduce $\mu$Transfer-FNO, a zero-shot hyperparameter transfer technique that enables optimal configurations, tuned on smaller FNOs, to be directly applied to billion-parameter FNOs without additional tuning. Building on the Maximum Update Parametrization ($\mu$P) framework, we mathematically derive a parametrization scheme that facilitates the transfer of optimal hyperparameters across models with different numbers of Fourier modes in FNOs, which is validated through extensive experiments on various PDEs. Our empirical study shows that $\mu$Transfer-FNO reduces the computational cost of tuning hyperparameters on large FNOs while maintaining or improving accuracy.
Paperid:1187
Authors:Chuanhui Liu · Xiao Wang
Abstract:
This paper provides a comprehensive analysis of variational inference in latent variable models for survival analysis, emphasizing the distinctive challenges associated with applying variational methods to survival data. We identify a critical weakness in the existing methodology, demonstrating how a poorly designed variational distribution may hinder the objective of survival analysis tasks: modeling time-to-event distributions. We prove that the optimal variational distribution, which perfectly bounds the log-likelihood, may depend on the censoring mechanism. To address this issue, we propose censor-dependent variational inference (CDVI), tailored for latent variable models in survival analysis. More practically, we introduce CD-CVAE, a V-structure Variational Autoencoder (VAE) designed for the scalable implementation of CDVI. Further discussion extends some existing theories and training techniques to survival analysis. Extensive experiments validate our analysis and demonstrate significant improvements in the estimation of individual survival distributions.
Paperid:1188
Authors:Jason Qin · Hans-Hermann Wessels · Carlos Fernandez-Granda · Yuhan Hao
Abstract:
Combinatorial CRISPR screening enables large-scale identification of synergistic gene pairs for combination therapies, but exhaustive experimentation is infeasible. We introduce NAIAD (https://anonymous.4open.science/r/NAIAD-4CA6/), an active learning framework that efficiently discovers optimal gene pairs by leveraging single-gene perturbation effects and adaptive gene embeddings that scale with the training data size, mitigating overfitting in small-sample learning while capturing complex gene interactions as more data is collected. Evaluated on four CRISPR datasets with over 350,000 interactions, NAIAD trained on small datasets outperforms existing models by up to 40\%. Its recommendation system prioritizes gene pairs with the maximum predicted effects, accelerating discovery with fewer experiments. We also extend NAIAD to optimal drug combination identification among 2,000 candidates. Overall, NAIAD enhances combinatorial perturbation design and drives advances in genomics research and therapeutic development in combination therapy.
Paperid:1189
Authors:Jiahui Zhu · Kihyun Yu · Dabeen Lee · Xin Liu · Honghao Wei
Abstract:
Online safe reinforcement learning (RL) plays a key role in dynamic environments, with applications in autonomous driving, robotics, and cybersecurity. The objective is to learn optimal policies that maximize rewards while satisfying safety constraints modeled by constrained Markov decision processes (CMDPs). Existing methods achieve sublinear regret under stochastic constraints but often fail in adversarial settings, where constraints are unknown, time-varying, and potentially adversarially designed. In this paper, we propose the Optimistic Mirror Descent Primal-Dual (OMDPD) algorithm, the first to address online CMDPs with anytime adversarial constraints. OMDPD achieves optimal regret $\tilde{\mathcal{O}}(\sqrt{K})$ and strong constraint violation $\tilde{\mathcal{O}}(\sqrt{K})$ without relying on Slater’s condition or the existence of a strictly known safe policy. We further show that access to accurate estimates of rewards and transitions can improve these bounds. Our results offer practical guarantees for safe decision-making in adversarial environments.
Paperid:1190
Authors:Laixi Shi · Jingchu Gai · Eric Mazumdar · Yuejie Chi · Adam Wierman
Abstract:
Standard multiagent reinforcement learning (MARL) algorithms are vulnerable to sim-to-real gaps. To address this, distributionally robust Markov games (RMGs) have been proposed to enhance robustness in MARL by optimizing the worst-case performance when game dynamics shift within a prescribed uncertainty set. RMGs remain under-explored, from reasonable problem formulation to the development of sample-efficient algorithms. Two notorious and open challenges are the formulation of the uncertainty set and whether the corresponding RMGs can overcome the curse of multiagency, where the sample complexity scales exponentially with the number of agents. In this work, we propose a natural class of RMGs inspired by behavioral economics, where each agent's uncertainty set is shaped by both the environment and the integrated behavior of other agents. We first establish the well-posedness of this class of RMGs by proving the existence of game-theoretic solutions such as robust Nash equilibria and coarse correlated equilibria (CCE). Assuming access to a generative model, we then introduce a sample-efficient algorithm for learning the CCE whose sample complexity scales polynomially with all relevant parameters. To the best of our knowledge, this is the first algorithm to break the curse of multiagency for RMGs, regardless of the uncertainty set formulation.
Paperid:1191
Authors:Michael Munn · Susan Wei
Abstract:
Recent advances in artificial intelligence have been fueled by the development of foundation models such as BERT, GPT, T5, and Vision Transformers. These models are first pretrained on vast and diverse datasets and then adapted to specific downstream tasks, often with significantly less data. However, the mechanisms behind the success of this ubiquitous pretrain-then-adapt paradigm remain underexplored, particularly the characteristics of pretraining checkpoints that enhance downstream adaptation. We introduce a Bayesian model selection criterion, called the downstream free energy, which quantifies a checkpoint’s adaptability by measuring the concentration of nearby favorable parameters for a downstream task. We demonstrate that this Bayesian model selection criterion can be effectively implemented without access to the downstream data or prior knowledge of the downstream task. Furthermore, we provide empirical evidence that the criterion reliably correlates with improved fine-tuning performance, offering a principled approach to predicting model adaptability.
Paperid:1192
Authors:Yixin Cheng · Hongcheng Guo · Yangming Li · Leonid Sigal
Abstract:
Text watermarking aims to subtly embed statistical signals into text by controlling the Large Language Model (LLM)'s sampling process, enabling watermark detectors to verify that the output was generated by the specified model. The robustness of these watermarking algorithms has become a key factor in evaluating their effectiveness. Current text watermarking algorithms embed watermarks in high-entropy tokens to ensure text quality. In this paper, we reveal that this seemingly benign design can be exploited by attackers, posing a significant risk to the robustness of the watermark. We introduce a generic and efficient paraphrasing attack, the Self-Information Rewrite Attack (SIRA), which leverages this vulnerability by calculating the self-information of each token to identify potential pattern tokens and perform targeted attacks. Our work exposes a widely prevalent vulnerability in current watermarking algorithms. The experimental results show SIRA achieves nearly 100\% attack success rates on seven recent watermarking methods at a cost of only \$0.88 per million tokens. Our approach does not require any access to the watermark algorithms or the watermarked LLM, and can seamlessly transfer to any LLM as the attack model, even mobile-level models. Our findings highlight the urgent need for more robust watermarking.
Paperid:1193
Authors:Wen Wang · Ruibing Hou · Hong Chang · Shiguang Shan · Xilin Chen
Abstract:
Large audio-language models (LALMs), built upon powerful Large Language Models (LLMs), have exhibited remarkable audio comprehension and reasoning capabilities. However, the training of LALMs demands a large corpus of audio-language pairs, which requires substantial costs in both data collection and training resources. In this paper, we propose MATS, an audio-language multimodal LLM designed to handle Multiple Audio tasks using solely Text-only Supervision. By leveraging pre-trained audio-language alignment models such as CLAP, we develop a text-only training strategy that projects the shared audio-language latent space into the LLM latent space, endowing the LLM with audio comprehension capabilities without relying on audio data during training. To further bridge the modality gap between audio and language embeddings within CLAP, we propose the Strongly-related noisy text with audio (Santa) mechanism. Santa maps audio embeddings into the CLAP language embedding space while preserving essential information from the audio input. Extensive experiments demonstrate that MATS, despite being trained exclusively on text data, achieves competitive performance compared to recent LALMs trained on large-scale audio-language pairs.
Paperid:1194
Authors:Haibo Chen · Xin Wang · Zeyang Zhang · Haoyang Li · Ling Feng · Wenwu Zhu
Abstract:
Graph Foundation Models (GFMs) aim to share graph knowledge across diverse graph domains and tasks to boost graph machine learning. However, existing GFMs rely on hand-designed and fixed Graph Neural Network (GNN) architectures, failing to utilize optimal graph neural architectures \textit{w.r.t.} specific domains and tasks, inevitably leading to suboptimal performance in diverse graph domains and tasks. In this paper, we explore automated graph neural architecture search (GNAS) for GFMs, which suffers from the \textit{architecture inconsistency} problem, i.e., the optimal architectures for different tasks and domains vary. We tackle this problem by discovering an invariant graph-architecture relationship across domains and tasks. However, this imposes three challenges: i) how to capture invariant and variant patterns; ii) how to customize graph neural architectures to adapt to diverse domains and tasks; iii) how to mitigate the data domination phenomenon during the architecture search process. To address these challenges, we propose a novel Automated Graph Foundation Model with Adaptive Graph Neural Architecture Customization (\textbf{GFA}). Specifically, we first propose a disentangled contrastive graph encoder to learn invariant and variant patterns from graph data. Then, we design an invariant-guided architecture customization strategy to customize graph neural architectures for data from diverse domains and tasks. Finally, we propose a curriculum architecture customization mechanism to mitigate the phenomenon of some particular data dominating the search process. We further provide theoretical analysis demonstrating the limitation of existing GNAS methods under the \textit{architecture inconsistency} problem. Extensive experiments demonstrate that our GFA model outperforms baselines, achieving state-of-the-art performance. To the best of our knowledge, this is the first work to explore the problem of graph neural architecture search for GFMs.
Paperid:1195
Authors:Konstantin Donhauser · Kristina Ulicna · Gemma Moran · Aditya Ravuri · Kian Kenyon-Dean · Cian Eastwood · Jason Hartford
Abstract:
Sparse dictionary learning (DL) has become a popular tool for interpreting trained large language models (LLMs). Specifically, sparse DL approaches can extract semantically meaningful concepts from the internals of LLMs trained on text. In this work, we ask if DL can also be used to extract concepts from less human-interpretable scientific data, such as cell microscopy foundation models trained on cell imaging data, where limited prior knowledge exists regarding which high-level concepts should arise. We propose a novel combination of a DL algorithm, Iterative Codebook Feature Learning~(ICFL), with a pre-processing step using PCA whitening from a control dataset. We demonstrate that our approach successfully retrieves biologically meaningful concepts, such as cell type and genetic perturbations. Moreover, we demonstrate how our approach can help better understand morphological changes in cells induced by genetic perturbations.
Paperid:1196
Authors:Fan Qi · Daxu Shi · Chuokun Xu · Changsheng Xu · Shuai Li
Abstract:
Federated Distillation (FedKD) relies on lightweight knowledge carriers like logits for efficient client-server communication. Although logit-based methods have demonstrated promise in addressing statistical and architectural heterogeneity in federated learning (FL), current approaches remain constrained by suboptimal temperature calibration during knowledge fusion. To address these limitations, we propose ReT-FHD, a framework featuring: 1) Multi-level Elastic Temperature, which dynamically adjusts distillation intensities across model layers, achieving optimized knowledge transfer between heterogeneous local models; 2) Category-Aware Global Temperature Scaling, which implements class-specific temperature calibration based on confidence distributions in global logits, enabling personalized distillation policies; 3) Z-Score Guard, a blockchain-verified validation mechanism mitigating 44\% of label-flipping and model poisoning attacks. Evaluations across diverse benchmarks with varying model/data heterogeneity demonstrate that ReT-FHD achieves significant accuracy improvements over baseline methods while substantially reducing communication costs compared to existing approaches. Our work establishes that properly calibrated logits can serve as self-sufficient carriers for building scalable and secure heterogeneous FL systems. The code is available in the additional material.
Paperid:1197
Authors:Ye Liu · Yuntian Chen
Abstract:
Automotive drag coefficient ($C_d$) is pivotal to energy efficiency, fuel consumption, and aerodynamic performance. However, costly computational fluid dynamics (CFD) simulations and wind tunnel tests struggle to meet the rapid-iteration demands of automotive design. We present DragSolver, a Transformer-based framework for rapid and accurate $C_d$ estimation from large-scale, diverse 3D vehicle models. DragSolver tackles four key real-world challenges: (1) multi-scale feature extraction to capture both global shape and fine local geometry; (2) heterogeneous scale normalization to handle meshes with varying sizes and densities; (3) surface-guided gating to suppress internal structures irrelevant to external aerodynamics; and (4) epistemic uncertainty estimation via Monte Carlo dropout for risk-aware design. Extensive evaluations on three industrial-scale datasets (DrivaerNet, DrivaerNet++, and DrivaerML) show that DragSolver outperforms existing approaches in accuracy and generalization, achieving an average reduction in relative $L_2$ error of 58.7% across real-world datasets. Crucially, DragSolver is the first to achieve reliable, real-time $C_d$ inference on production-level automotive geometries. We will release our code upon acceptance to encourage broader adoption of learning-based aerodynamic analytics.
Paperid:1198
Authors:Jaehyeon Kim · Taehong Moon · Keon Lee · Jaewoong Cho
Abstract:
We introduce ResGen, an efficient Residual Vector Quantization (RVQ)-based generative model for high-fidelity generation with fast sampling. RVQ improves data fidelity by increasing the number of quantization steps, referred to as depth, but deeper quantization typically increases inference steps in generative models. To address this, ResGen directly predicts the vector embedding of collective tokens rather than individual ones, ensuring that inference steps remain independent of RVQ depth. Additionally, we formulate token masking and multi-token prediction within a probabilistic framework using discrete diffusion and variational inference. We validate the efficacy and generalizability of the proposed method on two challenging tasks across different modalities: conditional image generation on ImageNet 256×256 and zero-shot text-to-speech synthesis. Experimental results demonstrate that ResGen outperforms autoregressive counterparts in both tasks, delivering superior performance without compromising sampling speed. Furthermore, as we scale the depth of RVQ, our generative models exhibit enhanced generation fidelity or faster sampling speeds compared to similarly sized baseline models.
Paperid:1199
Authors:Neha Sangwan · Arya Mazumdar
Abstract:
We consider the problem of *exact* recovery of a $k$-sparse binary vector from generalized linear measurements (such as *logistic regression*). We analyze the *linear estimation* algorithm (Plan, Vershynin, Yudovina, 2017), and also show information-theoretic lower bounds on the number of required measurements. As a consequence of our results, for noisy one-bit quantized linear measurements ($\mathsf{1bCS}$), we obtain a sample complexity of $O((k+\sigma^2)\log{n})$, where $\sigma^2$ is the noise variance. This is shown to be optimal due to the information-theoretic lower bound. We also obtain a tight sample complexity characterization for logistic regression. Since $\mathsf{1bCS}$ is a strictly harder problem than noisy linear measurements ($\mathsf{SparseLinearReg}$) because of the added quantization, the same sample complexity is achievable for $\mathsf{SparseLinearReg}$. While this sample complexity can be obtained via the popular lasso algorithm, linear estimation is computationally more efficient. Our lower bound holds for any set of measurements for $\mathsf{SparseLinearReg}$ (a similar bound was known for Gaussian measurement matrices) and is closely matched by the maximum-likelihood upper bound. For $\mathsf{SparseLinearReg}$, it was conjectured in Gamarnik and Zadik, 2017 that there is a statistical-computational gap and the number of measurements should be at least $(2k+\sigma^2)\log{n}$ for efficient algorithms to exist. It is worth noting that our results imply that there is no such statistical-computational gap for $\mathsf{1bCS}$ and logistic regression.
Paperid:1200
Authors:Zongyu Lin · Yao Tang · Xingcheng Yao · Da Yin · ziniu hu · Yizhou Sun · Kai-Wei Chang
Abstract:
Language agents have become a promising solution to complex interactive tasks. One of the key ingredients to the success of language agents is the reward model on the trajectory of the agentic workflow, which provides valuable guidance during training or inference. However, due to the lack of annotations of intermediate interactions, most existing works use an outcome reward model to optimize policies across entire trajectories. This may lead to suboptimal policies and hinder the overall performance. To address this, we propose QLASS (Q-guided Language Agent Stepwise Search) to automatically generate annotations by estimating Q-values in a stepwise manner for open language agents. By introducing a reasoning tree and performing process reward modeling, QLASS provides effective intermediate guidance for each step. With the stepwise guidance, we propose a Q-guided generation strategy to enable language agents to better adapt to long-term value, resulting in significant performance improvement during model inference on complex interactive agent tasks. Notably, even with almost half the annotated data, QLASS retains strong performance, demonstrating its efficiency in handling limited supervision. We also empirically demonstrate that QLASS can lead to more effective decision making through qualitative analysis. We will release our code and data.
Paperid:1201
Authors:Yu-Yang Qian · Yuan-Ze Xu · Zhen-Yu Zhang · Peng Zhao · Zhi-Hua Zhou
Abstract:
Many real-world applications collect data in a streaming environment, where learning tasks are encountered sequentially. This necessitates continual learning (CL) to update models online, enabling adaptation to new tasks while preserving past knowledge to prevent catastrophic forgetting. Nowadays, with the flourishing of large pre-trained models (LPMs), efficiency has become increasingly critical for CL, due to their substantial computational demands and growing parameter sizes. In this paper, we introduce TreeLoRA (K-D Tree of Low-Rank Adapters), a novel approach that constructs layer-wise adapters by leveraging hierarchical gradient similarity to enable efficient CL, particularly for LPMs. To reduce the computational burden of task similarity estimation, we employ bandit techniques to develop an algorithm based on lower confidence bounds to efficiently explore the task structure. Furthermore, we use sparse gradient updates to facilitate parameter optimization, making the approach better suited for LPMs. Theoretical analysis is provided to justify the rationale behind our approach, and experiments on both vision transformers (ViTs) and large language models (LLMs) demonstrate the effectiveness and efficiency of our approach across various domains, including vision and natural language processing tasks.
Paperid:1202
Authors:Anna Thomas · Adam Yee · Andrew Mayne · Maya Mathur · Dan Jurafsky · Kristina Gligoric
Abstract:
Food systems are responsible for a third of human-caused greenhouse gas emissions. We investigate what Large Language Models (LLMs) can contribute to reducing the environmental impacts of food production. We define a typology of design and prediction tasks based on the sustainable food literature and collaboration with domain experts, and evaluate six LLMs on four tasks in our typology. For example, for a sustainable protein design task, food science experts estimated that collaboration with an LLM can reduce time spent by 45% on average, compared to 22% for collaboration with another expert human food scientist. However, for a sustainable menu design task, LLMs produce suboptimal solutions when instructed to consider both human satisfaction and climate impacts. We propose a general framework for integrating LLMs with combinatorial optimization to improve reasoning capabilities. Our approach decreases emissions of food choices by 79% in a hypothetical restaurant while maintaining participants' satisfaction with their set of choices. Our results demonstrate LLMs' potential, supported by optimization techniques, to accelerate sustainable food development and adoption.
Paperid:1203
Authors:Mateusz Piotrowski · Paul Riechers · Daniel Filan · Adam Shai
Abstract:
What computational structures emerge in transformers trained on next-token prediction? In this work, we provide evidence that transformers implement constrained Bayesian belief updating---a parallelized version of partial Bayesian inference shaped by architectural constraints. To do this, we integrate the model-agnostic theory of optimal prediction with mechanistic interpretability to analyze transformers trained on a tractable family of hidden Markov models that generate rich geometric patterns in neural activations. We find that attention heads carry out an algorithm with a natural interpretation in the probability simplex, and create representations with distinctive geometric structure. We show how both the algorithmic behavior and the underlying geometry of these representations can be theoretically predicted in detail---including the attention pattern, OV-vectors, and embedding vectors---by modifying the equations for optimal future token predictions to account for the architectural constraints of attention. Our approach provides a principled lens on how gradient descent resolves the tension between optimal prediction and architectural design.
Paperid:1204
Authors:Zhengchao Wan · Qingsong Wang · Gal Mishne · Yusu Wang
Abstract:
Flow matching (FM) models extend ODE-sampler-based diffusion models into a general framework, significantly reducing sampling steps through learned vector fields. However, the theoretical understanding of FM models, particularly how their sample trajectories interact with the underlying data geometry, remains underexplored. A rigorous theoretical analysis of the FM ODE is essential for sample quality, stability, and broader applicability. In this paper, we advance the theory of FM models through a comprehensive analysis of sample trajectories. Central to our theory is the discovery that the denoiser, a key component of FM models, guides ODE dynamics through attracting and absorbing behaviors that adapt to data geometry. Our results identify and analyze the three stages of ODE evolution: in the initial and intermediate stages, trajectories move toward the mean and local clusters of the data. At the terminal stage, we rigorously establish the convergence of the FM ODE under weak assumptions, addressing scenarios where the data lie on a low-dimensional submanifold---cases that previous results could not handle. Our terminal-stage analysis offers insights into the memorization phenomenon and establishes equivariance properties of FM ODEs. These findings bridge critical gaps in understanding flow matching models, with practical implications for optimizing sampling strategies and architectures guided by the intrinsic geometry of data.
Paperid:1205
Authors:Tyler Kastner · Mark Rowland · Yunhao Tang · Murat Erdogdu · Amir-massoud Farahmand
Abstract:
We study the problem of distributional reinforcement learning using categorical parameterisations and a KL divergence loss. Previous work analyzing categorical distributional RL has done so using a Cramér distance-based loss, simplifying the analysis but creating a theory-practice gap. We introduce a preconditioned version of the algorithm, and prove that it is guaranteed to converge. We further derive the asymptotic variance of the categorical estimates under different learning rate regimes, compare it to that of classical reinforcement learning, and analyze how these updates are affected in the linear function approximation setting. Finally, we empirically validate our results, investigate the relative strengths of using KL losses, and derive a number of actionable insights for practitioners.
Paperid:1206
Authors:Qiuhao Wang · Yuqi Zha · Chin Pang Ho · Marek Petrik
Abstract:
Robust Markov Decision Processes (MDPs) offer a promising framework for computing reliable policies under model uncertainty. While policy gradient methods have gained increasing popularity in robust discounted MDPs, their application to the average-reward criterion remains largely unexplored. This paper proposes Robust Projected Policy Gradient (RP2G), the first generic policy gradient method for robust average-reward MDPs (RAMDPs) that is applicable beyond the typical rectangularity assumption on transition ambiguity. In contrast to existing robust policy gradient algorithms, RP2G incorporates an adaptive decreasing tolerance mechanism for efficient policy updates at each iteration. We also present a comprehensive convergence analysis of RP2G for solving ergodic tabular RAMDPs. Furthermore, we present the first study of the inner worst-case transition evaluation problem in RAMDPs, proposing two gradient-based algorithms tailored for rectangular and general ambiguity sets, each with provable convergence guarantees. Numerical experiments confirm the global convergence of our new algorithm and demonstrate its superior performance.
Paperid:1207
Authors:Matthieu Meeus · Lukas Wutschitz · Santiago Zanella-Beguelin · Shruti Tople · Reza Shokri
Abstract:
How much information about training samples can be gleaned from synthetic data generated by Large Language Models (LLMs)? Overlooking the subtleties of information flow in synthetic data generation pipelines can lead to a false sense of privacy. In this paper, we design membership inference attacks (MIAs) that target data used to fine-tune pre-trained LLMs that are then used to synthesize data, particularly when the adversary does not have access to the fine-tuned model but only to the synthetic data. We show that such data-based MIAs do significantly better than a random guess, meaning that synthetic data leaks information about the training data. Further, we find that canaries crafted to maximize vulnerability to model-based MIAs are sub-optimal for privacy auditing when only synthetic data is released. Such out-of-distribution canaries have limited influence on the model’s output when prompted to generate useful, in-distribution synthetic data, which drastically reduces their vulnerability. To tackle this problem, we leverage the mechanics of auto-regressive models to design canaries with an in-distribution prefix and a high-perplexity suffix that leave detectable traces in synthetic data. This enhances the power of data-based MIAs and provides a better assessment of the privacy risks of releasing synthetic data generated by LLMs.
Paperid:1208
Authors:Meera Hahn · Wenjun Zeng · Nithish Kannen · Rich Galt · Kartikeya Badola · Been Kim · Zi Wang
Abstract:
User prompts for generative AI models are often underspecified, leading to suboptimal responses. This problem is particularly evident in text-to-image (T2I) generation, where users commonly struggle to articulate their precise intent. This disconnect between the user's vision and the model's interpretation often forces users to painstakingly and repeatedly refine their prompts. To address this, we propose a design for proactive T2I agents equipped with an interface to (1) actively ask clarification questions when uncertain, and (2) present their understanding of user intent as an understandable belief graph that a user can edit. We build simple prototypes for such agents and verify their effectiveness through both human studies and automated evaluation. We observed that at least 90% of human subjects found these agents and their belief graphs helpful for their T2I workflow. Moreover, we develop a scalable automated evaluation approach using two agents, one holding a ground-truth image and the other trying to ask as few questions as possible to align with it. On DesignBench, a benchmark we created for artists and designers, the COCO dataset (Lin et al., 2014), and ImageInWords (Garg et al., 2024), we observed that these T2I agents were able to ask informative questions and elicit crucial information to achieve successful alignment with at least 2 times higher VQAScore (Lin et al., 2024) than the standard single-turn T2I generation. We plan to release our code and dataset: https://anonymous.4open.science/r/proactivet2iagents-6F68/README.md (anonymized code).
Paperid:1209
Authors:Angxiao Yue · Zichong Wang · Hongteng Xu
Abstract:
Protein backbone generation plays a central role in de novo protein design and is significant for many biological and medical applications. Although diffusion- and flow-based generative models provide potential solutions to this challenging task, they often generate proteins with undesired designability and suffer from computational inefficiency. In this study, we propose a novel rectified quaternion flow (ReQFlow) matching method for fast and high-quality protein backbone generation. In particular, our method generates a local translation and a 3D rotation from random noise for each residue in a protein chain, representing each 3D rotation as a unit quaternion and constructing its flow by spherical linear interpolation (SLERP) in an exponential format. We train the model by quaternion flow (QFlow) matching with guaranteed numerical stability and rectify the QFlow model to accelerate its inference and improve the designability of generated protein backbones, leading to the proposed ReQFlow model. Experiments show that ReQFlow achieves state-of-the-art performance in protein backbone generation while requiring much fewer sampling steps and significantly less inference time (e.g., being 37$\times$ faster than RFDiffusion and 62$\times$ faster than Genie2 when generating a backbone of length 300), demonstrating its effectiveness and efficiency.
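The quaternion flow construction mentioned in this abstract relies on spherical linear interpolation between unit quaternions. A minimal, self-contained SLERP sketch (an illustrative reconstruction, not ReQFlow's implementation) looks like this:

```python
import numpy as np

# Minimal SLERP sketch for unit quaternions, the interpolation named in the
# abstract; this is an illustrative reconstruction, not ReQFlow's code.

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions q0 and q1."""
    dot = float(np.dot(q0, q1))
    if dot < 0.0:                  # take the shorter great-circle arc
        q1, dot = -q1, -dot
    theta = np.arccos(min(dot, 1.0))
    if theta < 1e-8:               # nearly identical: linear blend is safe
        q = (1.0 - t) * q0 + t * q1
    else:
        q = (np.sin((1.0 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)
    return q / np.linalg.norm(q)   # re-normalize against round-off

identity = np.array([1.0, 0.0, 0.0, 0.0])
rot_z_90 = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
halfway = slerp(identity, rot_z_90, 0.5)   # a 45-degree rotation about z
```

Interpolating on the unit-quaternion sphere keeps every intermediate point a valid rotation, which is what makes SLERP a natural building block for rotation flows.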
Paperid:1210
Authors:Batiste Le Bars · Pierre Humbert
Abstract:
We study the question of volume optimality in split conformal regression, a topic still poorly understood in comparison to coverage control. Using the fact that the calibration step can be seen as an empirical volume minimization problem, we first derive a finite-sample upper bound on the excess volume loss of the interval returned by the classical split method. This important quantity measures the difference in length between the interval obtained with the split method and the shortest oracle prediction interval. Then, we introduce EffOrt, a methodology that modifies the learning step so that the base prediction function is selected in order to minimize the length of the returned intervals. In particular, our theoretical analysis of the excess volume loss of the prediction sets produced by EffOrt reveals the links between the learning and calibration steps, and notably the impact of the choice of the function class of the base predictor. We also introduce Ad-EffOrt, an extension of the previous method, which produces intervals whose size adapts to the value of the covariate. Finally, we evaluate the empirical performance and the robustness of our methodologies.
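For context, the calibration step that this abstract reinterprets as empirical volume minimization is, in the classical split method, a quantile computation over nonconformity scores. A hedged sketch follows, where the linear base predictor and synthetic data are illustrative assumptions:

```python
import numpy as np

# Hedged sketch of classical split conformal calibration: the interval
# half-width is an empirical quantile of absolute residuals. The linear
# base predictor and synthetic data are illustrative assumptions.

def split_conformal_interval(predict, X_cal, y_cal, x_new, alpha=0.1):
    """Prediction interval with a (1 - alpha) marginal coverage guarantee."""
    scores = np.abs(y_cal - predict(X_cal))             # nonconformity scores
    k = int(np.ceil((len(scores) + 1) * (1 - alpha)))   # conformal rank
    q = np.sort(scores)[min(k, len(scores)) - 1]        # calibrated half-width
    center = predict(np.asarray([x_new]))[0]
    return center - q, center + q

rng = np.random.default_rng(0)
X_cal = rng.uniform(-1.0, 1.0, 500)
y_cal = 2.0 * X_cal + rng.normal(0.0, 0.1, 500)         # noisy linear data
lo, hi = split_conformal_interval(lambda X: 2.0 * X, X_cal, y_cal, 0.5)
```

Note that the interval length is entirely driven by the residuals of the base predictor on the calibration set, which is exactly the lever the learning-step modification in the abstract targets.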
Paperid:1211
Authors:Alberto Bernacchia
Abstract:
Second-order optimization methods, which leverage the local curvature of the loss function, have the potential to dramatically accelerate the training of machine learning models. However, these methods are often hindered by the computational burden of constructing and inverting large curvature matrices with $\mathcal{O}(p^2)$ elements, where $p$ is the number of parameters. In this work, we present a theoretical framework for exact computation of the global curvature by leveraging the intrinsic symmetries of neural networks, such as invariance under parameter permutations. For Multi-Layer Perceptrons (MLPs), our approach reveals that the global curvature can be expressed in terms of $\mathcal{O}(d^2 + L^2)$ independent values, where $d$ is the number of input/output dimensions and $L$ is the number of layers, significantly reducing the computational burden compared to the $\mathcal{O}(p^2)$ elements of the full matrix. These values can be estimated efficiently, enabling precise curvature computations. We demonstrate that the curvature structure is influenced by the choice of activation function, with ReLU activations resulting in more complex curvature than Tanh or linear activations. To evaluate the practical implications of our framework, we apply exact second-order optimization to synthetic data, achieving markedly faster convergence compared to traditional optimization methods. Our findings pave the way for a better understanding of the loss landscape of neural networks, and for designing more efficient training methodologies in deep learning.
Paperid:1212
Authors:Yongchao Chen · Yilun Hao · Yueying Liu · Yang Zhang · Chuchu Fan
Abstract:
Existing methods fail to effectively steer Large Language Models (LLMs) between textual reasoning and code generation, leaving symbolic computing capabilities underutilized. We introduce CodeSteer, an effective method for guiding LLM code/text generation. We construct a comprehensive benchmark SymBench comprising 37 symbolic tasks with adjustable complexity and also synthesize datasets of 12k multi-round guidance/generation trajectories and 5.5k guidance comparison pairs. We fine-tune the Llama-3-8B model with a newly designed multi-round supervised fine-tuning (SFT) and direct preference optimization (DPO). The resulting model, CodeSteerLLM, augmented with the proposed symbolic and self-answer checkers, effectively guides the code/text generation of larger models. Augmenting GPT-4o with CodeSteer raises its average performance score from 53.3 to 86.4, even outperforming the strongest existing LLMs: OpenAI o1 (82.7), o1-preview (74.8), and DeepSeek R1 (76.8) across all 37 tasks (28 seen, 9 unseen). Trained for GPT-4o, CodeSteer demonstrates superior generalizability, providing an average 41.8 performance boost on Claude, Mistral, and GPT-3.5. CodeSteer-guided LLMs fully harness symbolic computing to maintain strong performance on highly complex tasks.
Paperid:1213
Authors:Haotian Si · Changhua Pei · Dan Pei · Gaogang Xie · Jianhui LI
Abstract:
Recent advances in lightweight time series forecasting models suggest the inherent simplicity of time series forecasting tasks. In this paper, we present CMoS, a super-lightweight time series forecasting model. Instead of learning the embedding of the shapes, CMoS directly models the spatial correlations between different time series chunks. Additionally, we introduce a Correlation Mixing technique that enables the model to capture diverse spatial correlations with minimal parameters, and an optional Periodicity Injection technique to ensure faster convergence. Despite utilizing as little as 1% of the parameter count of the lightweight model DLinear, experimental results demonstrate that CMoS outperforms existing state-of-the-art models across multiple datasets. Furthermore, the learned weights of CMoS exhibit great interpretability, providing practitioners with valuable insights into temporal structures within specific application scenarios.
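The idea of directly modeling correlations between time-series chunks can be illustrated with a deliberately simple linear sketch. The chunk/horizon sizes, least-squares fit, and toy sinusoid below are assumptions for illustration, not the CMoS architecture:

```python
import numpy as np

# Deliberately simple sketch of forecasting by modeling correlations between
# time-series chunks. The chunk/horizon sizes, least-squares fit, and toy
# sinusoid are illustrative assumptions, not the CMoS architecture.

def fit_chunk_mapping(series, chunk, horizon):
    """Fit a linear map from each history chunk to the values that follow it."""
    X, Y = [], []
    for t in range(len(series) - chunk - horizon + 1):
        X.append(series[t:t + chunk])
        Y.append(series[t + chunk:t + chunk + horizon])
    W, *_ = np.linalg.lstsq(np.asarray(X), np.asarray(Y), rcond=None)
    return W                                   # shape (chunk, horizon)

t = np.arange(400)
series = np.sin(2 * np.pi * t / 24)            # toy signal with period 24
W = fit_chunk_mapping(series, chunk=48, horizon=24)
forecast = series[-48:] @ W                    # predict the next 24 steps
```

On a purely periodic signal, the learned chunk-to-future weights recover the continuation almost exactly, which hints at why such chunk-level linear structure can carry a surprising amount of forecasting power.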
Paperid:1214
Authors:Florian Peter Busch · ROSHNI KAMATH · Rupert Mitchell · Wolfgang Stammer · Kristian Kersting · Martin Mundt
Abstract:
A dataset is confounded if it is most easily solved via a spurious correlation which fails to generalize to new data. In this work, we show that, in a continual learning setting where confounders may vary in time across tasks, the challenge of mitigating the effect of confounders far exceeds the standard forgetting problem normally considered. In particular, we provide a formal description of such continual confounders and identify that, in general, spurious correlations are easily ignored when training for all tasks jointly, but it is harder to avoid confounding when they are considered sequentially. These descriptions serve as a basis for constructing a novel CLEVR-based continually confounded dataset, which we term the ConCon dataset. Our evaluations demonstrate that standard continual learning methods fail to ignore the dataset's confounders. Overall, our work highlights the challenges of confounding factors, particularly in continual learning settings, and demonstrates the need for developing continual learning methods that robustly tackle these challenges.
Paperid:1215
Authors:Kwangjun Ahn · Gagik Magakyan · Ashok Cutkosky
Abstract:
This work investigates the effectiveness of schedule-free methods, developed by A. Defazio et al. (NeurIPS 2024), in nonconvex optimization settings, inspired by their remarkable empirical success in training neural networks. Specifically, we show that schedule-free SGD achieves optimal iteration complexity for nonsmooth, non-convex optimization problems. Our proof begins with the development of a general framework for online-to-nonconvex conversion, which converts a given online learning algorithm into an optimization algorithm for nonconvex losses. Our general framework not only recovers existing conversions but also leads to two novel conversion schemes. Notably, one of these new conversions corresponds directly to schedule-free SGD, allowing us to establish its optimality. Additionally, our analysis provides valuable insights into the parameter choices for schedule-free SGD, addressing a theoretical gap that the convex theory cannot explain.
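For readers unfamiliar with schedule-free SGD, a hedged sketch of the iteration of Defazio et al. on a toy quadratic follows; the objective, learning rate, interpolation constant, and step count are illustrative assumptions, not settings from either paper:

```python
import numpy as np

# Hedged sketch of the schedule-free SGD iteration of Defazio et al. on a toy
# quadratic; the objective, learning rate, interpolation constant, and step
# count are illustrative assumptions, not settings from either paper.

def schedule_free_sgd(grad, z0, lr=0.1, beta=0.9, steps=1000):
    z = z0.copy()                          # base SGD sequence
    x = z0.copy()                          # running average (returned solution)
    for t in range(1, steps + 1):
        y = (1.0 - beta) * z + beta * x    # gradients are evaluated here
        z = z - lr * grad(y)
        x = x + (z - x) / (t + 1)          # online average of the z iterates
    return x

# Minimize f(w) = (w - 3)^2, whose gradient is 2 (w - 3).
x_star = schedule_free_sgd(lambda w: 2.0 * (w - 3.0), np.zeros(1))
```

The distinguishing feature is that no learning-rate schedule appears: the averaging of the `z` iterates plays the role a decaying schedule normally would.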
Paperid:1216
Authors:Aaditya Singh · Ted Moskovitz · Sara Dragutinović · Felix Hill · Stephanie Chan · Andrew Saxe
Abstract:
In-context learning (ICL) is a powerful ability that emerges in transformer models, enabling them to learn from context without weight updates. Recent work has established emergent ICL as a transient phenomenon that can sometimes disappear after long training times. In this work, we sought a mechanistic understanding of these transient dynamics. Firstly, we find that—after the disappearance of ICL—the asymptotic strategy is a remarkable hybrid between in-weights and in-context learning, which we term “context-constrained in-weights learning” (CIWL). CIWL is in competition with ICL, and eventually replaces it as the dominant strategy of the model (thus leading to ICL transience). However, we also find that the two competing strategies actually share sub-circuits, which gives rise to cooperative dynamics as well. For example, in our setup, ICL is unable to emerge quickly on its own, and can only be enabled through the simultaneous slow development of asymptotic CIWL. CIWL thus both cooperates and competes with ICL, a phenomenon we term “strategy coopetition”. We propose a minimal mathematical model that reproduces these key dynamics and interactions. Informed by this model, we were able to identify a setup where ICL is truly emergent and persistent.
Paperid:1217
Authors:Dong-Hee Shin · Young-Han Son · Hyun Jung Lee · Deok-Joong Lee · Tae-Eui Kam
Abstract:
Molecular discovery has attracted significant attention in scientific fields for its ability to generate novel molecules with desirable properties. Although numerous methods have been developed to tackle this problem, most rely on an online setting that requires repeated online evaluation of candidate molecules using the oracle. However, in real-world molecular discovery, the oracle is often represented by wet-lab experiments, making this online setting impractical due to the significant time and resource demands. To fill this gap, we propose the Molecular Stitching (MolStitch) framework, which utilizes a fixed offline dataset to explore and optimize molecules without the need for repeated oracle evaluations. Specifically, MolStitch leverages existing molecules from the offline dataset to generate novel `stitched molecules' that combine their desirable properties. These stitched molecules are then used as training samples to fine-tune the generative model using preference optimization techniques. Experimental results on various offline multi-objective molecular optimization problems validate the effectiveness of MolStitch. The source code is available online.
Paperid:1218
Authors:Varshita Kolipaka · Akshit Sinha · Debangan Mishra · Sumit Kumar · Arvindh Arun · Shashwat Goel · Ponnurangam Kumaraguru
Abstract:
Graph Neural Networks (GNNs) are increasingly being used for a variety of ML applications on graph data. Because graph data does not satisfy the independently and identically distributed (*i.i.d.*) assumption, adversarial manipulations or incorrect data can propagate to other data points through message passing, which deteriorates the model's performance. To allow model developers to remove the adverse effects of manipulated entities from a trained GNN, we study the recently formulated problem of *Corrective Unlearning*. We find that current graph unlearning methods fail to unlearn the effect of manipulations even when the whole manipulated set is known. We introduce a new graph unlearning method, **Cognac**, which can unlearn the effect of the manipulation set even when only $5$% of it is identified. It recovers most of the performance of a strong oracle with fully corrected training data, even beating retraining from scratch without the deletion set, and is 8x more efficient while also scaling to large datasets. We hope our work assists GNN developers in mitigating harmful effects caused by issues in real-world data, post-training.
Paperid:1219
Authors:Xiaoyu Li · Zhao Song · Shenghao Xie
Abstract:
The Fourier transform is a fundamental tool in theoretical computer science and signal processing. In particular, when the signal is sparse in the frequency domain---having only $k$ distinct frequencies---sparse Fourier transform (SFT) algorithms can recover the signal in sublinear time (proportional to the sparsity $k$). Most prior research focused on SFT for discrete signals, designing both randomized and deterministic algorithms for one-dimensional and high-dimensional discrete signals. However, SFT for continuous signals (i.e., $x^*(t)=\sum_{j=1}^k v_j e^{2\pi \mathbf{i} f_j t}$ for $t\in [0,T]$) is a more challenging task. Note that the discrete SFT algorithms are not directly applicable to continuous signals due to the sparsity blow-up from the discretization. In seminal work, Price and Song (FOCS 2015) proposed the first randomized algorithm that achieves an $\ell_2$ recovery guarantee in $O(k\cdot \mathrm{polylog}(FT/\eta))$ time, where $F$ is the band-limit of the frequencies and $\eta$ is the frequency gap. Nevertheless, whether we can solve this problem without using randomness remains open. In this work, we address this gap and introduce the first sublinear-time deterministic sparse Fourier transform algorithm in the continuous setting. Specifically, our algorithm uses $O(k^2 \cdot \mathrm{polylog}(FT/\eta))$ samples and $O(k^2 \cdot \mathrm{polylog}(FT/\eta))$ time to reconstruct the on-grid signal with arbitrary noise in the $\ell_1/\ell_2$ mixed regime. This is the optimal recovery guarantee that can be achieved by any deterministic approach.
Paperid:1220
Authors:Srinath Dama · Kevin L Course · Prasanth B Nair
Abstract:
We present an operator-theoretic framework for time-series forecasting that involves learning a continuous time-shift operator associated with temporal and spatio-temporal problems. The time-shift operator learning paradigm offers a continuous relaxation of the discrete lag factor used in traditional autoregressive models, enabling the history of a function up to a given time to be mapped to its future values. To parametrize the operator learning problem, we propose Khatri-Rao neural operators -- a new architecture for defining non-stationary integral transforms which achieves almost linear cost on spatial and spatio-temporal problems. The proposed KRNO integral transform layer can be viewed as a continuous analog of upsampling and downsampling layers used in convolutional neural networks. From a practical perspective, the advancements made in this work allow us to forecast at super-resolution in both space and time, while handling observations that are sampled with a moderate level of irregularity. Detailed numerical studies across a wide range of temporal and spatio-temporal benchmark problems suggest that the proposed approach is highly scalable and compares favourably with state-of-the-art methods.
Paperid:1221
Authors:Xin Zou · Yizhou WANG · Yibo Yan · Yuanhuiyi Lyu · Kening Zheng · Sirui Huang · Junkai Chen · Peijie Jiang · Jia Liu · Chang Tang · Xuming Hu
Abstract:
Despite their impressive capabilities, Multimodal Large Language Models (MLLMs) are prone to hallucinations, i.e., generated content that is nonsensical or unfaithful to input sources. Unlike in LLMs, hallucinations in MLLMs often stem from the sensitivity of the text decoder to visual tokens, leading to a phenomenon akin to "amnesia" about visual information. To address this issue, we propose MemVR, a novel decoding paradigm inspired by common cognition: when the memory of an image seen the moment before is forgotten, people will look at it again for factual answers. Following this principle, we treat visual tokens as supplementary evidence, reinjecting them into the MLLM through the Feed Forward Network (FFN) as “key-value memory” at the middle trigger layer. This look-twice mechanism occurs when the model exhibits high uncertainty during inference, effectively enhancing factual alignment. Comprehensive experimental evaluations demonstrate that MemVR significantly mitigates hallucination across various MLLMs and excels in general benchmarks without incurring additional time overhead. Importantly, MemVR is a plug-and-play and task-agnostic method compatible with any MLLM, without extra training, emphasizing its widespread applicability.
Paperid:1222
Authors:Edoardo Urettini · Antonio Carta
Abstract:
Online Continual Learning (OCL) models continuously adapt to non-stationary data streams, usually without task information. These settings are complex and many traditional CL methods fail, while online methods (mainly replay-based) suffer from instabilities after the task shift. To address this issue, we formalize replay-based OCL as a second-order online joint optimization with explicit KL-divergence constraints on replay data. We propose Online Curvature-Aware Replay (OCAR) to solve the problem: a method that leverages second-order information of the loss using a K-FAC approximation of the Fisher Information Matrix (FIM) to precondition the gradient. The FIM acts as a stabilizer to prevent forgetting while also accelerating the optimization in non-interfering directions. We show how to adapt the estimation of the FIM to a continual setting, stabilizing second-order optimization for non-iid data and uncovering the role of the Tikhonov regularization in the stability-plasticity tradeoff. Empirical results show that OCAR outperforms state-of-the-art methods on continual metrics, achieving higher average accuracy throughout the training process in three different benchmarks.
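The K-FAC preconditioning with Tikhonov damping that OCAR builds on can be sketched as follows for a single linear layer; the pi-scaled damping split and the toy matrices are illustrative assumptions, not OCAR's exact update:

```python
import numpy as np

# Illustrative sketch of K-FAC-style preconditioning with Tikhonov damping for
# one linear layer. The pi-scaled damping split and the toy matrices are
# assumptions for demonstration, not OCAR's exact update.

def kfac_precondition(grad_W, A, G, damping=1e-2):
    """Approximately apply the damped inverse Fisher to a weight gradient.

    A: covariance of layer inputs, G: covariance of output gradients; the
    Fisher block is approximated by their Kronecker product, so the damped
    inverse is applied factor-wise as G_d^{-1} grad_W A_d^{-1}.
    """
    # Split the damping between the two factors, balanced by average traces.
    pi = np.sqrt((np.trace(A) / A.shape[0]) / (np.trace(G) / G.shape[0]))
    A_d = A + pi * np.sqrt(damping) * np.eye(A.shape[0])
    G_d = G + (np.sqrt(damping) / pi) * np.eye(G.shape[0])
    # Two linear solves instead of inverting the full Kronecker product.
    return np.linalg.solve(G_d, np.linalg.solve(A_d.T, grad_W.T).T)

grad_W = np.arange(6.0).reshape(2, 3)      # toy (out=2, in=3) gradient
precond = kfac_precondition(grad_W, np.eye(3), np.eye(2), damping=0.04)
```

The Tikhonov term keeps both factor inverses well-conditioned, which is precisely the knob the abstract identifies in the stability-plasticity tradeoff.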
Paperid:1223
Authors:Chunyu Xie · Bin Wang · Fanjing Kong · Jincheng Li · Dawei Liang · Gengshen Zhang · Dawei Leng · Yuhui Yin
Abstract:
Contrastive Language-Image Pre-training (CLIP) excels in multimodal tasks such as image-text retrieval and zero-shot classification but struggles with fine-grained understanding due to its focus on coarse-grained short captions. To address this, we propose Fine-Grained CLIP (FG-CLIP), which enhances fine-grained understanding through three key innovations. First, we leverage large multimodal models to generate 1.6 billion long caption-image pairs for capturing global-level semantic details. Second, a high-quality dataset is constructed with 12 million images and 40 million region-specific bounding boxes aligned with detailed captions to ensure precise, context-rich representations. Third, 10 million hard fine-grained negative samples are incorporated to improve the model's ability to distinguish subtle semantic differences. Corresponding training methods are meticulously designed for these data. Extensive experiments demonstrate that FG-CLIP outperforms the original CLIP and other state-of-the-art methods across various downstream tasks, including fine-grained understanding, open-vocabulary object detection, image-text retrieval, and general multimodal benchmarks. These results highlight FG-CLIP's effectiveness in capturing fine-grained image details and improving overall model performance. The related data, code, and models will be publicly released.
Paperid:1224
Authors:Xiangheng Wang · Ziquan Fang · Chenglong Huang · Danlei Hu · Lu Chen · Yunjun Gao
Abstract:
Trajectory representation learning aims to transform raw trajectory data into compact and low-dimensional vectors that are suitable for downstream analysis. However, most existing methods adopt either a free-space view or a road-network view during the learning process, which limits their ability to capture the complex, multi-view spatiotemporal features inherent in trajectory data. Moreover, these approaches rely on task-specific model training, restricting their generalizability and effectiveness for diverse analysis tasks. To this end, we propose GTR, a general, multi-view, and dynamic Trajectory Representation framework built on a pre-train and fine-tune architecture. Specifically, GTR introduces a multi-view encoder that captures the intrinsic multi-view spatiotemporal features. Based on the pre-train and fine-tune architecture, we provide spatio-temporal fusion pre-training with a spatio-temporal mixture of experts to dynamically combine spatial and temporal features, enabling seamless adaptation to diverse trajectory analysis tasks. Furthermore, we propose an online frozen-hot updating strategy to efficiently update the representation model, accommodating the dynamic nature of trajectory data. Extensive experiments on two real-world datasets demonstrate that GTR consistently outperforms 12 state-of-the-art methods across 6 mainstream trajectory analysis tasks. All source code and data are available at https://anonymous.4open.science/r/GTRICML.
Paperid:1225
Authors:Sho Takemori · Yuhei Umeda · Aditya Gopalan
Abstract:
This paper studies a pure exploration problem with linear bandit feedback on continuous arm sets, aiming to identify an $\epsilon$-optimal arm with high probability. Previous approaches for continuous arm sets have employed instance-independent methods due to technical challenges such as the infinite dimensionality of the space of probability measures and the non-smoothness of the objective function. This paper proposes a novel, tractable algorithm that addresses these challenges by leveraging a reparametrization of the sampling distribution and projected subgradient descent. However, this approach introduces new challenges related to the projection and reconstruction of the distribution from the reparametrization. We address these by focusing on the connection to the approximate Carath\'eodory problem. Compared to the original optimization problem on the infinite-dimensional space, our method is tractable, requiring only the solution of quadratic and fractional quadratic problems on the arm set. We establish an instance-dependent optimality for our method, and empirical results on synthetic data demonstrate its superiority over existing instance-independent baselines.
Paperid:1226
Authors:Lingyu Li · Yixu Wang · Haiquan Zhao · Shuqi Kong · Yan Teng · Chunbo Li · Yingchun Wang
Abstract:
With large language models (LLMs) increasingly deployed as cognitive engines for AI agents, the reliability and effectiveness of such deployments critically hinge on their intrinsic agency, which remains understudied. Agency, the ability to flexibly reason, learn, and adapt in dynamic environments, represents a cognitive-level capability independent of specific external modules or applications. We characterize the cognitive process underlying human agency, which unfolds in seven interrelated dimensions: prediction, decision-making, perception, memory, counterfactual thinking, belief updating, and meta-reflection. To rigorously evaluate LLMs' agency, we propose Reflection-Bench, a novel contamination-free benchmark consisting of seven parameterized cognitive tests. Through comprehensive evaluation of 16 models using three prompting strategies under entry-level difficulties, we identify a clear three-tier performance hierarchy and significant limitations particularly in meta-reflection capabilities. While state-of-the-art LLMs demonstrate rudimentary signs of intrinsic agency, our findings suggest several promising research directions including enhancing core cognitive functions, improving cross-functional coordination, and developing adaptive processing mechanisms.
Paperid:1227
Authors:Zixiang Chen · Greg Yang · Qingyue Zhao · Quanquan Gu
Abstract:
Despite deep neural networks' powerful representation learning capabilities, theoretical understanding of how networks can simultaneously achieve meaningful feature learning and global convergence remains elusive. Existing approaches like the neural tangent kernel (NTK) are limited because features stay close to their initialization in this parametrization, leaving open questions about feature properties during substantial evolution. In this paper, we investigate the training dynamics of infinitely wide, $L$-layer neural networks using the tensor program (TP) framework. Specifically, we show that, when trained with stochastic gradient descent (SGD) under the Maximal Update parametrization ($\mu$P) and mild conditions on the activation function, SGD enables these networks to learn linearly independent features that substantially deviate from their initial values. This rich feature space captures relevant data information and ensures that any convergent point of the training process is a global minimum. Our analysis leverages both the interactions among features across layers and the properties of Gaussian random variables, providing new insights into deep representation learning. We further validate our theoretical findings through experiments on real-world datasets.
Paperid:1228
Authors:Yegon Kim · Hyunsu Kim · Gyeonghoon Ko · Juho Lee
Abstract:
Accurately solving partial differential equations (PDEs) is critical to understanding complex scientific and engineering phenomena, yet traditional numerical solvers are computationally expensive. Surrogate models offer a more efficient alternative, but their development is hindered by the cost of generating sufficient training data from numerical solvers. In this paper, we present a novel framework for active learning (AL) in PDE surrogate modeling that reduces this cost. Unlike the existing AL methods for PDEs that always acquire entire PDE trajectories, our approach strategically generates only the most important time steps with the numerical solver, while employing the surrogate model to approximate the remaining steps. This dramatically reduces the cost incurred by each trajectory and thus allows the active learning algorithm to try out a more diverse set of trajectories given the same budget. To accommodate this novel framework, we develop an acquisition function that estimates the utility of a set of time steps by approximating its resulting variance reduction. We demonstrate the effectiveness of our method on several benchmark PDEs, including the Burgers' equation, Korteweg–De Vries equation, Kuramoto–Sivashinsky equation, the incompressible Navier-Stokes equation, and the compressible Navier-Stokes equation. Experiments show that our approach improves performance by large margins, outperforming the best existing method by up to a factor of five. Our method not only reduces average error but also the 99\%, 95\%, and 50\% quantiles of error, which is rare for an AL algorithm. All in all, our approach offers a data-efficient solution to surrogate modeling for PDEs.
Paperid:1229
Authors:Juncheng Dong · Moyang Guo · Ethan Fang · Zhuoran Yang · Vahid Tarokh
Abstract:
Transformer models have achieved remarkable empirical successes, largely due to their in-context learning capabilities. Inspired by this, we explore training an autoregressive transformer for in-context reinforcement learning (ICRL). In this setting, we initially train a transformer on an offline dataset consisting of trajectories collected from various RL tasks, and then fix and use this transformer to create an action policy for new RL tasks. Notably, we consider the setting where the offline dataset contains trajectories sampled from suboptimal behavioral policies. In this case, standard autoregressive training corresponds to imitation learning and results in suboptimal performance. To address this, we propose the Decision Importance Transformer (DIT), which emulates the actor-critic algorithm in an in-context manner. In particular, we first train a transformer-based value function that estimates the advantage functions of the behavior policies that collected the suboptimal trajectories. Then we train a transformer-based policy via a weighted maximum likelihood estimation loss, where the weights are constructed based on the trained value function to steer the suboptimal policies to the optimal ones. We conduct extensive experiments to test the performance of DIT on both bandit and Markov Decision Process problems. Our results show that DIT achieves superior performance, particularly when the offline dataset contains suboptimal historical data.
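The advantage-weighted maximum-likelihood objective described in this abstract can be sketched in a few lines; the softmax policy, temperature, and toy batch are illustrative assumptions, not DIT's transformer-based setup:

```python
import numpy as np

# Hedged sketch of an advantage-weighted maximum-likelihood loss: actions
# with higher estimated advantage receive larger likelihood weights. The
# softmax policy, temperature, and toy batch are illustrative assumptions.

def advantage_weighted_nll(logits, actions, advantages, temperature=1.0):
    """Weighted negative log-likelihood that favors high-advantage actions."""
    weights = np.exp(advantages / temperature)
    weights /= weights.sum()                          # normalize the weights
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    chosen = log_probs[np.arange(len(actions)), actions]
    return -(weights * chosen).sum()

logits = np.zeros((2, 2))                 # uniform policy over 2 actions
actions = np.array([0, 1])
advantages = np.array([0.0, 5.0])         # second transition is much better
loss_uniform = advantage_weighted_nll(logits, actions, advantages)
```

Because the weights concentrate on high-advantage transitions, lowering this loss pushes the policy toward the better actions in the suboptimal dataset rather than imitating it uniformly.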
Paperid:1230
Authors:Jacopo Graldi · Alessandro Breccia · Giulia Lanzillotta · Thomas Hofmann · Lorenzo Noci
Abstract:
Despite recent efforts, neural networks still struggle to learn in non-stationary environments, and our understanding of catastrophic forgetting is far from complete. In this work, we perform a systematic study on the impact of model scale and the degree of feature learning in continual learning. We reconcile existing contradictory observations on scale in the literature by differentiating between lazy and rich training regimes through a variable parametrization of the architecture. We show that increasing model width is only beneficial when it reduces the amount of feature learning, yielding more laziness. Using the framework of dynamical mean field theory, we then study the infinite width dynamics of the model in the feature learning regime and characterize catastrophic forgetting, extending prior theoretical results limited to the lazy regime. We study the intricate relationship between feature learning, task non-stationarity, and forgetting, finding that high feature learning is only beneficial with highly similar tasks. Furthermore, we identify a transition - modulated by task similarity - where the model exits an effectively lazy regime with low forgetting to enter a rich regime with significant forgetting. We term this transition region the Edge of Laziness. Finally, our findings reveal that neural networks achieve optimal performance at a critical level of feature learning, which depends on task non-stationarity and transfers across model scales. This work provides a unified perspective on the role of scale and feature learning in continual learning.
Paperid:1231
Authors:Arya Fayyazi · Mehdi Kamal · Massoud Pedram
Abstract:
We propose FACTER, a fairness-aware framework for LLM-based recommendation systems that integrates conformal prediction with dynamic prompt engineering. By introducing an adaptive semantic variance threshold and a violation-triggered mechanism, FACTER automatically tightens fairness constraints whenever biased patterns emerge. We further develop an adversarial prompt generator that leverages historical violations to reduce repeated demographic biases without retraining the LLM. Empirical results on MovieLens and Amazon show that FACTER substantially reduces fairness violations (up to 95.5%) while maintaining strong recommendation accuracy, revealing semantic variance as a potent proxy of bias.
Paperid:1232
Authors:Harry Mead · Clarissa Costen · Bruno Lacerda · Nick Hawes
Abstract:
When optimising for conditional value at risk (CVaR) using policy gradients (PG), current methods rely on discarding a large proportion of trajectories, resulting in poor sample efficiency. We propose a reformulation of the CVaR optimisation problem by capping the total return of trajectories used in training, rather than simply discarding them, and show that this is equivalent to the original problem if the cap is set appropriately. We show, with empirical results in a number of environments, that this reformulation of the problem results in consistently improved performance compared to baselines.
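The capping idea above can be illustrated with a small numerical sketch (the helper names below are invented for illustration, not the authors' code): when the cap is set at the empirical VaR, the mean of capped returns differs from CVaR only by an additive constant, so every trajectory contributes to the objective and none needs to be discarded.

```python
# Hedged sketch: why capping returns at the alpha-quantile (VaR) is
# equivalent, up to an additive constant, to optimising CVaR directly.
def var_cvar(returns, alpha):
    """Empirical VaR (alpha-quantile) and CVaR (mean of the worst alpha-fraction)."""
    sorted_r = sorted(returns)
    k = max(1, int(alpha * len(sorted_r)))
    var = sorted_r[k - 1]
    cvar = sum(sorted_r[:k]) / k
    return var, cvar

def capped_mean(returns, cap):
    """Mean of capped returns: every trajectory contributes, none is discarded."""
    return sum(min(r, cap) for r in returns) / len(returns)

returns = [float(r) for r in range(1, 101)]  # toy trajectory returns 1..100
alpha = 0.1
var, cvar = var_cvar(returns, alpha)
# Identity: E[min(R, VaR)] = alpha * CVaR + (1 - alpha) * VaR,
# so the capped-mean objective and CVaR share the same maximiser.
assert abs(capped_mean(returns, var) - (alpha * cvar + (1 - alpha) * var)) < 1e-9
```

Because the capped mean averages over all trajectories, its gradient estimate uses every sample, which is the sample-efficiency gain the abstract describes.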
Paperid:1233
Authors:Thanh NguyenTang · Raman Arora
Abstract:
We consider the policy regret minimization problem in a dynamically evolving environment modeled as a Markov game between a learner and a strategic opponent that can adapt to the learner’s strategies. We establish a generic algorithmic framework that attains the optimal $\mathcal{O}(\sqrt{T})$ policy regret rate for a wide class of large-scale problems characterized by Eluder-type conditions, which generalize the tabular environments considered in previous work. Notably, our framework also reveals a much simpler yet effective algorithmic design for dealing with reactive adversaries, highlighting the benefit of opponent learning in reactive environments to achieve $\mathcal{O}(\sqrt{T})$ policy regret.
Paperid:1234
Authors:Peng Jin · Bo Zhu · Li Yuan · Shuicheng YAN
Abstract:
In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to reduce computational costs while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in the summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention by using only 50%$\sim$90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% by utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
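The weighted summation at the heart of MoH can be sketched as follows. The function name, the router logits, and the top-k selection rule are illustrative assumptions in the spirit of MoE routing, not the paper's exact design: each token keeps only its top-k heads and combines them with softmax weights instead of a plain sum.

```python
import math

def moh_combine(head_outputs, router_logits, k):
    """Mixture-of-Head sketch: keep the top-k heads for this token and
    combine their outputs with softmax weights (vs. plain summation)."""
    # Pick the k heads with the largest router scores.
    topk = sorted(range(len(router_logits)), key=lambda h: -router_logits[h])[:k]
    # Softmax over the selected heads only (numerically stabilised).
    m = max(router_logits[h] for h in topk)
    exps = {h: math.exp(router_logits[h] - m) for h in topk}
    z = sum(exps.values())
    # Weighted sum of the selected heads' output vectors.
    dim = len(head_outputs[0])
    out = [0.0] * dim
    for h in topk:
        w = exps[h] / z
        for i in range(dim):
            out[i] += w * head_outputs[h][i]
    return out

# Example: the low-scoring third head is dropped; the two tied heads
# are averaged with weight 0.5 each.
out = moh_combine([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]], [2.0, 2.0, -1.0], k=2)
```

Dropping heads per token is where the inference savings come from; the softmax weights are the "weighted summation" flexibility the abstract mentions.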
Paperid:1235
Authors:Yu Yuan · Shizhao Sun · Qi Liu · Jiang Bian
Abstract:
Computer Aided Design (CAD) is indispensable across various industries. \emph{Text-based CAD editing}, which automates the modification of CAD models based on textual instructions, holds great potential but remains underexplored. Existing methods primarily focus on design variation generation or text-based CAD generation, either lacking support for text-based control or neglecting existing CAD models as constraints. We introduce \emph{CAD-Editor}, the first framework for text-based CAD editing. To address the challenge of demanding triplet data with accurate correspondence for training, we propose an automated data synthesis pipeline. This pipeline utilizes design variation models to generate pairs of original and edited CAD models and employs Large Vision-Language Models (LVLMs) to summarize their differences into editing instructions. To tackle the composite nature of text-based CAD editing, we propose a locate-then-infill framework that decomposes the task into two focused sub-tasks: locating regions requiring modification and infilling these regions with appropriate edits. Large Language Models (LLMs) serve as the backbone for both sub-tasks, leveraging their capabilities in natural language understanding and CAD knowledge. Experiments show that CAD-Editor achieves superior performance both quantitatively and qualitatively.
Paperid:1236
Authors:Xingyu Zhou · Yulian Wu · Wenqian Weng · Francesco Orabona
Abstract:
In this paper, we theoretically study the offline alignment of language models with human preference feedback, under both preference label corruption and privacy protections. To this end, we propose \texttt{Square}\texttt{$\chi$PO}, a variant of \texttt{$\chi$PO} obtained by a simple one-line change: the standard log-loss is replaced by a new square loss over probability. Thanks to the nice properties inherent in this new loss, we advance the state-of-the-art of differentially private and robust alignment. Specifically, for the local model of label privacy, \texttt{Square}\texttt{$\chi$PO} is the first method to attain the optimal rate based on single-policy concentrability even with general function approximations. It also gives the first result under the central model of privacy protection over both prompts (responses) and labels. On the robustness side against Huber label corruption, \texttt{Square}\texttt{$\chi$PO} is the first alignment method with a meaningful theoretical guarantee under general function approximations. More importantly, \texttt{Square}\texttt{$\chi$PO} can address privacy protection and corruption \emph{simultaneously}, where an interesting separation is observed, implying that the order of privacy and corruption matters. Furthermore, we show that \texttt{Square}\texttt{$\chi$PO} can also be easily extended to handle the scenario of the general preference model with state-of-the-art guarantees under corruption and privacy. Last but not least, all of our theoretical guarantees enjoy a unified analysis, building upon a new result on the generalization error bounds of least-square regression under corruption and privacy constraints, which we believe is of independent interest to the community.
Paperid:1237
Authors:Stéphane Aroca-Ouellette · Natalie Mackraz · Barry-John Theobald · Katherine Metcalf
Abstract:
Accommodating human preferences is essential for creating aligned LLM agents that deliver personalized and effective interactions. Recent work has shown the potential for LLMs acting as writing agents to infer a description of user preferences. Agent alignment then comes from conditioning on the inferred preference description. However, existing methods often produce generic preference descriptions that fail to capture the unique and individualized nature of human preferences. This paper introduces PROSE, a method designed to enhance the precision of preference descriptions inferred from user writing samples. PROSE incorporates two key elements: (1) iterative refinement of inferred preferences, and (2) verification of inferred preferences across multiple user writing samples. We evaluate PROSE with several LLMs (i.e., Qwen2.5 7B and 72B Instruct, GPT-4o mini, and GPT-4o) on a summarization and an email writing task. We find that PROSE more accurately infers nuanced human preferences, improving the quality of the writing agent's generations over CIPHER (a state-of-the-art method for inferring preferences) by 33\%. Lastly, we demonstrate that ICL and PROSE are complementary methods, and combining them provides up to a 9\% improvement over ICL alone.
Paperid:1238
Authors:Jumman Hossain · Nirmalya Roy
Abstract:
Reinforcement learning (RL) in environments with asymmetric traversal costs---where actions such as uphill vs. downhill motion incur direction-dependent penalties---remains a critical challenge for real-world applications. While quasimetric RL methods have relaxed symmetry assumptions, they fail to model path-dependent costs or provide safety guarantees. We propose Quasi-Potential Reinforcement Learning (QPRL), a framework that decomposes asymmetric costs into path-independent potentials and path-dependent residuals, enabling Lyapunov-stable policy optimization. Theoretically, we prove QPRL converges to an optimal policy with $\tilde{\mathcal{O}}(\sqrt{T})$ sample complexity, improving over prior quasimetric RL's $\tilde{\mathcal{O}}(T)$ bounds. Algorithmically, QPRL introduces a Lyapunov-based recovery mechanism that reduces irreversible constraint violations by $\mathbf{4\times}$ compared to baselines. Empirically, QPRL achieves state-of-the-art performance in diverse environments. Our results demonstrate that explicitly modeling path-dependent costs via quasi-potential decomposition enables safer, more efficient RL in complex navigation tasks with asymmetric costs.
Paperid:1239
Authors:Thieu Vo · Viet Hoang Tran · Tho Tran Huu · An Nguyen The · Thanh Tran · Minh-Khoi Nguyen-Nhat · Duy-Tung Pham · Tan Nguyen
Abstract:
A neural functional network (NFN) is a specialized type of neural network designed to process and learn from entire neural networks as input data. Recent NFNs have been proposed with permutation and scaling equivariance based on either graph-based message-passing mechanisms or parameter-sharing mechanisms. Compared to graph-based models, parameter-sharing-based NFNs built upon equivariant linear layers exhibit lower memory consumption and faster running time. However, their expressivity is limited due to the large size of the symmetric group of the input neural networks. The challenge of designing a permutation and scaling equivariant NFN that maintains low memory consumption and running time while preserving expressivity remains unresolved. In this paper, we propose a novel solution with the development of MAGEP-NFN (Monomial mAtrix Group Equivariant Polynomial NFN). Our approach follows the parameter-sharing mechanism but differs from previous works by constructing a nonlinear equivariant layer represented as a polynomial in the input weights. This polynomial formulation enables us to incorporate additional relationships between weights from different input hidden layers, enhancing the model's expressivity while keeping memory consumption and running time low, thereby addressing the aforementioned challenge. We provide empirical evidence demonstrating that MAGEP-NFN achieves competitive performance and efficiency compared to existing baselines.
Paperid:1240
Authors:Kelly Ramsay · Jairo Diaz-Rodriguez
Abstract:
Despite the potential of differentially private data visualization to harmonize data analysis and privacy, research in this area remains underdeveloped. Boxplots are a widely popular visualization used for summarizing a dataset and for comparison of multiple datasets. Consequently, we introduce a differentially private boxplot. We evaluate its effectiveness for displaying location, scale, skewness and tails of a given empirical distribution. In our theoretical exposition, we show that the location and scale of the boxplot are estimated with optimal sample complexity, and the skewness and tails are estimated consistently, which is not always the case for a boxplot naively constructed from a single existing differentially private quantile algorithm. As a byproduct of this exposition, we introduce several new results concerning private quantile estimation. In simulations, we show that this boxplot performs similarly to a non-private boxplot, and it outperforms the naive boxplot. Additionally, we conduct a real data analysis of Airbnb listings, which shows that comparable analysis can be achieved through differentially private boxplot visualization.
Paperid:1241
Authors:Edward Milsom · Ben Anson · Laurence Aitchison
Abstract:
We consider layerwise function-space learning rates, which measure the magnitude of the change in a neural network's output function in response to an update to a parameter tensor. This contrasts with traditional learning rates, which describe the magnitude of changes in parameter space. We develop efficient methods to measure and set function-space learning rates in arbitrary neural networks, requiring only minimal computational overhead through a few additional backward passes that can be performed at the start of, or periodically during, training. We demonstrate two key applications: (1) analysing the dynamics of standard neural network optimisers in function space, rather than parameter space, and (2) introducing FLeRM (Function-space Learning Rate Matching), a novel approach to hyperparameter transfer across model scales. FLeRM records function-space learning rates while training a small, cheap base model, then automatically adjusts parameter-space layerwise learning rates when training larger models to maintain consistent function-space updates. FLeRM gives hyperparameter transfer across model width, depth, initialisation scale, and LoRA rank in various architectures including MLPs with residual connections and transformers with different layer normalisation schemes.
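The notion of a function-space learning rate can be illustrated on a toy model. The helper name and the RMS-over-probe-inputs definition below are illustrative assumptions, not FLeRM's actual estimator: the point is that the same parameter-space step can move the output function by very different amounts.

```python
def function_space_change(f, params, delta, xs):
    """Sketch: measure the 'function-space' size of a parameter update
    `delta` as the RMS change in the model's outputs over probe inputs
    `xs` (illustrative definition, not the paper's exact method)."""
    before = [f(params, x) for x in xs]
    after = [f([p + d for p, d in zip(params, delta)], x) for x in xs]
    n = len(xs)
    return (sum((a - b) ** 2 for a, b in zip(after, before)) / n) ** 0.5

# Toy 1-D linear "network": f(x) = w * x.
f = lambda params, x: params[0] * x

# The same parameter-space step (0.1) causes a 10x larger function-space
# change when the inputs are 10x larger:
small = function_space_change(f, [1.0], [0.1], [1.0, -1.0])
large = function_space_change(f, [1.0], [0.1], [10.0, -10.0])
```

Matching this function-space quantity across model scales, rather than the raw parameter-space step size, is the intuition behind the hyperparameter transfer described above.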
Paperid:1242
Authors:Xinyue Chen · Jinfeng Peng · Yuhao Li · Xiaorong Pu · Yang Yang · Yazhou Ren
Abstract:
Recently, federated multi-view clustering (FedMVC) has gained attention for its ability to mine complementary clustering structures from multiple clients without exposing private data. Existing methods mainly focus on addressing the feature heterogeneity problem brought by views on different clients and mitigating it using shared client information. Although these methods have achieved performance improvements, the information they choose to share, such as model parameters or intermediate outputs, inevitably raises privacy concerns. In this paper, we propose an Effective and Secure Federated Multi-view Clustering method, ESFMC, to alleviate the dilemma between privacy protection and performance improvement. This method leverages the information-theoretic perspective to split the features extracted locally by clients, retaining sensitive information locally and only sharing features that are highly relevant to the task. This can be viewed as a form of privacy-preserving information sharing, reducing privacy risks for clients while ensuring that the server can mine high-quality global clustering structures. Additionally, we extend the proposed method to incomplete scenarios by designing a collaborative alignment strategy that ensures the consistency of locally shared information across clients from the global perspective. Theoretical analysis and extensive experiments demonstrate that the proposed method more effectively mitigates the trade-off between privacy protection and performance improvement compared to state-of-the-art methods.
Paperid:1243
Authors:Bo Yue · Jian Li · Guiliang Liu
Abstract:
Optimizing objective functions subject to constraints is fundamental in many real-world applications. However, these constraints are often not readily defined and must be inferred from expert agent behaviors, a problem known as Inverse Constraint Inference. Inverse Constrained Reinforcement Learning (ICRL) is a common solver for recovering feasible constraints in complex environments, relying on training samples collected from interactive environments. However, the efficacy and efficiency of current sampling strategies remain unclear. We propose a strategic exploration framework for sampling with guaranteed efficiency to bridge this gap. By defining the feasible cost set for ICRL problems, we analyze how estimation errors in transition dynamics and the expert policy influence the feasibility of inferred constraints. Based on this analysis, we introduce two exploratory algorithms to achieve efficient constraint inference via 1) dynamically reducing the bounded aggregate error of cost estimations or 2) strategically constraining the exploration policy around plausibly optimal ones. Both algorithms are theoretically grounded with tractable sample complexity, and their performance is validated empirically across various environments.
Paperid:1244
Authors:Yuzhou Gu · Zhao Song · Junze Yin
Abstract:
Softmax distributions are widely used in machine learning, including Large Language Models (LLMs) where the attention unit uses softmax distributions. We abstract the attention unit as the softmax model, where given a vector input, the model produces an output drawn from the softmax distribution (which depends on the vector input). We consider the fundamental problem of binary hypothesis testing in the setting of softmax models. That is, given an unknown softmax model, which is known to be one of the two given softmax models, how many queries are needed to determine which one is the truth? We show that the sample complexity is asymptotically $O(\epsilon^{-2})$, where $\epsilon$ is a certain distance between the parameters of the models. Furthermore, we draw an analogy between the softmax model and the leverage score model, an important tool for algorithm design in linear algebra and graph theory. The leverage score model, on a high level, is a model which, given vector input, produces an output drawn from a distribution dependent on the input. We obtain similar results for the binary hypothesis testing problem for leverage score models.
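A minimal sketch of the testing setup (the function names and the plain log-likelihood-ratio decision rule are assumptions for illustration; the paper's contribution is characterizing how many queries such a test needs): query the unknown model, accumulate the log-likelihood ratio between the two hypotheses, and pick the sign.

```python
import math, random

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    z = sum(e)
    return [v / z for v in e]

def sample(probs, rng):
    """Draw one index from a categorical distribution."""
    u, c = rng.random(), 0.0
    for i, p in enumerate(probs):
        c += p
        if u < c:
            return i
    return len(probs) - 1

def decide(logits_a, logits_b, true_logits, n, seed=0):
    """Query the unknown softmax model n times; return the hypothesis
    with the larger accumulated log-likelihood."""
    rng = random.Random(seed)
    pa, pb = softmax(logits_a), softmax(logits_b)
    pt = softmax(true_logits)
    llr = 0.0
    for _ in range(n):
        x = sample(pt, rng)
        llr += math.log(pa[x]) - math.log(pb[x])
    return "A" if llr > 0 else "B"
```

The farther apart the two parameter vectors are (the $\epsilon$ in the abstract), the fewer queries this kind of test needs, which is the intuition behind the $O(\epsilon^{-2})$ scaling.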
Paperid:1245
Authors:Masahiro Fujisawa · Futoshi Futami
Abstract:
Nonparametric estimation with binning is widely used for evaluating calibration error and recalibrating machine learning models. However, existing theoretical analyses of the bias induced by binning are limited to binary classification, creating a significant gap with practical applications such as multiclass classification. Additionally, many recalibration algorithms lack theoretical guarantees for their generalization performance. To address these issues, we conduct a generalization analysis of calibration error using the probably approximately correct Bayes framework. This approach enables us to derive the first optimizable upper bound for generalization error in the calibration context. On the basis of our theory, we propose a generalization-aware recalibration algorithm. Numerical experiments show that our algorithm enhances the performance of Gaussian process-based recalibration across various benchmark datasets and models.
Paperid:1246
Authors:Fengbin Guan · Xin Li · Zihao Yu · Yiting Lu · Zhibo Chen
Abstract:
In this work, we take the first exploration of the recently popular foundation model, i.e., State Space Model/Mamba, in image quality assessment (IQA), aiming at observing and excavating the perception potential in vision Mamba. A series of works on Mamba has shown its significant potential in various fields, e.g., segmentation and classification. However, the perception capability of Mamba has been underexplored. Consequently, we propose QMamba by revisiting and adapting the Mamba model for three crucial IQA tasks, i.e., task-specific, universal, and transferable IQA, which reveals that the Mamba model has obvious advantages compared with existing foundational models, e.g., Swin Transformer, ViT, and CNNs, in terms of perception and computational cost for IQA. To increase the transferability of QMamba, we propose the StylePrompt tuning paradigm, where the basic lightweight mean and variance prompts are injected to assist the task-adaptive transfer learning of pre-trained QMamba for different downstream IQA tasks. Compared with existing prompt tuning strategies, our proposed StylePrompt enables better perception transfer capability with less computational cost. Extensive experiments on multiple synthetic, authentic IQA datasets, and cross IQA datasets have demonstrated the effectiveness of our proposed QMamba.
Paperid:1247
Authors:Yoon Hyeok Lee · Jaemin Park · Taejin Paik · Doyun Kim · Bosun Hwang
Abstract:
Graph Transformers (GTs) have emerged as a powerful alternative to message-passing neural networks, yet their performance heavily depends on effectively embedding structural inductive biases. In this work, we introduce novel structural encodings (SEs) grounded in a rigorous analysis of random walks (RWs), leveraging Green and Martin kernels that we have carefully redefined for AI applications while preserving their mathematical essence. These kernels capture the long-term behavior of RWs on graphs and allow for enhanced representation of complex topologies, including non-aperiodic and directed acyclic substructures. Empirical evaluations across eight benchmark datasets demonstrate that our method achieves state-of-the-art results in seven tasks, notably in molecular and circuit domains. We attribute this performance boost to the improved ability of our kernel-based SEs to encode intricate structural information, thereby strengthening the global attention and inductive bias within GTs. This work underlines the promise of theoretically grounded, AI-ready kernels in advancing the capability and scope of Transformer architectures for graph-based learning.
Paperid:1248
Authors:Chhavi Yadav · Evan Laufer · Dan Boneh · Kamalika Chaudhuri
Abstract:
In principle, explanations are intended as a way to increase trust in machine learning models and are often obligated by regulations. However, many circumstances where these are demanded are adversarial in nature, meaning the involved parties have misaligned interests and are incentivized to manipulate explanations for their purpose. As a result, explainability methods fail to be operational in such settings despite the demand \cite{bordt2022post}. In this paper, we take a step towards operationalizing explanations in adversarial scenarios with Zero-Knowledge Proofs (ZKPs), a cryptographic primitive. Specifically we explore ZKP-amenable versions of the popular explainability algorithm LIME and evaluate their performance on Neural Networks and Random Forests. Our code is publicly available at: \url{/post/accept}
Paperid:1249
Authors:Tianqing Zhang · Kairong Yu · Qi Xu · Gang Pan · Hongwei Wang
Abstract:
Spiking Neural Networks (SNNs) are increasingly recognized for their biological plausibility and energy efficiency, positioning them as strong alternatives to Artificial Neural Networks (ANNs) in neuromorphic computing applications. SNNs inherently process temporal information by leveraging the precise timing of spikes, but balancing temporal feature utilization with low energy consumption remains a challenge. In this work, we introduce the Temporal Shift module for Spiking Neural Networks (TS-SNN), which incorporates a novel Temporal Shift (TS) module to integrate past, present, and future spike features within a single timestep via a simple yet effective shift operation. A residual combination method prevents information loss by integrating shifted and original features. The TS module is lightweight, requiring only one additional learnable parameter, and can be seamlessly integrated into existing architectures with minimal additional computational cost. TS-SNN achieves state-of-the-art performance on benchmarks like CIFAR-10 (96.72\%), CIFAR-100 (80.28\%), and ImageNet (70.61\%) with fewer timesteps, while maintaining low energy consumption. This work marks a significant step forward in developing efficient and accurate SNN architectures. The code is available on GitHub.
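A rough sketch of a temporal-shift-with-residual operation in plain Python (the `fold` fraction and the fixed blend weight are illustrative assumptions, not the paper's exact TS module): a slice of channels looks one timestep back, another slice looks one timestep ahead, and the result is blended with the original features so no information is lost.

```python
def temporal_shift(x, fold=4, alpha=0.5):
    """Sketch of temporal shift with residual on x[t][c] (T timesteps,
    C channels): the first C//fold channels take the previous timestep,
    the next C//fold take the next timestep, the rest stay in place;
    the shifted tensor is then blended with the original (residual)."""
    T, C = len(x), len(x[0])
    f = C // fold
    shifted = [row[:] for row in x]
    for t in range(T):
        for c in range(f):            # shift from the past (zero-pad at t=0)
            shifted[t][c] = x[t - 1][c] if t > 0 else 0.0
        for c in range(f, 2 * f):     # shift from the future (zero-pad at t=T-1)
            shifted[t][c] = x[t + 1][c] if t < T - 1 else 0.0
    # Residual combination of shifted and original features.
    return [[alpha * shifted[t][c] + (1 - alpha) * x[t][c] for c in range(C)]
            for t in range(T)]

out = temporal_shift([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]])
```

Each timestep's features thus mix past, present, and future information without any extra matrix multiplies, which is why the module adds so little computational cost.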
Paperid:1250
Authors:Nikita Kazeev · Wei Nong · Ignat Romanov · Ruiming Zhu · Andrey Ustyuzhanin · Shuya Yamazaki · Kedar Hippalgaonkar
Abstract:
Symmetry rules that atoms obey when they bond together to form an ordered crystal play a fundamental role in determining their physical, chemical, and electronic properties such as electrical and thermal conductivity, optical and polarization behavior, and mechanical strength. Almost all known crystalline materials have internal symmetry. Consistently generating stable crystal structures is still an open challenge, specifically because such symmetry rules are not accounted for. To address this issue, we propose WyFormer, a generative model for materials conditioned on space group symmetry. We use Wyckoff positions as the basis for an elegant, compressed, and discrete structure representation. To model the distribution, we develop a permutation-invariant autoregressive model based on the Transformer without positional encoding. WyFormer has a unique and powerful synergy of attributes, proven by extensive experimentation: best-in-class symmetry-conditioned generation, physics-motivated inductive bias, competitive stability of the generated structures, competitive material property prediction quality, and unparalleled inference speed.
Paperid:1251
Authors:Tim Vieira · Tianyu Liu · Clemente Pasti · Yahya Emara · Brian DuSell · Benjamin LeBrun · Mario Giulianelli · Juan Luis Gastaldi · John Terilla · Timothy O'Donnell · Ryan Cotterell
Abstract:
Modern language models represent probability distributions over character strings as distributions over (shorter) token strings derived via a deterministic tokenizer, such as byte-pair encoding. While this approach is highly effective at scaling up language models to large corpora, its current incarnations have a concerning property: the model assigns nonzero probability mass to an exponential number of \emph{non-canonical} token encodings of each character string—these are token strings that decode to valid character strings but are impossible under the deterministic tokenizer (i.e., they will never be seen in any training corpus, no matter how large). This misallocation is both erroneous, as non-canonical strings never appear in training data, and wasteful, diverting probability mass away from plausible outputs. These are avoidable mistakes! In this work, we propose methods to enforce canonicality in token-level language models, ensuring that only canonical token strings are assigned positive probability. We present two approaches: (1) canonicality by conditioning, leveraging test-time inference strategies without additional training, and (2) canonicality by construction, a model parameterization that guarantees canonical outputs but requires training. We show that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora.
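The canonicality condition can be made concrete with a toy deterministic tokenizer (the greedy longest-match vocabulary below is invented for illustration): a token string is canonical exactly when re-encoding its decoded text reproduces it, and enforcing canonicality means assigning zero probability to everything else.

```python
def is_canonical(tokens, encode, decode):
    """A token string is canonical iff re-encoding its decoded text
    reproduces it exactly."""
    return encode(decode(tokens)) == tokens

# Toy deterministic tokenizer: greedy longest-match over a tiny vocab.
VOCAB = ["ab", "a", "b"]  # sorted longest-first so greedy matching works

def encode(s):
    out, i = [], 0
    while i < len(s):
        for tok in VOCAB:  # every char of s is covered by some vocab entry
            if s.startswith(tok, i):
                out.append(tok)
                i += len(tok)
                break
    return out

def decode(tokens):
    return "".join(tokens)

# "ab" has two decodings to the same string, but only one is canonical:
assert is_canonical(["ab"], encode, decode)
assert not is_canonical(["a", "b"], encode, decode)
```

In a real BPE setting the same round-trip check applies, which is what makes the exponential mass on non-canonical encodings described above both detectable and avoidable.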
Paperid:1252
Authors:Kosuke Sugiyama · Masato Uchida
Abstract:
This paper addresses weak features learning (WFL), focusing on learning scenarios characterized by low-quality input features (weak features; WFs) that arise due to missingness, measurement errors, or ambiguous observations. We present a theoretical formalization and error analysis of WFL for continuous WFs (continuous WFL), which has been insufficiently explored in existing literature. A previous study established formalization and error analysis for WFL with discrete WFs (discrete WFL); however, this analysis does not extend to continuous WFs due to the inherent constraints of discreteness. To address this, we propose a theoretical framework specifically designed for continuous WFL, systematically capturing the interactions between feature estimation models for WFs and label prediction models for downstream tasks. Furthermore, we derive the theoretical conditions necessary for both sequential and iterative learning methods to achieve consistency. By integrating the findings of this study on continuous WFL with the existing theory of discrete WFL, we demonstrate that the WFL framework is universally applicable, providing a robust theoretical foundation for learning with low-quality features across diverse application domains.
Paperid:1253
Authors:Jianwei Li · Jung-Eun Kim
Abstract:
Recent studies on the safety alignment of large language models (LLMs) have revealed that existing approaches often operate superficially, leaving models vulnerable to various adversarial attacks. Despite their significance, these studies generally fail to offer actionable solutions beyond data augmentation for achieving more robust safety mechanisms. This paper identifies a fundamental cause of this superficiality: existing alignment approaches often presume that models can implicitly learn a safety-related reasoning task during the alignment process, enabling them to refuse harmful requests. However, the learned safety signals are often diluted by other competing objectives, leading models to struggle with drawing a firm safety-conscious decision boundary when confronted with adversarial attacks. Based on this observation, by explicitly introducing a safety-related binary classification task and integrating its signals with our attention and decoding strategies, we eliminate this ambiguity and allow models to respond more responsibly to malicious queries. We emphasize that, with less than 0.2x overhead cost, our approach enables LLMs to assess the safety of both the query and the previously generated tokens at each necessary generating step. Extensive experiments demonstrate that our method significantly improves the resilience of LLMs against various adversarial attacks, offering a promising pathway toward more robust generative AI systems.
Paperid:1254
Authors:Kangyu Zhu · Peng Xia · Yun Li · Hongtu Zhu · Sheng Wang · Huaxiu Yao
Abstract:
The advancement of Large Vision-Language Models (LVLMs) has propelled their application in the medical field. However, Medical LVLMs (Med-LVLMs) encounter factuality challenges due to modality misalignment, where the models prioritize textual knowledge over visual input, leading to hallucinations that contradict information in medical images. Previous attempts to enhance modality alignment in Med-LVLMs through preference optimization have inadequately addressed clinical relevance in preference data, making these samples easily distinguishable and reducing alignment effectiveness. In response, we propose MMedPO, a novel multimodal medical preference optimization approach that considers the clinical relevance of preference samples to enhance Med-LVLM alignment. MMedPO curates multimodal preference data by introducing two types of dispreference: (1) plausible hallucinations injected through target Med-LVLMs or GPT-4o to produce medically inaccurate responses, and (2) lesion region neglect achieved through local lesion-noising, disrupting visual understanding of critical areas. We then calculate clinical relevance for each sample based on scores from multiple Med-LLMs and visual tools, enabling effective alignment. Our experiments demonstrate that MMedPO significantly enhances factual accuracy in Med-LVLMs, achieving substantial improvements over existing preference optimization methods by 14.2% and 51.7% on the Med-VQA and report generation tasks, respectively.
Paperid:1255
Authors:Jiachen Hu · Rui Ai · Han Zhong · Xiaoyu Chen · Liwei Wang · Zhaoran Wang · Zhuoran Yang
Abstract:
Information asymmetry is a pervasive feature of multi-agent systems, especially evident in economics and social sciences. In these settings, agents tailor their actions based on private information to maximize their rewards. These strategic behaviors often introduce complexities due to confounding variables. Simultaneously, knowledge transportability poses another significant challenge, arising from the difficulties of conducting experiments in target environments. It requires transferring knowledge from environments where empirical data is more readily available. Against these backdrops, this paper explores a fundamental question in online learning: Can we employ non-i.i.d. actions to learn about confounders even when requiring knowledge transfer? We present a sample-efficient algorithm designed to accurately identify system dynamics under information asymmetry and to navigate the challenges of knowledge transfer effectively in reinforcement learning, framed within an online strategic interaction model. Our method provably achieves learning of an $\epsilon$-optimal policy with a tight sample complexity of $\tilde{O}(1/\epsilon^2)$.
Paperid:1256
Authors:Ashkan Soleymani · Behrooz Tahmasebi · Stefanie Jegelka · Patrick Jaillet
Abstract:
We study the statistical-computational trade-offs for learning with exact invariances (or symmetries) using kernel regression. Traditional methods, such as data augmentation, group averaging, canonicalization, and frame-averaging, either fail to provide a polynomial-time solution or are not applicable in the kernel setting. However, with oracle access to the geometric properties of the input space, we propose a polynomial-time algorithm that learns a classifier with \emph{exact} invariances. Moreover, our approach achieves the same excess population risk (or generalization error) as the original kernel regression problem. To the best of our knowledge, this is the first polynomial-time algorithm to achieve exact (not approximate) invariances in this context. Our proof leverages tools from differential geometry, spectral theory, and optimization. A key result in our development is a new reformulation of the problem of learning under invariances as optimizing an infinite number of linearly constrained convex quadratic programs, which may be of independent interest.
Paperid:1257
Authors:Omar Bennouna · Jiawei Zhang · Saurabh Amin · Asuman Ozdaglar
Abstract:
Contextual optimization problems are prevalent in decision-making applications where historical data and contextual features are used to learn predictive models that inform optimal actions. However, practical applications often suffer from model misspecification due to incomplete knowledge of the underlying data-generating process, leading to suboptimal decisions. Existing approaches primarily address the well-specified case, leaving a critical gap in handling misspecified models. In this paper, we propose a novel Integrated Learning and Optimization (ILO) framework that explicitly accounts for model misspecification by introducing a tractable surrogate loss function with strong theoretical guarantees on generalizability, tractability, and optimality. Our surrogate loss aligns with the true decision performance objective, ensuring robustness to misspecification without imposing restrictive assumptions. The proposed approach effectively mitigates the challenges of non-convexity and non-smoothness in the target loss function, leading to efficient optimization procedures. We provide rigorous theoretical analysis and experimental validation, demonstrating superior performance compared to state-of-the-art methods. Our work offers a principled solution to the practically relevant challenge of model misspecification in contextual optimization.
Paperid:1258
Authors:Shengsheng Lin · Haojun Chen · Haijie Wu · Chunyun Qiu · Weiwei Lin
Abstract:
Sufficiently modeling the correlations among variables (aka channels) is crucial for achieving accurate multivariate time series forecasting (MTSF). In this paper, we propose a novel technique called Temporal Query (TQ) to more effectively capture multivariate correlations, thereby improving model performance in MTSF tasks. Technically, the TQ technique consists of periodically shifted learnable vectors that adaptively represent the relationships among variables. Building upon the TQ technique, we introduce a simple yet efficient model named Temporal Query Network (TQNet), which employs only a single-layer attention mechanism and a lightweight multi-layer perceptron (MLP) module. Extensive experiments demonstrate that TQNet learns more robust multivariate correlations, achieving state-of-the-art forecasting accuracy across 12 challenging real-world datasets. Moreover, TQNet achieves high efficiency comparable to linear-based methods even in high-dimensional datasets, balancing performance and computational cost. The anonymous code is available at: https://anonymous.4open.science/r/TQNet-8a4o.
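The TQ mechanism described in this abstract — learnable vectors, periodically shifted per time step and used as attention queries over the series — can be sketched in a few lines. This is a hypothetical reading of the abstract, not the authors' released code; the array shapes, the period `P`, and the scaled dot-product attention form are all assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_query_attention(x, queries, t0=0):
    """Single-layer attention where the query for time step t is the
    learnable vector at index (t0 + t) mod P, i.e. a periodic shift.

    x:       (T, C) window of a multivariate series with C channels.
    queries: (P, C) learnable periodic query vectors (P = period).
    Returns: (T, C) attention-mixed values, one query per time step.
    """
    T, C = x.shape
    P = queries.shape[0]
    q = queries[(t0 + np.arange(T)) % P]   # periodically shifted queries: (T, C)
    scores = q @ x.T / np.sqrt(C)          # query-key scores: (T, T)
    return softmax(scores, axis=-1) @ x    # weighted sum of the series values

rng = np.random.default_rng(0)
x = rng.normal(size=(24, 7))               # 24 time steps, 7 channels
queries = rng.normal(size=(12, 7))         # hypothetical period of 12
out = temporal_query_attention(x, queries, t0=3)
print(out.shape)                           # (24, 7)
```

In TQNet the `queries` array would be a trained parameter; here it is random only to make the sketch self-contained.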
Paperid:1259
Authors:Chengrui Gao · Haopu Shang · Ke Xue · Chao Qian
Abstract:
Machine learning has increasingly been employed to solve NP-hard combinatorial optimization problems, resulting in the emergence of neural solvers that demonstrate remarkable performance, even with minimal domain-specific knowledge. To date, the community has created numerous open-source neural solvers with distinct motivations and inductive biases. While considerable efforts are devoted to designing powerful single solvers, our findings reveal that existing solvers typically demonstrate complementary performance across different problem instances. This suggests that significant improvements could be achieved through effective coordination of neural solvers at the instance level. In this work, we propose the first general framework to coordinate the neural solvers, which involves feature extraction, selection model, and selection strategy, aiming to allocate each instance to the most suitable solvers. To instantiate, we collect several typical neural solvers with state-of-the-art performance as alternatives, and explore various methods for each component of the framework. We evaluate our framework on two typical problems, Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP). Experimental results show that our framework can effectively distribute instances and the resulting composite solver can achieve significantly better performance (e.g., reduce the optimality gap by 0.88% on TSPLIB and 0.71% on CVRPLIB) than the best individual neural solver with little extra time cost.
Paperid:1260
Authors:Rohan Deb · Kiran Thekumparampil · Kousha Kalantari · Gaurush Hiranandani · Shoham Sabach · Branislav Kveton
Abstract:
Supervised fine-tuning (SFT) is a common method for adapting large language models (LLMs) to new domains. In this work, we try to improve the statistical efficiency of SFT by selecting an informative subset of training examples. Specifically, for a fixed budget of training examples, which determines the computational cost of fine-tuning, we want to train on the most informative ones. The key idea in our method is to select examples that maximize the Hessian of the log-likelihood of the LLM. We approximate it efficiently by making a connection to uncertainty modeling in multinomial logistic regression models. Our approach is computationally efficient, analyzable, and performs well empirically. We demonstrate this on several problems, and back our claims with both quantitative results and an LLM evaluation.
Paperid:1261
Authors:Hao Chen · Yujin Han · Fangyi Chen · Xiang Li · Yidong Wang · Jindong Wang · Ze Wang · Zicheng Liu · Difan Zou · Bhiksha Raj
Abstract:
Recent advances in latent diffusion models have demonstrated their effectiveness for high-resolution image synthesis. However, the properties of the tokenizer's latent space that enable better learning and generation in diffusion models remain under-explored. Theoretically and empirically, we find that improved generation quality is closely tied to latent distributions with better structure, such as those with fewer Gaussian mixture modes and more discriminative features. Motivated by these insights, we propose MAETok, an autoencoder (AE) leveraging mask modeling to learn a semantically rich latent space while maintaining reconstruction fidelity. Extensive experiments validate our analysis, demonstrating that the variational form of autoencoders is not necessary, and that a discriminative latent space from an AE alone enables state-of-the-art performance on ImageNet generation using only 128 tokens. MAETok achieves significant practical improvements, enabling a gFID of 1.69 with 76× faster training and 31× higher inference throughput for 512×512 generation. Our findings show that the structure of the latent space, rather than variational constraints, is crucial for effective diffusion models. Code and trained models will be released.
Paperid:1262
Authors:Jing Qiao · Yu Liu · YUAN YUAN · Xiao Zhang · Zhipeng Cai · Dongxiao Yu
Abstract:
This paper examines the theoretical performance of distributed diffusion models in environments where computational resources and data availability vary significantly among workers. Traditional models centered on single-worker scenarios fall short in such distributed settings, particularly when some workers are resource-constrained. This discrepancy in resources and data diversity challenges the assumption of accurate score function estimation foundational to single-worker models. We establish the inaugural generation error bound for distributed diffusion models in resource-limited settings, showing a linear relationship with the data dimension $d$ and consistency with established single-worker results. Our analysis highlights the critical role of hyperparameter selection in influencing the training dynamics, which are key to the performance of model generation. This study provides a streamlined theoretical approach to optimizing distributed diffusion models, paving the way for future research in this area.
Paperid:1263
Authors:Lan Li · Da-Wei Zhou · Han-Jia Ye · De-Chuan Zhan
Abstract:
Domain-Incremental Learning (DIL) focuses on continual learning in non-stationary environments, requiring models to adjust to evolving domains while preserving historical knowledge. DIL faces two critical challenges in the context of imbalanced data: intra-domain class imbalance and cross-domain class distribution shifts. These challenges significantly hinder model performance, as intra-domain imbalance leads to underfitting of few-shot classes, while cross-domain shifts require maintaining well-learned many-shot classes and transferring knowledge to improve few-shot class performance in new domains. To overcome these challenges, we introduce the Dual-Balance Collaborative Experts (DCE) framework. DCE employs a frequency-aware expert group, trained with specialized loss functions to decouple feature learning and balance the representation of many-shot and few-shot classes thus alleviating intra-domain class imbalance. Subsequently, a dynamic expert selector is learned by synthesizing pseudo-features through balanced Gaussian sampling from historical class statistics. This mechanism balances the retention of many-shot classes and promotes cross-domain knowledge transfer, enhancing few-shot class performance in new domains. Extensive experimental results on four benchmark datasets demonstrate DCE’s state-of-the-art performance.
Paperid:1264
Authors:Zihan Song · Xin Wang · Zi Qian · Hong Chen · Longtao Huang · Hui Xue' · Wenwu Zhu
Abstract:
Multimodal Large Language Models (Multimodal LLMs) have shown their strength in Video Question Answering (VideoQA). However, due to the black-box nature of end-to-end training strategies, existing approaches based on Multimodal LLMs suffer from a lack of interpretability for VideoQA: they can neither present reasoning paths nor indicate where the answers are derived from in the video. To address this issue, we propose MSR-ViR (Modularized Self-Reflected Video Reasoner), which for the first time integrates modular networks into Multimodal LLMs, capable of providing VideoQA with explicit reasoning paths for more interpretability. Specifically, a MoST-Grounding (Modularized Spatial-Temporal Grounding) network is proposed to decompose complex questions via tree-structured policies, localizing relevant temporal and spatial segments within videos through step-by-step reasoning. The proposed MoST-Grounding network provides explicit visually grounded information for Multimodal LLMs with clear reasoning paths, thus enhancing interpretability for the predicted answers. To further improve the reasoning quality, we design an Alternate Self-Reflection Training Strategy to jointly optimize policy generation and Multimodal LLMs. Experiments on real-world datasets demonstrate the superiority of our proposed MSR-ViR framework in video understanding, reasoning transparency, and providing explicit localization evidence for answers.
Paperid:1265
Authors:Jan Disselhoff · Daniel Franzen · David Hartmann
Abstract:
The Abstraction and Reasoning Corpus (ARC-AGI) poses a significant challenge for large language models (LLMs), exposing limitations in their abstract reasoning abilities. In this work, we leverage task-specific data augmentations throughout the training, generation, and scoring phases, and employ a depth-first search algorithm to generate diverse, high-probability candidate solutions. Furthermore, we utilize the LLM not only as a generator but also as a scorer, using its output probabilities to select the most promising solutions. Our method achieves a score of 71.6\% (286.5/400 solved tasks) on the public ARC-AGI evaluation set, demonstrating state-of-the-art performance among publicly available approaches. While concurrent closed-source work has reported higher scores, our method distinguishes itself through its transparency, reproducibility, and remarkably low inference cost, averaging only around 2 cents per task on readily available hardware.
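The depth-first candidate generation this abstract describes could look roughly like the following sketch, with a toy stand-in for the LLM's next-token distribution. The function names, the `<eos>` convention, and the log-probability pruning threshold are illustrative assumptions, not the authors' implementation.

```python
import math

def dfs_candidates(next_probs, vocab, max_len, min_logp):
    """Depth-first enumeration of complete sequences whose cumulative
    log-probability stays above min_logp; a branch is pruned as soon as
    it falls below the threshold (sketch of the candidate-generation
    idea; next_probs stands in for an LLM's next-token distribution)."""
    results = []

    def dfs(prefix, logp):
        if logp < min_logp:
            return                                  # prune this whole subtree
        if prefix and prefix[-1] == "<eos>":
            results.append((prefix[:-1], logp))     # complete candidate
            return
        if len(prefix) >= max_len:
            return
        for tok in vocab:
            p = next_probs(prefix).get(tok, 0.0)
            if p > 0.0:
                dfs(prefix + [tok], logp + math.log(p))

    dfs([], 0.0)
    return sorted(results, key=lambda r: -r[1])     # most probable first

# Toy stand-in distribution: always 50/50 between "a" and stopping.
toy = lambda prefix: {"a": 0.5, "<eos>": 0.5}
results = dfs_candidates(toy, vocab=["a", "<eos>"], max_len=4,
                         min_logp=math.log(0.05))
```

With this toy distribution the search returns four candidates, led by the empty sequence at log-probability ln(0.5); scoring the candidates with the LLM itself would then be a separate re-ranking pass.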
Paperid:1266
Authors:Mingyang Wu · Li Lin · Wenbin Zhang · Xin Wang · Zhenhuan Yang · Shu Hu
Abstract:
The Area Under the ROC Curve (AUC) is a key metric for classification, especially under class imbalance, with growing research focus on optimizing AUC over accuracy in applications like medical image analysis and deepfake detection. As a result, fairness in AUC optimization becomes crucial, since biases can impact protected groups. While various fairness mitigation techniques exist, fairness considerations in AUC optimization remain in their early stages, with most research focusing on improving AUC fairness under the assumption of clean protected groups. However, these studies often overlook the impact of noisy protected groups, leading to fairness violations in practice. To address this, we propose the first robust AUC fairness approach under noisy protected groups with theoretical fairness guarantees, using distributionally robust optimization. Extensive experiments on tabular and image datasets show that our method outperforms state-of-the-art approaches in preserving AUC fairness. The code is in the supplementary material.
Paperid:1267
Authors:Dongsu Lee · Minhae Kwon
Abstract:
The goal of offline reinforcement learning (RL) is to extract the best possible policy from a previously collected dataset while accounting for the out-of-distribution (OOD) sample issue. Offline model-based RL (MBRL) is a captivating solution capable of alleviating such issues through \textit{state-action transition augmentation} with a learned dynamics model. Unfortunately, offline MBRL methods have long been observed to fail in sparsely rewarded and long-horizon environments. In this work, we propose a novel MBRL method, dubbed Temporal Distance-Aware Transition Augmentation (TempDATA), that generates additional transitions in a geometrically structured representation space instead of state space. To comprehend long-horizon behaviors efficiently, our main idea is to learn a state abstraction that captures a temporal distance from both trajectory and transition levels of state space. Our experiments empirically confirm that TempDATA outperforms previous offline MBRL methods and matches or surpasses the performance of diffusion-based trajectory augmentation and goal-conditioned RL on D4RL AntMaze, FrankaKitchen, CALVIN, and pixel-based FrankaKitchen.
Paperid:1268
Authors:Anselm Paulus · Arman Zharmagambetov · Chuan Guo · Brandon Amos · Yuandong Tian
Abstract:
Large Language Models (LLMs) are vulnerable to jailbreaking attacks that lead to generation of inappropriate or harmful content. Manual red-teaming requires a time-consuming search for adversarial prompts, whereas automatic adversarial prompt generation often leads to semantically meaningless attacks that do not scale well. In this paper, we present a novel method that uses another LLM, called AdvPrompter, to generate human-readable adversarial prompts in seconds. AdvPrompter, which is trained using an alternating optimization algorithm, generates suffixes that veil the input instruction without changing its meaning, such that the TargetLLM is lured to give a harmful response. Experimental results on popular open-source TargetLLMs show highly competitive results on the AdvBench and HarmBench datasets, which also transfer to closed-source black-box LLMs. We also show that training on adversarial suffixes generated by AdvPrompter is a promising strategy for improving the robustness of LLMs to jailbreaking attacks.
Paperid:1269
Authors:Jie Peng · Hongwei Yang · Jing Zhao · Hengji Dong · Hui He · Weizhe Zhang · Haoyu He
Abstract:
Deep neural networks are vulnerable to backdoor attacks, where malicious behaviors are implanted during training. While existing defenses can effectively purify compromised models, they typically require labeled data or specific training procedures, making them difficult to apply beyond supervised learning settings. Recent studies have shown successful backdoor attacks across various learning paradigms, highlighting a critical security concern. To address this gap, we propose Two-stage Symmetry Connectivity (TSC), a novel backdoor purification defense that operates independently of data format and requires only a small fraction of clean samples. Through theoretical analysis, we prove that by leveraging permutation invariance in neural networks and quadratic mode connectivity, TSC amplifies the loss on poisoned samples while maintaining bounded clean accuracy. Experiments demonstrate that TSC achieves robust performance comparable to state-of-the-art methods in supervised learning scenarios. Furthermore, TSC generalizes to self-supervised learning frameworks, such as SimCLR and CLIP, maintaining its strong defense capabilities.
Paperid:1270
Authors:Yuanhuiyi Lyu · Xu Zheng · Lutao Jiang · Yibo Yan · Xin Zou · Huiyu Zhou · Linfeng Zhang · Xuming Hu
Abstract:
Recent text-to-image generative models, e.g., Stable Diffusion V3 and Flux, have achieved notable progress. However, these models are strongly restricted by their limited knowledge, a.k.a., their own fixed parameters, trained on closed datasets. This leads to significant hallucinations or distortions when facing fine-grained and unseen novel real-world objects, e.g., the appearance of the Tesla Cybertruck. To this end, we present the first real-object-based retrieval-augmented generation framework (RealRAG), which augments fine-grained and unseen novel object generation by learning and retrieving real-world images to overcome the knowledge gaps of generative models. Specifically, to integrate missing memory for unseen novel object generation, we train a reflective retriever by self-reflective contrastive learning, which injects the generator's knowledge into the self-reflective negatives, ensuring that the retrieved augmented images compensate for the model's missing knowledge. Furthermore, the real-object-based framework integrates fine-grained visual knowledge into the generative models, tackling the distortion problem and improving realism for fine-grained object generation. RealRAG is superior in its modular applicability to all types of state-of-the-art text-to-image generative models and also delivers remarkable performance boosts with all of them, such as a gain of 16.18\% in FID score with the auto-regressive model on the Stanford Car benchmark.
Paperid:1271
Authors:Yinhan He · Wendy Zheng · Yushun Dong · Yaochen Zhu · Chen Chen · Jundong Li
Abstract:
Mechanistic interpretability (MI) research aims to understand large language models (LLMs) by identifying computational circuits, subgraphs of model components with associated functional interpretations, that explain specific behaviors. Current MI approaches focus on discovering task-specific circuits, which has two key limitations: (1) poor generalizability across different language tasks, and (2) high costs associated with requiring human or advanced-LLM interpretation of each computational node. To address these challenges, we propose developing a ``modular circuit (MC) vocabulary'' consisting of task-agnostic functional units. Each unit consists of a small computational subgraph with its interpretation. This approach enables global interpretability by allowing different language tasks to share common MCs, while reducing costs by reusing established interpretations for new tasks. We establish five criteria for characterizing the MC vocabulary and present ModCirc, a novel global-level mechanistic interpretability framework for discovering MC vocabularies in LLMs. We demonstrate ModCirc's effectiveness on Med-LLaMA (8B), where it successfully identifies modular circuits that perform well on our proposed quality metrics. Code is available at https://anonymous.4open.science/r/ModCirc-4887/.
Paperid:1272
Authors:Qiangqiang Zhang · Ting Li · Xinwei Feng · Xiaodong Yan · Jinhan Xie
Abstract:
Traditional conformal prediction faces significant challenges with the rise of data streaming and increasing concerns over privacy. In this paper, we introduce a novel online differentially private conformal prediction framework designed to construct dynamic, model-free private prediction sets. Unlike existing approaches that either disregard privacy or require full access to the entire dataset, our proposed method ensures individual privacy with a one-pass algorithm, ideal for real-time, privacy-preserving decision-making. Theoretically, we establish guarantees for long-run coverage at the nominal confidence level. Moreover, we extend our method to conformal quantile regression, which is fully adaptive to heteroscedasticity. We validate the effectiveness and applicability of the proposed method through comprehensive simulations and real-world studies on the ELEC2 and PAMAP2 datasets.
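As a point of reference for the one-pass construction with long-run coverage, a non-private online conformal threshold update (in the style of adaptive conformal inference) can be written in a few lines. This is a simplified sketch under stated assumptions, not the paper's method: the paper's framework additionally enforces differential privacy, which is omitted here, and the uniform scores below are synthetic stand-ins for conformity scores.

```python
import numpy as np

def online_conformal_threshold(scores, alpha=0.1, lr=0.05, q0=0.5):
    """One-pass update of a conformal threshold q_t so that the long-run
    miscoverage rate tracks alpha. Each round: cover if score <= q_t,
    then raise q_t after a miss and lower it slightly after a cover
    (ACI-style sketch; a private variant would add calibrated noise)."""
    q, covered = q0, []
    for s in scores:
        miss = float(s > q)
        covered.append(1.0 - miss)
        q += lr * (miss - alpha)    # gradient step on the pinball loss
    return q, float(np.mean(covered))

rng = np.random.default_rng(1)
scores = rng.uniform(size=5000)     # synthetic stand-in conformity scores
q_final, coverage = online_conformal_threshold(scores, alpha=0.1)
```

On uniform scores the threshold settles near the 0.9 quantile and the empirical coverage hovers around the nominal 90% level, which is the long-run guarantee the abstract refers to.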
Paperid:1273
Authors:Piyush Anand · Piotr Indyk · Ravishankar Krishnaswamy · Sepideh Mahabadi · Vikas Raykar · Kirankumar Shiragur · Haike Xu
Abstract:
Nearest neighbor search is a fundamental data structure problem with many applications in machine learning, computer vision, recommendation systems and other fields. Although the main objective of the data structure is to quickly report data points that are closest to a given query, it has long been noted~\citep{carbonell1998use} that without additional constraints the reported answers can be redundant and/or duplicative. This issue is typically addressed in two stages: in the first stage, the algorithm retrieves a (large) number $r$ of points closest to the query, while in the second stage, the $r$ points are post-processed and a small subset is selected to maximize the desired diversity objective. Although popular, this method suffers from a fundamental efficiency bottleneck, as the set of points retrieved in the first stage often needs to be much larger than the final output. In this paper we present provably efficient algorithms for approximate nearest neighbor search with diversity constraints that bypass this two-stage process. Our algorithms are based on popular graph-based methods, which allows us to ``piggy-back'' on existing efficient implementations. These are the first graph-based algorithms for nearest neighbor search with diversity constraints. For data sets with low intrinsic dimension, our data structures report a diverse set of $k$ points approximately closest to the query, in time that only depends on $k$ and $\log \Delta$, where $\Delta$ is the ratio of the diameter to the closest pair distance in the data set. This bound is qualitatively similar to the best known bounds for standard (non-diverse) graph-based algorithms. Our experiments show that the search time of our algorithms is substantially lower than that of the standard two-stage approach.
Paperid:1274
Authors:Makoto Takamoto · Viktor Zaverkin · Mathias Niepert
Abstract:
Machine learning plays an increasingly important role in computational chemistry and materials science, complementing computationally intensive ab initio and first-principles methods. Despite their utility, machine-learning models often lack generalization capability and robustness during atomistic simulations, yielding unphysical energy and force predictions that hinder their real-world applications. We address this challenge by introducing a physics-informed, weakly supervised approach for training machine-learned interatomic potentials (MLIPs). We introduce two novel loss functions, extrapolating the potential energy via a Taylor expansion and using the concept of conservative forces. Our approach improves the accuracy of MLIPs applied to training tasks with sparse training data sets and reduces the need for pre-training computationally demanding models with large data sets. In particular, we perform extensive experiments demonstrating reduced energy and force errors---often lower by a factor of two---for various baseline models and benchmark data sets. Moreover, we demonstrate improved robustness during MD simulations of MLIP models trained with the proposed weakly supervised loss. Finally, we show that our method can be applied in real-world settings where foundation models are fine-tuned on sparse but highly accurate data. An implementation of our method and scripts for executing experiments are available at \url{https://anonymous.4open.science/r/PICPS-ML4Sci-1E8F}.
Paperid:1275
Authors:Jonathan Richens · Tom Everitt · David Abel
Abstract:
Are world models a necessary ingredient for flexible, goal-directed behaviour, or is model-free learning sufficient? We provide a formal answer to this question, showing that any agent capable of zero-shot generalization must have learned a generative model of its environment. We show that this model can be inferred directly from the agent's policy, and that minimizing regret, particularly for long time-horizon tasks, requires learning increasingly accurate world models. Our findings provide new insights into the foundations of intelligence and agency, establish fundamental bounds on agent capabilities from the learnability of environment dynamics, and highlight the crucial role of world models in achieving safe and general AI.
Paperid:1276
Authors:Ya Jiang · Chuxiong Wu · Massieh Kordi Boroujeny · Brian Mark · Kai Zeng
Abstract:
Watermarking for large language models (LLMs) offers a promising approach to identifying AI-generated text. Existing approaches, however, either compromise the distribution of the original text generated by LLMs or are limited to embedding zero-bit information that only allows for watermark detection but not identification. We present StealthInk, a stealthy multi-bit watermarking scheme that preserves the original text distribution while enabling the embedding of provenance data, such as userID, TimeStamp, and modelID, within LLM-generated text. This enhances fast traceability without requiring access to the language model's API or prompts. We derive a lower bound on the number of tokens necessary for watermark detection at a fixed equal error rate, which provides insights on how to enhance the capacity. Comprehensive empirical evaluations across diverse tasks highlight the stealthiness, detectability, and resilience of StealthInk, establishing it as an effective solution for LLM watermarking applications.
Paperid:1277
Authors:Hao Qiu · Emmanuel Esposito · Mengxiao Zhang
Abstract:
In this work, we study the online convex optimization problem with curved losses and delayed feedback. When losses are strongly convex, existing approaches obtain regret bounds of order $d_{\max} \ln T$, where $d_{\max}$ is the maximum delay and $T$ is the time horizon. However, in many cases, this guarantee can be much worse than $\sqrt{d_{\mathrm{tot}}}$ as obtained by a delayed version of online gradient descent, where $d_{\mathrm{tot}}$ is the total delay. We bridge this gap by proposing a variant of follow-the-regularized-leader that obtains regret of order $\min\{\sigma_{\max}\ln T, \sqrt{d_{\mathrm{tot}}}\}$, where $\sigma_{\max}$ is the maximum number of missing observations. We additionally consider exp-concave losses and extend the Online Newton Step algorithm to handle delays with an adaptive learning rate tuning, achieving regret $\min\{d_{\max} n\ln T, \sqrt{d_{\mathrm{tot}}}\}$, where $n$ is the dimension. To our knowledge, this is the first algorithm to achieve such a regret bound for exp-concave losses. We also design a variant of the Vovk-Azoury-Warmuth forecaster with a clipping trick for online linear regression under delayed feedback. Finally, we provide an experimental validation of our algorithms, showing improved performance over existing methods.
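For concreteness, the delayed online gradient descent baseline that the abstract compares against (each round's gradient only becomes usable once its feedback arrives) can be sketched as follows. The fixed delay, step size, and quadratic loss are illustrative assumptions, not the paper's setting, which allows arbitrary per-round delays.

```python
def delayed_ogd(loss_grad, T, delay, eta, x0=0.0):
    """Online gradient descent under delayed feedback: the gradient
    computed at round t only becomes available, and is applied, at
    round t + delay (fixed-delay sketch of the standard baseline)."""
    x, pending, xs = x0, [], []
    for t in range(T):
        xs.append(x)
        pending.append((t + delay, loss_grad(x)))   # feedback in flight
        while pending and pending[0][0] <= t:       # apply arrived gradients
            _, g = pending.pop(0)
            x -= eta * g
    return xs

# Stale gradients on a fixed quadratic loss f(x) = (x - 1)^2:
xs = delayed_ogd(lambda x: 2.0 * (x - 1.0), T=200, delay=3, eta=0.05)
```

Even with three-round-old gradients the iterates converge to the minimizer at 1, provided the step size is small relative to the delay; the paper's algorithms instead adapt the regularization to the observed delays.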
Paperid:1278
Authors:Luan Yang · Jingdong Zhang · Qunxi Zhu · Wei Lin
Abstract:
Learning-enabled controllers with stability certificate functions have demonstrated impressive empirical performance in addressing control problems in recent years. Nevertheless, directly deploying the neural controllers onto actual digital platforms requires impractically excessive communication resources due to the continuously updating demand of the closed-loop feedback controller. We introduce a framework aimed at learning the event-triggered controller (ETC) with optimal scheduling, i.e., minimal triggering times, to address this challenge in resource-constrained scenarios. Our proposed framework, denoted Neural ETC, includes two practical algorithms: the path integral algorithm based on directly simulating the event-triggered dynamics, and the Monte Carlo algorithm derived from new theoretical results on the lower bound of the inter-event time. Furthermore, we propose a projection operation with an analytical expression that ensures theoretical stability and schedule optimality for Neural ETC. Compared to conventional neural controllers, our empirical results show that Neural ETC significantly reduces the required communication resources while enhancing control performance in constrained-communication scenarios.
Paperid:1279
Authors:Kexin Huang · Ziqian Chen · xue wang · Chongming Gao · Jinyang Gao · Bolin Ding · Xiang Wang
Abstract:
Auctions play a crucial role in many modern trading environments, including online advertising and public resource allocation. As the number of competing bidders increases, learning Bayesian Nash Equilibrium (BNE) in auctions faces significant scalability challenges. Existing methods often experience slow convergence in large-scale auctions. For example, in a classic symmetric auction setting, the convergence rate depends quadratically on the number of bidders. To address this issue, we propose the Approximate Best Response Gradient method, a new approach for learning BNE efficiently in auction games. We leverage an analytic solution for gradient estimation to enable efficient gradient computation during optimization. Moreover, we introduce the Best Response Distance objective, which serves as an upper bound on the approximation quality to BNE. By optimizing the new objective, our method is proven to achieve a local convergence rate independent of the number of bidders, circumventing the traditional quadratic complexity in the classic symmetric setting. Extensive experiments across various auction formats demonstrate that our approach accelerates convergence and enhances learning efficiency in complex auction settings.
Paperid:1280
Authors:Avinash Kori · Francesca Toni · Ben Glocker
Abstract:
Modular object-centric representations are essential for human-like reasoning but are challenging to obtain under spatial ambiguities, e.g., due to occlusions and view ambiguities. Addressing these challenges presents both theoretical and practical difficulties. We introduce a novel multi-view probabilistic approach that aggregates view-specific slots to capture invariant content information while simultaneously learning disentangled global viewpoint-level information. Unlike prior single-view methods, our approach resolves spatial ambiguities, provides theoretical guarantees for identifiability, and requires no viewpoint annotations. Extensive experiments on standard benchmarks and novel complex datasets validate our method's robustness and scalability.
Paperid:1281
Authors:Yi Zhang · Linjun Huang · Yun Yang · Xiaofeng Shao
Abstract:
Conditional mean independence (CMI) testing is crucial for statistical tasks including model determination and variable importance evaluation. In this work, we introduce a novel population CMI measure and a bootstrap-based testing procedure that utilizes deep generative neural networks to estimate the conditional mean functions involved in the population measure. The test statistic is thoughtfully constructed to ensure that even slowly decaying nonparametric estimation errors do not affect the asymptotic accuracy of the test. Our approach demonstrates strong empirical performance in scenarios with high-dimensional covariates and response variables, can handle multivariate responses, and maintains nontrivial power against local alternatives outside an $n^{-1/2}$ neighborhood of the null hypothesis. We also use numerical simulations and real-world imaging data applications to highlight the efficacy and versatility of our testing procedure.
Paperid:1282
Authors:Pablo Samuel Castro · Nenad Tomasev · Ankit Anand · Navodita Sharma · Rishika Mohanta · Aparna Dev · Kuba Perlin · Siddhant Jain · Kyle Levin · Noemi Elteto · Will Dabney · Alexander Novikov · Glenn Turner · Maria Eckstein · Nathaniel Daw · Kevin Miller · Kimberly Stachenfeld
Abstract:
Symbolic models play a key role in cognitive science, expressing computationally precise hypotheses about how the brain implements a cognitive process. Identifying an appropriate model typically requires a great deal of effort and ingenuity on the part of a human scientist. Here, we adapt FunSearch (Romera-Paredes et al., 2024), a recently developed tool that uses Large Language Models (LLMs) in an evolutionary algorithm, to automatically discover symbolic cognitive models that accurately capture human and animal behavior. We consider datasets from three species performing a classic reward-learning task that has been the focus of substantial modeling effort, and find that the discovered programs outperform state-of-the-art cognitive models for each. The discovered programs can readily be interpreted as hypotheses about human and animal cognition, instantiating interpretable symbolic learning and decision-making algorithms. Broadly, these results demonstrate the viability of using LLM-powered program synthesis to propose novel scientific hypotheses regarding mechanisms of human and animal cognition.
Paperid:1283
Authors:Haoyu Wang · Shikun Liu · Rongzhe Wei · Pan Li
Abstract:
Large language models (LLMs) have recently been introduced to graph learning, aiming to extend their zeroshot generalization success to tasks where labeled graph data is scarce. Among these applications, inference over text-attributed graphs (TAGs) presents unique challenges: existing methods struggle with LLMs' limited context length for processing large node neighborhoods and the misalignment between node embeddings and the LLM token space. To address these issues, we establish two key principles for ensuring generalization and derive the framework LLM-BP accordingly: (1) Unifying the attribute space with task-adaptive embeddings, where we leverage LLM-based encoders and task-aware prompting to enhance generalization of the text attribute embeddings; (2) Developing a generalizable graph information aggregation mechanism, for which we adopt belief propagation with LLM-estimated parameters that adapt across graphs. Evaluations on 11 real-world TAG benchmarks demonstrate that LLM-BP significantly outperforms existing approaches, achieving 9.85% improvement with task-conditional embeddings and an additional 1.93% gain from adaptive aggregation.
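To make the aggregation principle concrete, here is a generic belief-propagation-style smoothing step on a small graph. The form (simple neighbor averaging with a single edge-compatibility weight) is an assumption for illustration; in the paper that parameter would be estimated per graph by the LLM, and this is not the LLM-BP implementation.

```python
import numpy as np

def propagate(beliefs, adj, compat=0.8, iters=10):
    """Refine each node's class belief by mixing in its neighbors' beliefs.

    compat plays the role of an edge-compatibility parameter (assumed fixed
    here); beliefs is an (n_nodes, n_classes) array of prior distributions.
    """
    P = beliefs.copy()
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    for _ in range(iters):
        neighbor_msg = adj @ P / deg             # average neighbor belief
        P = (1 - compat) * beliefs + compat * neighbor_msg
        P /= P.sum(axis=1, keepdims=True)        # renormalize to distributions
    return P

# Tiny 4-node path graph: node 0 confidently class A, node 3 confidently B,
# nodes 1 and 2 initially uncertain.
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
prior = np.array([[0.9, 0.1], [0.5, 0.5], [0.5, 0.5], [0.1, 0.9]])
post = propagate(prior, adj)
```

After propagation, node 1 leans toward class A (it sits next to node 0) and node 2 toward class B, which is the homophily-driven behavior the aggregation mechanism exploits.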
Paperid:1284
Authors:Mehmet Yiğit Balık · Maksim Sinelnikov · Priscilla Ong · Harri Lähdesmäki
Abstract:
High-dimensional time-series datasets are common in domains such as healthcare and economics. Variational autoencoder (VAE) models where latent variables are modeled with Gaussian process (GP) prior have become a prominent model class to analyze such correlated datasets. However, their applications are challenged by the inherent cubic time complexity that requires specific GP approximation techniques, as well as the general challenge of modeling both shared and individual-specific correlations across time. In this work, we propose a scalable basis function approximation technique for GP prior VAEs that mitigates these challenges and results in linear time complexity. The proposed model has a global parametrization that eliminates the need for amortized variational inference and the associated amortization gap, making it particularly well-suited for conditional generation tasks where accuracy and efficiency are crucial. Empirical evaluations on synthetic and real-world benchmark datasets demonstrate that our approach not only improves computational efficiency but also drastically enhances predictive performance.
Paperid:1285
Authors:Zhen Sun · Lei Tan · Yunhang Shen · Chengmao Cai · Xing Sun · Pingyang Dai · Liujuan Cao · Rongrong Ji
Abstract:
Multimodal person re-identification (Re-ID) aims to match pedestrian images of the same identity across different modalities. Existing works mainly focus on cross-modal data matching with single-modal input while ignoring arbitrary query and retrieval using multimodal data (including RGB, infrared, sketches, and text information). This oversight significantly constrains the practical application of person Re-ID in real-world scenarios. In this paper, we propose FlexiReID, a flexible modality person Re-ID framework that supports seven different retrieval methods, including text, sketches, RGB, and infrared images, as well as any combination thereof. Specifically, we first propose an adaptive mixture-of-experts (MoE) mechanism to integrate information from diverse modalities effectively. This mechanism dynamically selects and combines outputs from various expert networks, thereby leveraging the strengths of each expert to improve multimodal retrieval performance. Then, we devise a cross-modal query fusion module to optimize the extraction of fused features across different modalities. To the best of our knowledge, this is the first framework to propose flexible multimodal retrieval for person Re-ID. Furthermore, we have expanded the modalities of four multimodal Re-ID datasets (CUHK-PEDES, ICFG-PEDES, RSTPReID, and SYSU-MM01) to construct a unified dataset named CIRS-PEDES, which encompasses four modalities: text, sketches, RGB, and infrared. Extensive experiments on the CIRS-PEDES dataset show that the proposed FlexiReID
Paperid:1286
Authors:Kangjie Chen · Muyang Li · Guanlin Li · Shudong Zhang · Shangwei Guo · Tianwei Zhang
Abstract:
Vision-Language Models (VLMs) have become a cornerstone in multi-modal artificial intelligence, enabling seamless integration of visual and textual information for tasks such as image captioning, visual question answering, and cross-modal retrieval. Despite their impressive capabilities, these models often exhibit inherent vulnerabilities that can lead to safety failures in critical applications. Red-teaming is an important approach to identify and test a system's vulnerabilities, but how to conduct red-teaming for contemporary VLMs remains an unexplored area. In this paper, we propose a novel multi-modal red-teaming approach, TRUST-VLM, to enhance both the attack success rate and the diversity of successful test cases for VLMs. Specifically, TRUST-VLM builds upon in-context learning to adversarially test a VLM on both image and text inputs. Furthermore, we incorporate feedback from the target VLM to improve the efficiency of test case generation. Through extensive experimentation, we demonstrate that TRUST-VLM not only outperforms traditional red-teaming techniques in generating diverse and effective adversarial cases but also provides actionable insights for model improvement. These findings highlight the importance of advanced red-teaming strategies in ensuring the safety, reliability, and fairness of next-generation vision-language systems.
Paperid:1287
Authors:Alessandro Licciardi · Davide Leo · Eros Fanì · Barbara Caputo · Marco Ciccone
Abstract:
Federated Learning (FL) emerged as a decentralized paradigm to train models while preserving privacy. However, conventional FL struggles with data heterogeneity and class imbalance, which degrade model performance. Clustered FL balances personalization and decentralized training by grouping clients with analogous data distributions, enabling improved accuracy while adhering to privacy constraints; this approach effectively mitigates the adverse impact of heterogeneity in FL. In this work, we propose a novel clustering method for FL, FedGWC (Federated Gaussian Weighting Clustering), which groups clients based on their data distribution, allowing training of a more robust and personalized model on the identified clusters. FedGWC identifies homogeneous clusters by transforming individual empirical losses to model client interactions with a Gaussian reward mechanism. Additionally, we introduce the Wasserstein Adjusted Score, a new clustering metric for FL to evaluate cluster cohesion with respect to the individual class distribution. Our experiments on benchmark datasets show that FedGWC outperforms existing FL algorithms in cluster quality and classification accuracy, validating the efficacy of our approach.
Paperid:1288
Authors:Tianzhe Chu · Yuexiang Zhai · Jihan Yang · Shengbang Tong · Saining Xie · Dale Schuurmans · Quoc Le · Sergey Levine · Yi Ma
Abstract:
Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize to out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.
Paperid:1289
Authors:Christopher Wewer · Bartlomiej Pogodzinski · Bernt Schiele · Jan Eric Lenssen
Abstract:
We introduce Spatial Reasoning Models (SRMs), a framework to perform reasoning over sets of continuous variables via denoising generative models. SRMs infer continuous representations on a set of unobserved variables, given observations on observed variables. Current generative models on spatial domains, such as diffusion and flow matching models, often collapse to hallucination in case of complex distributions. To measure this, we introduce a set of benchmark tasks that test the quality of complex reasoning in generative models and can quantify hallucination. The SRM framework allows us to report key findings about the importance of sequentialization in generation, the associated order, as well as the sampling strategies during training. It demonstrates, for the first time, that the order of generation can successfully be predicted by the denoising network itself. Using these findings, we can increase the accuracy of specific reasoning tasks from <1% to >50%.
Paperid:1290
Authors:Bowen Zheng · Tianming Yang
Abstract:
Diffusion models (DMs) generate high-quality samples but are computationally expensive during inference. To address this issue, distillation methods have been developed to train a simpler student model to mimic a complex teacher model. However, existing distillation methods are often computationally demanding, and the student model produces degraded outputs. The problem may be alleviated by adding a GAN objective during distillation, yet the underlying reason remains unclear. Here, we identify a fundamental limitation of diffusion distillation: the teacher and student models, due to different step sizes and parameter counts, converge to distinct local minima, making direct imitation suboptimal. We further show that using a standalone GAN objective without a distillation objective circumvents this problem and is sufficient to convert diffusion models into efficient one-step generators. The efficiency of this method suggests that diffusion training of the original model may be considered generative pre-training, equipping the model with capabilities that can be utilized for one-step generation through fine-tuning with the GAN objective. This view is further supported by a freezing experiment: with 85\% of the parameters frozen during fine-tuning, the original diffusion model condenses into a one-step generator. The resulting model exhibits strong performance with only 0.2M training images (FID = 4.12, CIFAR-10). Scaling up to 5M images yields near-SOTA results across multiple datasets, including CIFAR-10 (FID = 1.54), AFHQv2 64x64 (FID = 1.24), FFHQ 64x64 (FID = 0.85), and ImageNet 64x64 (FID = 1.16). Our findings provide new insights into diffusion training, showing that it functions as a powerful generative pre-training mechanism and challenging the necessity of multi-step distillation for efficient generative modeling.
Paperid:1291
Authors:Fivos Kalogiannis · Emmanouil-Vasileios Vlatakis-Gkaragkounis · Ian Gemp · Georgios Piliouras
Abstract:
We provide the first provable guarantees of global convergence to Nash equilibria (NE) in two-player zero-sum convex Markov games (cMGs) using independent policy gradient methods. Convex Markov games, recently described by Gemp et al. (2024), extend Markov decision processes to multi-agent settings with convex preferences over occupancy measures, offering a broad framework for modeling generic strategic interactions. However, even the fundamental min-max case of cMGs presents significant challenges, including inherent nonconvexity even under direct policy parametrization, the absence of Bellman equations, and the complexities of infinite horizons. Our results follow a two-step approach. First, leveraging the properties of hidden-convex–hidden-concave functions, we show that a carefully crafted regularization transforms the underlying min-max optimization problem into a nonconvex–proximal Polyak-Lojasiewicz (NC-pPL) structure. Crucially, this regularization can stabilize the iterates of independent policy gradient methods and ultimately lead them to converge to equilibria. Second, building on this reduction, we address general constrained min-max problems under NC-pPL and two-sided pPL conditions, providing the first global convergence guarantees for stochastic nested and alternating gradient descent-ascent methods, which we believe may be of independent interest.
Paperid:1292
Authors:Siqi Kou · Jiachun Jin · Zhihong Liu · Chang Liu · Ye Ma · jian jia · Quan Chen · Peng Jiang · Zhijie Deng
Abstract:
We introduce Orthus, a unified multimodal model that excels in generating interleaved images and text from mixed-modality inputs by simultaneously handling discrete text tokens and continuous image features under the \textbf{AR} modeling principle. The continuous treatment of visual signals minimizes the information loss while the fully AR formulation renders the characterization of the correlation between modalities straightforward. Orthus leverages these advantages through its modality-specific heads---one regular language modeling (LM) head predicts discrete text tokens and one diffusion head generates continuous image features. We devise an efficient strategy for building Orthus---by substituting the Vector Quantization (VQ) operation in the existing unified AR model with a soft alternative, introducing a diffusion head, and tuning the added modules to reconstruct images, we can create an Orthus-base model effortlessly (e.g., within 72 A100 GPU hours). Orthus-base can further embrace post-training to craft lengthy interleaved image-text, reflecting the potential for handling intricate real-world tasks. For visual understanding and generation, Orthus achieves a GenEval score of 0.58 and an MME-P score of 1265.8 using 7B parameters, outperforming competing baselines including Show-o and Chameleon.
Paperid:1293
Authors:Xinyu Zhao · Fangcong Yin · Greg Durrett
Abstract:
Long-context LLMs are increasingly in demand for applications such as retrieval-augmented generation. To defray the cost of pretraining LLMs over long contexts, recent work takes an approach of synthetic context extension: fine-tuning LLMs with synthetically-generated long-context data. However, it remains unclear how and why this synthetic context extension imparts abilities for downstream long-context tasks. In this paper, we investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning. We vary the realism of "needle'' concepts to be retrieved and diversity of the surrounding "haystack'' context, from using LLMs to construct synthetic documents to using templated relations and creating symbolic datasets. Although models trained on synthetic data underperform models trained on the real data, the impacts of both training settings can be understood via a shared feature of the attention computation, retrieval heads (Wu et al., 2024). The retrieval heads learned from synthetic data have high overlap with retrieval heads learned on real data. Furthermore, there is a strong correlation between the recall of heads learned and the downstream performance of a model, allowing us to interpret and predict the performance of models trained in different settings. Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world LLM capabilities over long contexts.
Paperid:1294
Authors:Robert Geirhos · Priyank Jaini · Austin Stone · Sourabh Medapati · Xi Yi · George Toderici · Abhijit Ogale · Jonathon Shlens
Abstract:
Training a neural network is a monolithic endeavor, akin to carving knowledge into stone: once the process is completed, editing the knowledge in a network is nearly impossible, since all information is distributed across the network's weights. We here explore a simple, compelling alternative by marrying the representational power of deep neural networks with the flexibility of a database. Decomposing the task of image classification into image similarity (from a pretrained embedding) and search (via fast nearest neighbor retrieval from a knowledge database), we build a simple and flexible visual memory that has the following key capabilities: (1.) The ability to flexibly add data across scales: from individual samples all the way to entire classes and billion-scale data; (2.) The ability to remove data through unlearning and memory pruning; (3.) An interpretable decision-mechanism on which we can intervene to control its behavior. Taken together, these capabilities comprehensively demonstrate the benefits of an explicit visual memory. We hope that it might contribute to a conversation on how knowledge should be represented in deep vision models---beyond carving it in "stone" weights.
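The database-backed classifier described here can be captured in a few lines: classification becomes nearest-neighbor voting over an editable store of embeddings. The sketch below is an assumed minimal interface for illustration (synthetic 8-dimensional "embeddings", cosine similarity, majority vote), not the paper's system.

```python
import numpy as np

class VisualMemory:
    """Toy visual memory: kNN over unit-normalized embeddings."""

    def __init__(self):
        self.embs, self.labels = [], []

    def add(self, emb, label):
        """Flexibly add data, one sample at a time."""
        self.embs.append(emb / np.linalg.norm(emb))
        self.labels.append(label)

    def remove(self, label):
        """Unlearning: drop every stored entry of a class."""
        keep = [i for i, l in enumerate(self.labels) if l != label]
        self.embs = [self.embs[i] for i in keep]
        self.labels = [self.labels[i] for i in keep]

    def classify(self, emb, k=3):
        """Interpretable decision: majority vote of the k nearest memories."""
        emb = emb / np.linalg.norm(emb)
        sims = np.array([e @ emb for e in self.embs])
        votes = [self.labels[i] for i in sims.argsort()[-k:]]
        return max(set(votes), key=votes.count)

mem = VisualMemory()
rng = np.random.default_rng(0)
for _ in range(20):  # two synthetic clusters standing in for class embeddings
    mem.add(rng.normal(size=8) + np.array([3, 0, 0, 0, 0, 0, 0, 0]), "cat")
    mem.add(rng.normal(size=8) + np.array([0, 3, 0, 0, 0, 0, 0, 0]), "dog")
pred = mem.classify(np.array([3.0, 0, 0, 0, 0, 0, 0, 0]))
```

Because the decision is a vote over retrieved neighbors, one can inspect exactly which stored samples produced it, and calling `remove` makes a class genuinely unlearnable without touching any weights.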
Paperid:1295
Authors:Qinwen Luo · Ming-Kun Xie · Ye-Wen Wang · Sheng-Jun Huang
Abstract:
Offline reinforcement learning (RL) aims to learn an effective policy from a static dataset. To alleviate extrapolation errors, existing studies often uniformly regularize the value function or policy updates across all states. However, due to substantial variations in data quality, the fixed regularization strength often leads to a dilemma: Weak regularization strength fails to address extrapolation errors and value overestimation, while strong regularization strength shifts policy learning toward behavior cloning, impeding potential performance enabled by Bellman updates. To address this issue, we propose the selective state-adaptive regularization method for offline RL. On the one hand, we propose to trust state-level Bellman-driven results by obtaining state-adaptive regularization coefficients in a learning manner; on the other hand, we selectively perform regularization on a subset of data consisting of high-quality actions, aiming to avoid performance degradation caused by tight constraints on low-quality actions. By establishing a connection between the representative value regularization method, CQL, and explicit policy constraint methods, we effectively extend selective state-adaptive regularization to these two mainstream offline RL approaches. Extensive experiments demonstrate that the proposed method significantly outperforms the state-of-the-art approaches in both offline and offline-to-online settings on the D4RL benchmark.
Paperid:1296
Authors:Amir Najafi · Samin Sani · Farzan Farnia
Abstract:
Consider a network of clients with private, non-IID local datasets governed by an unknown meta-distribution. A central server aims to evaluate the average performance of a given ML model, not only on this network (standard beta or A/B testing) but also on all possible unseen networks that are meta-distributionally similar to it, as measured by either $f$-divergence or Wasserstein distance. To this end, we propose a novel robust optimization framework that can be implemented in a private and federated manner with at most polynomial time and query complexity. Specifically, we introduce a private server-side aggregation technique for local adversarial risks, which provably enables a global robust evaluation of the model. We also establish asymptotically minimax-optimal bounds for the risk average and CDF, with vanishing generalization gaps as the source network size $K$ grows and the minimum local dataset size exceeds $\mathcal{O}\left(\log K\right)$. Empirical results further validate the effectiveness of our bounds in real-world tasks.
Paperid:1297
Authors:Fanqing Meng · Jiaqi Liao · Xinyu Tan · WENQI SHAO · Quanfeng Lu · Kaipeng Zhang · Yu Cheng · Dianqi Li · Ping Luo
Abstract:
Text-to-video (T2V) models like Sora have made significant strides in visualizing complex prompts, which is increasingly viewed as a promising path towards constructing the universal world simulator. Cognitive psychologists believe that the foundation for achieving this goal is the ability to understand intuitive physics. However, the capacity of these models to accurately represent intuitive physics remains largely unexplored. To bridge this gap, we introduce PhyGenBench, a comprehensive \textbf{Phy}sics \textbf{Gen}eration \textbf{Ben}chmark designed to evaluate physical commonsense correctness in T2V generation. PhyGenBench comprises 160 carefully crafted prompts across 27 distinct physical laws, spanning four fundamental domains, comprehensively assessing models' understanding of physical commonsense. Alongside PhyGenBench, we propose a novel evaluation framework called PhyGenEval. This framework employs a hierarchical evaluation structure utilizing appropriate advanced vision-language models and large language models to assess physical commonsense. Through PhyGenBench and PhyGenEval, we can conduct large-scale automated assessments of T2V models' understanding of physical commonsense, which align closely with human feedback. Our evaluation results and in-depth analysis demonstrate that current models struggle to generate videos that comply with physical commonsense. Moreover, simply scaling up models or employing prompt engineering techniques is insufficient to fully address the challenges presented by PhyGenBench (e.g., dynamic scenarios). We hope this study will inspire the community to prioritize the learning of physical commonsense in these models beyond entertainment applications. We will release the data and codes at https://github.com/PhyGenBench/PhyGenBench
Paperid:1298
Authors:Penghao Wu · Lewei Lu · Ziwei Liu
Abstract:
Large multimodal models excel in multimodal tasks but face significant computational challenges due to excessive visual tokens. Unlike token reduction methods that focus on token-level redundancy, we identify and study the computation-level redundancy on vision tokens to ensure no information loss. Our key insight is that vision tokens from the pretrained vision encoder do not necessarily require all the heavy operations (e.g., self-attention, FFNs) in decoder-only LMMs and could be processed more lightly with proper designs. We designed a series of experiments to discover and progressively squeeze out the vision-related computation redundancy. Based on our findings, we propose ProxyV, a novel approach that utilizes proxy vision tokens to alleviate the computational burden on original vision tokens. ProxyV enhances efficiency without compromising performance and can even yield notable performance gains in scenarios with more moderate efficiency improvements. Furthermore, the flexibility of ProxyV is demonstrated through its combination with token reduction methods to boost efficiency further.
Paperid:1299
Authors:Bianca Marin Moreno · Khaled Eldowa · Pierre Gaillard · Margaux Brégère · Nadia Oudjane
Abstract:
We study online learning in episodic finite-horizon Markov decision processes (MDPs) with convex objective functions, known as the concave utility reinforcement learning (CURL) problem. This setting generalizes RL from linear to convex losses on the state-action distribution induced by the agent’s policy. The non-linearity of CURL invalidates classical Bellman equations and requires new algorithmic approaches. We introduce the first algorithm achieving near-optimal regret bounds for online CURL without any prior knowledge on the transition function. To achieve this, we use a novel online mirror descent algorithm with variable constraint sets and a carefully designed exploration bonus. We then address for the first time a bandit version of CURL, where the only feedback is the value of the objective function on the state-action distribution induced by the agent's policy. We achieve a sub-linear regret bound for this more challenging problem by adapting techniques from bandit convex optimization to the MDP setting.
Paperid:1300
Authors:Roman Klypa · Alberto Bietti · Sergei Grudinin
Abstract:
Designing RNA molecules that interact with specific proteins is a critical challenge in experimental and computational biology. Existing computational approaches require a substantial amount of experimentally determined RNA sequences for each specific protein or a detailed knowledge of RNA structure, restricting their utility in practice. To address this limitation, we develop RNABAnG, a deep learning-based model designed to generate RNA sequences for protein interactions without these requirements. Central to our approach is a novel generative method, Bidirectional Anchored Generation (BAnG), which leverages the observation that protein-binding RNA sequences often contain functional binding motifs embedded within broader sequence contexts. We first validate our method on generic synthetic tasks involving similar localized motifs to those appearing in RNAs, demonstrating its benefits over existing generative approaches. We then evaluate our model on biological sequences, showing its effectiveness for conditional RNA sequence design given a binding protein.
Paperid:1301
Authors:Rohit Sonker · Alexandre Capone · Andrew Rothstein · Hiro Kaga · Egemen Kolemen · Jeff Schneider
Abstract:
Machine learning algorithms often struggle to control complex real-world systems. In the case of nuclear fusion, these challenges are exacerbated, as the dynamics are notoriously complex, data is poor, hardware is subject to failures, and experiments often affect dynamics beyond the experiment's duration. Existing tools like reinforcement learning, supervised learning, and Bayesian optimization address some of these challenges but fail to provide a comprehensive solution. To overcome these limitations, we present a multi-scale Bayesian optimization approach that integrates a high-frequency data-driven dynamics model with a low-frequency Gaussian process. By updating the Gaussian process between experiments, the method rapidly adapts to new data, refining the predictions of the less reliable dynamical model. We validate our approach by controlling tearing instabilities in the DIII-D nuclear fusion plant. Offline testing on historical data shows that our method significantly outperforms several baselines. Live experiments on the DIII-D tokamak, conducted under high-performance plasma scenarios prone to instabilities, show a 50\% success rate, an improvement of 117\% compared to historical outcomes.
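The core multi-scale idea, correcting a fast but biased dynamics model with a GP fit to residuals observed between experiments, can be sketched on a one-dimensional toy problem. Everything below (the RBF kernel, the synthetic `f_true`/`f_model` pair, the constant bias) is an illustrative assumption, not the DIII-D setup.

```python
import numpy as np

def rbf(X1, X2, ell=1.0):
    """Squared-exponential kernel between two point sets."""
    d = X1[:, None, :] - X2[None, :, :]
    return np.exp(-0.5 * np.sum(d**2, axis=-1) / ell**2)

def gp_residual_mean(X_train, r_train, X_query, noise=1e-4):
    """GP posterior mean of the residual at the query points."""
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    return rbf(X_query, X_train) @ np.linalg.solve(K, r_train)

f_true = lambda x: np.sin(3 * x[:, 0])           # unknown true response
f_model = lambda x: np.sin(3 * x[:, 0]) - 0.3    # fast model with constant bias

X = np.linspace(0, 1, 8)[:, None]                # past experiment settings
residuals = f_true(X) - f_model(X)               # observed model error

# Between experiments: correct the fast model's prediction with the GP.
Xq = np.array([[0.5]])
corrected = f_model(Xq) + gp_residual_mean(X, residuals, Xq)
```

The corrected prediction lands close to the true response even though the fast model alone is systematically off, which is the mechanism letting the low-frequency GP "refine the predictions of the less reliable dynamical model".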
Paperid:1302
Authors:Anhao Zhao · Fanghua Ye · Yingqi Fan · Junlong Tong · Jing Xiong · Zhiwei Fei · Hui Su · Xiaoyu Shen
Abstract:
Large language models (LLMs) excel in various tasks but are hindered by high computational costs due to their deep and layer-wise structure. Traditional layer pruning methods are static and fail to adapt to the varying computational needs of different tokens. Recent dynamic pruning methods attempt to address this but still encounter three main challenges: (1) Horizontal Dynamics – rigid token-wise or layer-wise compute allocation lacks global flexibility; (2) Vertical Dynamics – pruning entire layers without decoupling MLP and self-attention components ignores their unique roles; (3) Training Paradigm – jointly training routers and model parameters overlooks the differences between pruning and pretraining. To address these challenges, we introduce SkipGPT, a novel dynamic pruning framework that adapts to token complexity. SkipGPT redefines sparsity by globally optimizing compute allocation, decouples MLP and self-attention for selective pruning of components based on their contribution to inference, and adopts a two-stage training paradigm, first optimizing the routing mechanism independently and then applying LoRA fine-tuning to recover lost capacity. Experiments demonstrate that SkipGPT can fully restore or even surpass the original model's performance despite a 25\% reduction in parameters. By bridging the gap between efficiency and expressivity, SkipGPT sets a new direction for scalable LLM inference.
Paperid:1303
Authors:Jinbin Zhang · Nasib Ullah · Erik Schultheis · Rohit Babbar
Abstract:
Large output spaces, also referred to as extreme multi-label classification (XMC), arise, e.g., in large-scale tagging and product-to-product recommendation, and are characterized by label counts ranging from hundreds of thousands to millions. This means that the linear classification head, usually only a tiny fraction of the overall model, turns into the main driver for compute and memory demand. Current state-of-the-art XMC methods predominantly rely on FP16-FP32 mixed-precision training (MPT), which we show can be inefficient in terms of memory usage, computational overhead, and stability. Meanwhile, existing low-precision methods typically retain higher precision for the classification layer. In this work, we propose a pure low-precision training framework for XMC models using BFLOAT16 and FLOAT8 data types. By leveraging Kahan summation and stochastic rounding, we demonstrate that XMC models can be effectively trained entirely in FLOAT8, without relying on single-precision master weights or tensor scaling. Low-precision training, combined with our proposed memory optimizations (gradient fusion and chunking), enables significant reductions in GPU memory usage. For example, we train a 3-million-label XMC model with only 6.6 GiB of GPU memory, compared to the 39.7 GiB required by the optimized SOTA method Renee. Notably, these substantial efficiency gains do not come at the expense of performance: our method outperforms and competes with existing XMC baselines on most public datasets, achieving the non-trivial goal of improving performance under purely low-precision constraints.
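Why Kahan summation matters for low-precision weights can be shown in a few lines: many small optimizer updates are individually below the resolution of a low-precision weight and vanish under naive accumulation, while a higher-precision compensation buffer recovers them. The float16 demo below is an illustrative analogue (the paper targets BFLOAT16/FLOAT8 kernels), not the authors' implementation.

```python
import numpy as np

w_naive = np.float16(1.0)      # weight updated naively in fp16
w_kahan = np.float16(1.0)      # weight updated with Kahan compensation
comp = np.float32(0.0)         # fp32 buffer carrying the lost low bits
step = np.float32(1e-4)        # update smaller than the fp16 ulp near 1.0

for _ in range(10000):         # true total increment: 10000 * 1e-4 = 1.0
    # Naive: 1.0 + 1e-4 rounds back to 1.0 in fp16, so every update is lost.
    w_naive = np.float16(w_naive + step)

    # Kahan: accumulate the rounding error in `comp` and feed it back in,
    # so the small steps eventually cross the fp16 quantization boundary.
    y = step - comp
    t = np.float16(np.float32(w_kahan) + y)
    comp = (np.float32(t) - np.float32(w_kahan)) - y
    w_kahan = t
```

After the loop, the naive weight is still exactly 1.0 while the Kahan-compensated weight sits at roughly 2.0 (up to fp16 resolution), which is the effect that makes master-weight-free low-precision training viable.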
Paperid:1304
Authors:Yuchen Zeng · Tuan Dinh · Wonjun Kang · Andreas Mueller
Abstract:
Leveraging the in-context learning (ICL) capability of Large Language Models (LLMs) for tabular classification has gained significant attention for its training-free adaptability across diverse datasets. Recent advancements, like TabPFN, excel on small-scale tabular datasets but struggle to scale to large and complex datasets. Our work enhances the efficiency and scalability of TabPFN for larger datasets by incorporating linear attention mechanisms as a scalable alternative to quadratic-complexity self-attention. Our model, TabFlex, efficiently handles tabular datasets with thousands of features and hundreds of classes, scaling seamlessly to millions of samples. For instance, TabFlex processes the poker-hand dataset with over a million samples in just 5 seconds. Our extensive evaluations demonstrate that TabFlex can achieve over a 2× speedup compared to TabPFN and a 1.5× speedup over XGBoost, outperforming 25 tested baselines in terms of efficiency across a diverse range of datasets. Furthermore, TabFlex remains highly effective on large-scale datasets, delivering strong performance with significantly reduced computational costs, especially when combined with data-efficient techniques such as dimensionality reduction and data sampling.
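The quadratic-to-linear trick referenced here relies on a kernel feature map so that attention can be computed by associativity in O(n) instead of forming the n×n score matrix. A generic linear-attention sketch (not TabFlex's actual architecture), using the ELU+1 feature map common in the linear-attention literature:

```python
import numpy as np

def phi(x):
    # ELU(x) + 1: a positive-valued feature map commonly used in linear attention
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """O(n) attention: associativity lets us form the small (d, d_v) summary
    phi(K)^T V once, instead of the (n, n) score matrix."""
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V             # (d, d_v) summary, independent of sequence length
    z = Qf @ Kf.sum(axis=0)   # (n,) per-query normalizer
    return (Qf @ kv) / z[:, None]

def quadratic_reference(Q, K, V):
    """The same kernel attention computed the O(n^2) way, for checking."""
    W = phi(Q) @ phi(K).T
    W = W / W.sum(axis=1, keepdims=True)
    return W @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 8)) for _ in range(3))
```

Both routes give identical outputs up to floating-point error; the saving is that `kv` and `z` cost O(n·d·d_v) rather than O(n²) in the sequence length.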
Paperid:1305
Authors:Tao Jin · Yue Wu · Quanquan Gu · Farzad Farnoud
Abstract:
We study the problem of efficiently aggregating the preferences of items from multiple information sources (oracles) and inferring the ranking under both the weak stochastic transitivity (WST) and the strong stochastic transitivity (SST) conditions. When the underlying preference model satisfies the WST condition, we propose an algorithm named RMO-WST, which has a bi-level design: at the higher level, it actively allocates comparison budgets to all undetermined pairs until the full ranking is recovered; at the lower level, it attempts to compare the pair of items and selects the more accurate oracles simultaneously. We prove that the sample complexity of RMO-WST is $\tilde O( N\sum_{i=2}^{N}H_{\sigma^{-1}(i),\sigma^{-1}(i-1)} )$, where $N$ is the number of items to rank, $H$ is a problem-dependent hardness factor, and $\sigma^{-1}(i)$ represents the $i$-th best item. We also provide a tight lower bound that matches the upper bound of approximate ranking under the WST condition, answering a previously open problem. In addition, when the SST condition is satisfied, we propose an algorithm named RMO-SST, which can achieve an $\tilde{O}(\sum_{i=1}^{N} H_i \log(N))$ sample complexity. This outperforms the best-known sample complexity by a factor of $\log(N)$. The theoretical advantages of our algorithms are verified by empirical experiments in a simulated environment.
Paperid:1306
Authors:Alon Beck · Noam Levi · Yohai Bar-Sinai
Abstract:
We investigate the phenomenon of grokking (delayed generalization accompanied by non-monotonic test loss behavior) in a simple binary logistic classification task, for which "memorizing" and "generalizing" solutions can be strictly defined. Surprisingly, we find that grokking arises naturally even in this minimal model when the parameters of the problem are close to a critical point, and provide both empirical and analytical insights into its mechanism. Concretely, by appealing to the implicit bias of gradient descent, we show that when the training dataset is highly unbalanced, as well as on the verge of being linearly separable from the origin, logistic regression can exhibit grokking. The underlying reason is that near the critical point, "flat" directions in the loss landscape with nearly zero gradient cause training dynamics to linger for arbitrarily long times near quasi-stable solutions before eventually reaching the global minimum. Finally, we highlight similarities between our findings and the recent literature, strengthening the conjecture that grokking generally occurs in proximity to the interpolation threshold, reminiscent of critical phenomena often observed in physical systems.
Paperid:1307
Authors:Juan L. Gamella · Simon Bing · Jakob Runge
Abstract:
We evaluate methods for causal representation learning (CRL) on a simple, real-world system where these methods are expected to work. The system consists of a controlled optical experiment specifically built for this purpose, which satisfies the core assumptions of CRL and where the underlying causal factors---the inputs to the experiment---are known, providing a ground truth. We select methods representative of different approaches to CRL and find that they all fail to recover the underlying causal factors. To understand the failure modes of the evaluated algorithms, we perform an ablation on the data by substituting the real data-generating process with a simpler synthetic equivalent. The results reveal a reproducibility problem, as most methods already fail on this synthetic ablation despite its simple data-generating process. Additionally, we observe that common assumptions on the mixing function are crucial for the performance of some of the methods but do not hold in the real data. Our efforts highlight the contrast between the theoretical promise of the state of the art and the challenges in its application. We hope the benchmark serves as a simple, real-world sanity check to further develop and validate methodology, bridging the gap towards CRL methods that work in practice. We make all code and datasets publicly available.
Paperid:1308
Authors:Xinyu Liu · Zixuan Xie · Shangtong Zhang
Abstract:
$Q$-learning is one of the most fundamental reinforcement learning algorithms. Previously, it was widely believed that $Q$-learning with linear function approximation (i.e., linear $Q$-learning) suffers from possible divergence. This paper instead establishes the first $L^2$ convergence rate of linear $Q$-learning to a bounded set. Notably, we do not make any modification to the original linear $Q$-learning algorithm, do not make any Bellman completeness assumption, and do not make any near-optimality assumption on the behavior policy. All we need is an $\epsilon$-softmax behavior policy with an adaptive temperature. The key to our analysis is a general result on stochastic approximation under Markovian noise with fast-changing transition functions. As a side product, we also use this general result to establish the $L^2$ convergence rate of tabular $Q$-learning with an $\epsilon$-softmax behavior policy, for which we rely on a novel pseudo-contraction property of the weighted Bellman optimality operator.
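The $\epsilon$-softmax behavior policy the analysis relies on mixes uniform exploration with a softmax over the current Q-estimates. A minimal sketch (with a fixed temperature rather than the paper's adaptive schedule):

```python
import numpy as np

def eps_softmax(q_values, eps=0.1, temperature=1.0):
    """pi = eps * uniform + (1 - eps) * softmax(q / temperature).
    Every action keeps probability at least eps / |A|, which is what
    guarantees sufficient exploration under the behavior policy."""
    q = np.asarray(q_values, dtype=float) / temperature
    q = q - q.max()  # shift logits for numerical stability
    soft = np.exp(q) / np.exp(q).sum()
    n = len(q)
    return eps / n + (1.0 - eps) * soft

probs = eps_softmax([1.0, 2.0, 0.5], eps=0.2)
```

As the temperature shrinks toward zero the softmax concentrates on the greedy action, so the policy approaches ordinary epsilon-greedy; adapting the temperature lets the analysis interpolate between the two regimes.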
Paperid:1309
Authors:Nikita Morozov · Ian Maksimov · Daniil Tiapkin · Sergey Samsonov
Abstract:
Generative Flow Networks (GFlowNets) are a family of generative models that learn to sample objects from a given probability distribution, potentially known up to a normalizing constant. Instead of working in the object space, GFlowNets proceed by sampling trajectories in an appropriately constructed directed acyclic graph environment, greatly relying on the acyclicity of the graph. In our paper, we revisit the theory that relaxes the acyclicity assumption and present a simpler theoretical framework for non-acyclic GFlowNets in discrete environments. Moreover, we provide various novel theoretical insights related to training with fixed backward policies, the nature of flow functions, and connections between entropy-regularized RL and non-acyclic GFlowNets, which naturally generalize the respective concepts and theoretical results from the acyclic setting. In addition, we experimentally re-examine the concept of loss stability in non-acyclic GFlowNet training, as well as validate our own theoretical findings.
Paperid:1310
Authors:Hongtu Zhou · Ruiling Yang · Yakun Zhu · Haoqi Zhao · Hai Zhang · Di Zhang · Junqiao Zhao · Chen Ye · Changjun Jiang
Abstract:
Existing context-based offline meta-reinforcement learning (COMRL) methods primarily focus on task representation learning and given-context adaptation performance. They often assume that the adaptation context is collected using task-specific behavior policies or through multiple rounds of collection. However, in real applications, the context should be collected by a policy in a one-shot manner to ensure efficiency and safety. We find that intrinsic context ambiguity across multiple tasks and out-of-distribution (OOD) issues due to distribution shift significantly affect the performance of one-shot adaptation, which has been largely overlooked in most COMRL research. To address this problem, we propose using heteroscedastic uncertainty in representation learning to identify ambiguous and OOD contexts, and train an uncertainty-aware context-collecting policy for effective one-shot online adaptation. The proposed method can be integrated into various COMRL frameworks, including classifier-based, reconstruction-based and contrastive learning-based approaches. Empirical evaluations on benchmark tasks show that our method can improve one-shot adaptation performance by up to 36% and zero-shot adaptation performance by up to 34% compared to existing baseline COMRL methods.
Paperid:1311
Authors:Arnold Filtser · Shaofeng Jiang · Yi Li · Anurag Murty Naredla · Ioannis Psarros · Qiaoyuan Yang · Qin Zhang
Abstract:
We study efficient algorithms for the Euclidean $k$-Center problem, focusing on the regime of large $k$. We take the approach of data reduction by considering an $\alpha$-coreset, which is a small subset $S$ of the dataset $P$ such that any $\beta$-approximation on $S$ is an $(\alpha + \beta)$-approximation on $P$. We give efficient algorithms to construct coresets whose size is $k \cdot o(n)$, which immediately speeds up existing approximation algorithms. Notably, we obtain a near-linear time $O(1)$-approximation when $k = n^c$ for any $0 < c < 1$. We validate the performance of our coresets on real-world datasets with large $k$, and we observe that the coreset speeds up the well-known Gonzalez algorithm by up to $4$ times, while still achieving similar clustering cost. Technically, one of our coreset results is based on a new efficient construction of consistent hashing with competitive parameters. This general tool may be of independent interest for algorithm design in high dimensional Euclidean spaces.
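The Gonzalez algorithm that the coreset accelerates is the classical farthest-first traversal, a 2-approximation for k-center. A compact NumPy version, shown on synthetic clustered data for illustration:

```python
import numpy as np

def gonzalez(points, k):
    """Farthest-first traversal: start from an arbitrary point, then
    repeatedly add the point farthest from the current center set.
    A classical 2-approximation for the k-center objective
    (minimize the maximum point-to-nearest-center distance)."""
    centers = [0]
    dists = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))  # farthest point from current centers
        centers.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(points - points[nxt], axis=1))
    return centers, float(dists.max())  # chosen indices and clustering radius

rng = np.random.default_rng(0)
# three well-separated synthetic clusters
pts = np.concatenate([c + 0.3 * rng.standard_normal((50, 2))
                      for c in (np.zeros(2), np.array([10.0, 0.0]),
                                np.array([0.0, 10.0]))])
centers, radius = gonzalez(pts, k=3)
```

A coreset-based pipeline would run the same traversal on the small subset $S$ rather than all of $P$, which is where the reported speedups come from.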
Paperid:1312
Authors:David Salinas · Omar Swelam · Frank Hutter
Abstract:
Evaluating Large Language Models (LLMs) often requires costly human annotations. To address this, LLM-based judges have been proposed, which compare the outputs of two LLMs, enabling the ranking of models without human intervention. While several approaches have been proposed, many confounding factors are present across different papers: for instance, the model, the prompt, and other hyperparameters are typically changed at the same time, making apples-to-apples comparisons challenging. In this paper, we propose to systematically analyze and tune the hyperparameters of LLM judges. To alleviate the high cost of evaluating a judge, we propose to leverage multi-objective, multi-fidelity optimization, which finds judges that trade accuracy for cost and also significantly reduces the cost of the search. Our method identifies judges that not only outperform existing benchmarks in accuracy and cost-efficiency but also utilize open-weight models, ensuring greater accessibility and reproducibility.
Paperid:1313
Authors:Kiran Purohit · Venktesh V · Sourangshu Bhattacharya · Avishek Anand
Abstract:
The in-context learning paradigm with LLMs has been instrumental in advancing a wide range of natural language processing tasks. The selection of few-shot examples (exemplars / demonstration samples) is essential for constructing effective prompts under context-length budget constraints. In this paper, we formulate the exemplar selection task as a top-m best arms identification problem. A key challenge in this setup is the exponentially large number of arms that need to be evaluated to identify the m-best arms. We propose CASE (Challenger Arm Sampling for Exemplar selection), a novel sample-efficient selective exploration strategy that maintains a shortlist of "challenger" arms, which are current candidates for the top-m arms. In each iteration, only one of the arms from this shortlist or the current top-m set is pulled, thereby reducing sample complexity and, consequently, the number of LLM evaluations. Furthermore, we model the scores of exemplar subsets (arms) using a parameterized linear scoring function, leading to a stochastic linear bandit setting. Our sample-efficient method, CASE, achieves remarkable efficiency gains of up to a 7× speedup in runtime while requiring 7× fewer LLM calls (an 87% reduction) without sacrificing performance compared to state-of-the-art exemplar selection methods. We release our code and data (https://anonymous.4open.science/r/CASEexemplarbandits-7403).
Paperid:1314
Authors:Feifei Li · Mi Zhang · Zhaoxiang Wang · Min Yang
Abstract:
Point clouds (PCs), composed of thousands of 3D points that represent object surfaces, serve as inputs for deep learning-based PC models. Interpretability of PC models becomes imperative given their deployment in safety-critical scenarios such as autonomous vehicles. Our work primarily focuses on attributing PC model outputs to a group of interpretable critical concepts (i.e., subsets of the PC). To enable human-understandable diagnostics of model failures, an ideal critical subset should be faithful (preserving points that causally influence predictions) and conceptually coherent (forming semantically meaningful structures that align with human perception). Existing methods infer points' influence using values of gradients or activations, yet neglect conceptual cohesion. We propose InfoCons, an explanation framework that applies information-theoretic principles to decompose the point cloud into 3D concepts, enabling the examination of their causal effect on model predictions with learnable priors. We evaluate InfoCons on synthetic datasets for classification, comparing it qualitatively and quantitatively with four baselines. Additionally, we demonstrate the scalability of InfoCons across eight PC models, two real-world datasets, and two applications that utilize critical-guided methods. The results show that InfoCons excels at faithfully attributing incorrect predictions and demonstrates scalability and flexibility in its applications.
Paperid:1315
Authors:Ruiqi Zhang · Jingfeng Wu · Peter Bartlett
Abstract:
We analyze the convergence of gradient descent (GD) with large, adaptive stepsizes for logistic regression on linearly separable data. The stepsize adapts to the current risk, scaled by a fixed base stepsize $\eta$. We prove that once the number of iterates $t$ surpasses a margin-dependent threshold, the averaged GD iterate achieves a risk upper bound of $\exp(-\Theta(\eta t))$, where $\eta$ can be chosen arbitrarily large. This implies that GD attains arbitrarily fast convergence rates via large stepsizes, although the risk evolution might not be monotonic. In contrast, prior adaptive-stepsize GD analyses require a monotonic risk decrease, limiting their rates to $\exp(-\Theta(t))$. We further establish a margin-dependent lower bound on the iteration complexity for any first-order method to attain a small risk, justifying the necessity of the burn-in phase in our analysis. Our results generalize to a broad class of loss functions and two-layer networks under additional assumptions.
Paperid:1316
Authors:Amit Attia · Ofir Gaash · Tomer Koren
Abstract:
We consider the problem of asynchronous stochastic optimization, where an optimization algorithm makes updates based on stale stochastic gradients of the objective that are subject to an arbitrary (possibly adversarial) sequence of delays. We present a procedure which, for any given $q \in (0,1]$, transforms any standard stochastic first-order method to an asynchronous method with convergence guarantee depending on the $q$-quantile delay of the sequence. This approach leads to convergence rates of the form $O(\tau_q/qT+\sigma/\sqrt{qT})$ for non-convex and $O(\tau_q^2/(q T)^2+\sigma/\sqrt{qT})$ for convex smooth problems, where $\tau_q$ is the $q$-quantile delay, generalizing and improving on existing results that depend on the average delay. We further show a method that automatically adapts to all quantiles simultaneously, without any prior knowledge of the delays, achieving convergence rates of the form $O(\inf_{q} \tau_q/qT+\sigma/\sqrt{qT})$ for non-convex and $O(\inf_{q} \tau_q^2/(q T)^2+\sigma/\sqrt{qT})$ for convex smooth problems. Our technique is based on asynchronous mini-batching with a careful batch-size selection and filtering of stale gradients.
Paperid:1317
Authors:Letong Hong · Zhangyang “Atlas” Wang
Abstract:
Maximal Update Parameterization ($\mu$P) has shown significant promise in allowing zero-shot hyperparameter transfer across neural network scales, reducing the prohibitive cost of hyperparameter tuning for large models. However, the theoretical foundation behind the observed approximate transferability of hyperparameters remains underexplored. We establish the first fundamental separation of scales in $\mu$P between macro-variables (e.g., loss landscapes) and micro-variables (e.g., individual weights). Our formulation explains why hyperparameter tuning can be effectively performed in early training stages, i.e., early statistics effectively approximate global hyperparameter optima, implying the potential to significantly reduce the total number of training steps required for identifying optimal hyperparameters and minimizing search cost. We further apply our main theory to explain two empirical deep learning phenomena discovered independently by prior work, through one unified lens.
Paperid:1318
Authors:Enric Borrell · Lorenz Richter · Christof Schuette
Abstract:
We extend the standard reinforcement learning framework to random time horizons. While the classical setting typically assumes finite and deterministic or infinite runtimes of trajectories, we argue that multiple real-world applications naturally exhibit random (potentially trajectory-dependent) stopping times. Since those stopping times typically depend on the policy, their randomness has an effect on policy gradient formulas, which we (mostly for the first time) derive rigorously in this work, both for stochastic and deterministic policies. We present two complementary perspectives, trajectory-based and state-space-based, and establish connections to optimal control theory. Our numerical experiments demonstrate that using the proposed formulas can significantly improve optimization convergence compared to traditional approaches.
Paperid:1319
Authors:Vint Lee · Minh Nguyen · Leena Elzeiny · Chun Deng · Pieter Abbeel · Wawrzynek
Abstract:
Macro placement is a vital step in digital circuit design that defines the physical location of large collections of components, known as macros, on a 2-dimensional chip. The physical layout obtained during placement determines key performance metrics of the chip, such as power consumption, area, and performance. Existing learning-based methods typically fall short because of their reliance on reinforcement learning, which is slow and limits the flexibility of the agent by casting placement as a sequential process. Instead, we use a diffusion model to place all components simultaneously. To enable such models to train at scale, we propose a novel architecture for the denoising model, as well as an algorithm to generate large synthetic datasets for pre-training. We empirically show that our model, when trained on synthetic data, can tackle the placement task, and achieve competitive performance on placement benchmarks compared to state-of-the-art methods.
Paperid:1320
Authors:Walter Mayor · Johan Obando Ceron · Aaron Courville · Pablo Samuel Castro
Abstract:
The use of parallel actors for data collection has been an effective technique used in reinforcement learning (RL) algorithms. The manner in which data is collected in these algorithms, controlled via the number of parallel environments and the rollout length, induces a form of bias-variance trade-off; the number of training passes over the collected data, on the other hand, must strike a balance between sample efficiency and overfitting. We conduct an empirical analysis of these trade-offs on PPO, one of the most popular RL algorithms that uses parallel actors, and establish connections to network plasticity and, more generally, optimization stability. We examine its impact on network architectures, as well as the hyper-parameter sensitivity when scaling data. Our analyses indicate that larger dataset sizes can increase final performance across a variety of settings, and that scaling parallel environments is more effective than increasing rollout lengths. These findings highlight the critical role of data collection strategies in improving agent performance.
Paperid:1321
Authors:Pan Du · Zhao · Xinai Lu · Nian Liu · Zhikai Li · Chaoyu Gong · Suyun Zhao · Hong Chen · Cuiping Li · Kai Wang · Yang You
Abstract:
Class distribution mismatch (CDM) refers to the discrepancy between class distributions in training data and target tasks. Previous methods address this by designing classifiers to categorize classes known during training, while grouping unknown or new classes into an "other" category. However, they focus on semi-supervised scenarios and heavily rely on labeled data, limiting their applicability and performance. To address this, we propose Unsupervised Learning for Class Distribution Mismatch (UCDM), which constructs positive-negative pairs from unlabeled data for classifier training. Our approach randomly samples images and uses a diffusion model to add or erase semantic classes, synthesizing diverse training pairs. Additionally, we introduce a confidence-based labeling mechanism that iteratively assigns pseudo-labels to valuable real-world data and incorporates them into the training process. Extensive experiments on three datasets demonstrate UCDM's superiority over previous semi-supervised methods. Specifically, with a 60% mismatch proportion on the Tiny-ImageNet dataset, our approach, without relying on labeled data, surpasses OpenMatch (with 40 labels per class) by 35.1%, 63.7%, and 72.5% in classifying known, unknown, and new classes, respectively.
Paperid:1322
Authors:Xu Zhang · Kaidi Xu · Ziqing Hu · Ren Wang
Abstract:
Mixture of Experts (MoE) models have shown remarkable success in leveraging specialized expert networks for complex machine learning tasks. However, their susceptibility to adversarial attacks presents a critical challenge for deployment in robust applications. This paper addresses the critical question of how to incorporate robustness into MoEs while maintaining high natural accuracy. We begin by analyzing the vulnerability of MoE components, finding that expert networks are notably more susceptible to adversarial attacks than the router. Based on this insight, we propose a targeted robust training technique that integrates a novel loss function to enhance the adversarial robustness of MoE, requiring only the robustification of one additional expert without compromising training or inference efficiency. Building on this, we introduce a dual-model strategy that linearly combines a standard MoE model with our robustified MoE model using a smoothing parameter. This approach allows for flexible control over the robustness-accuracy trade-off. We further provide theoretical foundations by deriving certified robustness bounds for both the single MoE and the dual-model. To push the boundaries of robustness and accuracy, we propose a novel joint training strategy, JTDMoE, for the dual-model. This joint training enhances both robustness and accuracy beyond what is achievable with separate models. Experimental results on the CIFAR-10 and TinyImageNet datasets using ResNet18 and Vision Transformer (ViT) architectures demonstrate the effectiveness of our proposed methods.
Paperid:1323
Authors:Nabeel Seedat · M van der Schaar
Abstract:
Schema matching -- the task of finding matches between attributes across disparate data sources with different tables and hierarchies -- is critical for creating interoperable machine learning (ML)-ready data. Addressing this fundamental data-centric problem has wide implications, especially in domains like healthcare, finance and e-commerce --- but also has the potential to benefit ML models more generally, by increasing the data available for ML model training. However, schema matching is a challenging ML task due to structural/hierarchical and semantic heterogeneity between different schemas. Previous ML approaches to automate schema matching have either required significant labeled data for model training, which is often unrealistic, or suffered from poor zero-shot performance. To this end, we propose Matchmaker - a compositional language model program for schema matching, comprised of candidate generation, refinement and confidence scoring. Matchmaker also self-improves in a zero-shot manner without the need for labeled demonstrations via a novel optimization approach, which constructs synthetic in-context demonstrations to guide the language model's reasoning process. Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches, highlighting its potential to accelerate data integration and interoperability of ML-ready data.
Paperid:1324
Authors:Yi Xu · Laura Ruis · Tim Rocktäschel · Robert Kirk
Abstract:
Automatic evaluation methods based on large language models (LLMs) are emerging as the standard tool for assessing the instruction-following abilities of LLM-based agents. The most common method in this paradigm, pairwise comparisons with a baseline model, critically depends on the assumption of transitive preferences. However, the validity of this assumption remains largely unexplored. In this study, we investigate the presence of non-transitivity within the AlpacaEval framework and analyze its effects on model rankings. We find that LLM judges exhibit non-transitive preferences, leading to rankings that are sensitive to the choice of the baseline model. To mitigate this issue, we show that round-robin tournaments combined with Bradley-Terry models of preference can produce more reliable rankings. Notably, our method increases both the Spearman correlation and the Kendall correlation with Chatbot Arena (95.0\% $\rightarrow$ 96.4\% and 82.1\% $\rightarrow$ 86.3\% respectively). To address the computational cost of round-robin tournaments, we propose Swiss-Wise Iterative Matchmaking (Swim) tournaments, using a dynamic matching strategy to capture the benefits of round-robin tournaments while maintaining computational efficiency.
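The Bradley-Terry model assigns each model $i$ a strength $w_i$ with $P(i \text{ beats } j) = w_i/(w_i + w_j)$; fitting it to a round-robin of pairwise wins yields a single consistent ranking even when the raw pairwise preferences cycle. A minimal fit via the classical Zermelo/MM updates (an illustrative sketch, not the authors' code):

```python
import numpy as np

def bradley_terry(wins, n_iter=500):
    """Fit strengths w from a pairwise win-count matrix (wins[i, j] = number
    of times i beat j) using the classical MM updates:
    w_i <- (total wins of i) / sum_{j != i} n_ij / (w_i + w_j)."""
    wins = np.asarray(wins, dtype=float)
    n_games = wins + wins.T  # total comparisons per pair
    w = np.ones(len(wins))
    for _ in range(n_iter):
        for i in range(len(w)):
            denom = sum(n_games[i, j] / (w[i] + w[j])
                        for j in range(len(w)) if n_games[i, j] > 0)
            if denom > 0:
                w[i] = wins[i].sum() / denom
        w = w / w.sum()  # strengths are only defined up to scale
    return w

# A toy head-to-head win table for three models
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]])
strengths = bradley_terry(wins)
```

Sorting by `strengths` gives the tournament ranking; the non-transitivity problem with a single fixed baseline disappears because every pair contributes to the fit.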
Paperid:1325
Authors:Junyi Chai · Taeuk Jang · Jing Gao · Xiaoqian Wang
Abstract:
While numerous methods have been proposed to address fairness in machine learning, existing methods do not guarantee fair predictions under imperceptible feature perturbation, and a seemingly fair model can suffer from large group-wise disparities under such perturbation. Moreover, while adversarial training has been shown to be reliable in improving a model's robustness to defend against adversarial feature perturbations that deteriorate accuracy, it has not been properly studied in the context of adversarial perturbations against fairness. To tackle these challenges, in this paper we study the problem of adversarial attack and adversarial robustness w.r.t. two terms: fairness and accuracy. From the adversarial attack perspective, we propose a unified structure for adversarial attacks against fairness which brings together common notions in group fairness, and we theoretically prove the equivalence of adversarial attacks against different fairness notions. Further, we derive the connections between adversarial attacks against fairness and those against accuracy. From the adversarial robustness perspective, we theoretically align robustness to adversarial attacks against fairness and accuracy, where robustness w.r.t. one term enhances robustness w.r.t. the other term. Our study suggests a novel way to unify adversarial training w.r.t. fairness and accuracy, and experiments show our proposed method achieves better robustness w.r.t. both terms.
Paperid:1326
Authors:Jun-En Ding · Dongsheng Luo · Chenwei Wu · Feng Liu
Abstract:
Analyzing functional brain networks using functional magnetic resonance imaging (fMRI) is crucial for understanding psychiatric disorders and addictive behaviors. Existing fMRI-based graph convolutional networks (GCNs) show significant potential for feature learning in brain networks. However, they are limited in their ability to accurately characterize inter-regional relationships when modeling complex brain networks and in accounting for the influence of interpretable variables in psychiatric disorders. To address these limitations, we propose NeuroTree, which integrates k-hop GCN with neural ordinary differential equations (ODEs). This framework optimizes functional connectivity (FC) strength to enhance dynamic FC feature learning for brain disease classification. Furthermore, it captures high-order brain regional pathway features in a tree topology, enabling the identification of hierarchical neural behavioral patterns from fMRI data. Our empirical evaluations demonstrate that NeuroTree achieves state-of-the-art performance across two distinct mental disorder datasets and provides valuable insights into age-related deterioration patterns. These findings underscore the model's effectiveness in predicting psychiatric disorders and elucidating their underlying neural mechanisms.
Paperid:1327
Authors:Valentyn Boreiko · Alexander Panfilov · Václav Voráček · Matthias Hein · Jonas Geiping
Abstract:
A plethora of jailbreaking attacks have been proposed to obtain harmful responses from safety-tuned LLMs. These methods largely succeed in coercing the target output in their original settings, but their attacks vary substantially in fluency and computational effort. In this work, we propose a unified threat model for the principled comparison of these methods. Our threat model checks if a given jailbreak is likely to occur in the distribution of text. For this, we build an N-gram language model on 1T tokens, which, unlike model-based perplexity, allows for an LLM-agnostic, non-parametric, and inherently interpretable evaluation. We adapt popular attacks to this threat model, and, for the first time, benchmark these attacks on equal footing with it. After a rigorous comparison, we find attack success rates against safety-tuned modern models to be lower than previously presented and that attacks based on discrete optimization significantly outperform recent LLM-based attacks. Being inherently interpretable, our threat model allows for a comprehensive analysis and comparison of jailbreak attacks. We find that effective attacks exploit and abuse infrequent words, either selecting phrases absent from real-world text or rare ones, e.g., specific to Reddit or code datasets.
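A toy version of the idea, scoring text by its average smoothed bigram log-probability under a reference corpus, illustrates how an N-gram model flags text that is unlikely under the natural distribution. The paper's model is trained on 1T tokens; the corpus and tokens here are purely illustrative:

```python
import math
from collections import Counter

def train_bigram_scorer(corpus_tokens):
    """Return a function giving the average add-one-smoothed bigram
    log-probability per transition; lower scores mean less natural text."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab = len(unigrams)

    def score(tokens):
        lps = [math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
               for a, b in zip(tokens, tokens[1:])]
        return sum(lps) / len(lps)

    return score

corpus = "the cat sat on the mat the cat sat on the rug".split()
score = train_bigram_scorer(corpus)
natural = score("the cat sat on the mat".split())
unnatural = score("mat rug the sat cat".split())
```

A threat model of this kind then thresholds the score: candidate jailbreak strings whose per-token probability falls below what natural text attains are ruled out, independently of any particular LLM.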
Paperid:1328
Authors:Allen Nie · Yi Su · Bo Chang · Jonathan Lee · Ed Chi · Quoc Le · Minmin Chen
Abstract:
Despite their success in many domains, large language models (LLMs) remain understudied in scenarios requiring optimal decision-making under uncertainty. This is crucial as many real-world applications, ranging from personalized recommendations to healthcare interventions, demand that LLMs not only predict but also actively learn to make optimal decisions through exploration. In this work, we measure LLMs' (in)ability to make optimal decisions in bandits, a stateless reinforcement learning setting relevant to many applications. We develop a comprehensive suite of environments, including both context-free and contextual bandits with varying task difficulties, to benchmark LLMs' performance. Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs: by providing explicit algorithm-guided support during inference; and through algorithm distillation via in-context demonstrations and fine-tuning, using synthetic data generated from these algorithms. Impressively, these techniques allow us to achieve superior exploration performance with smaller models, surpassing larger models on various tasks. We conducted an extensive ablation study to shed light on various factors, such as task difficulty and data representation, that influence the efficiency of LLM exploration. Additionally, we conduct a rigorous analysis of the LLM's exploration efficiency using the concept of regret, linking its ability to explore to the model size and underlying algorithm.
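One canonical "optimal exploration algorithm" of the kind that can be distilled into an LLM is UCB. A minimal context-free version (a generic sketch, not the paper's exact algorithmic guide):

```python
import numpy as np

def ucb1(pull, n_arms, horizon, c=2.0):
    """UCB1: play each arm once, then repeatedly pick the arm maximizing
    empirical mean + sqrt(c * ln(t) / n_i), balancing exploration and
    exploitation via the confidence bonus."""
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialization: one pull per arm
        else:
            bonus = np.sqrt(c * np.log(t) / counts)
            arm = int(np.argmax(sums / counts + bonus))
        sums[arm] += pull(arm)
        counts[arm] += 1
    return counts, sums

rng = np.random.default_rng(0)
means = [0.2, 0.8]  # two Bernoulli arms; arm 1 is better
counts, sums = ucb1(lambda a: float(rng.random() < means[a]), 2, 2000)
```

After the horizon, the better arm dominates the pull counts; the bonus term shrinks as an arm is sampled, so suboptimal arms are pulled only O(log T) times, which is the regret behavior the paper measures LLM agents against.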
Paperid:1329
Authors:Tong Yang · Bo Dai · Lin Xiao · Yuejie Chi
Abstract:
Multi-agent reinforcement learning (MARL) lies at the heart of a plethora of applications involving the interaction of a group of agents in a shared unknown environment. A prominent framework for studying MARL is Markov games, with the goal of finding various notions of equilibria in a sample-efficient manner, such as the Nash equilibrium (NE) and the coarse correlated equilibrium (CCE). However, existing sample-efficient approaches either require tailored uncertainty estimation under function approximation, or careful coordination of the players. In this paper, we propose a novel model-based algorithm, called VMG, that incentivizes exploration via biasing the empirical estimate of the model parameters towards those with higher collective best-response values of all the players when fixing the other players' policies, thus encouraging the policy to deviate from its current equilibrium for more exploration. VMG is oblivious to different forms of function approximation, and permits simultaneous and uncoupled policy updates of all players. Theoretically, we also establish that VMG achieves a near-optimal regret for finding both the NEs of two-player zero-sum Markov games and CCEs of multi-player general-sum Markov games under linear function approximation in an online environment, which nearly match their counterparts with sophisticated uncertainty quantification.
Paperid:1330
Authors:Jeonghoon Kim · Byeongchan Lee · Cheonbok Park · Yeontaek Oh · Beomjun Kim · Taehwan Yoo · Seongjin Shin · Dongyoon Han · Jinwoo Shin · Kang Min Yoo
Abstract:
Designing Transformer architectures with the optimal layer normalization (LN) strategy that ensures large-scale training stability and expedites convergence has remained elusive, even in this era of large language models (LLMs). To this end, we present a comprehensive analytical foundation for understanding how different LN strategies influence training dynamics in large-scale Transformer training. Until recently, Pre-LN and Post-LN have long dominated standard practices despite their limitations in large-scale training. However, several open-source large-scale models have recently begun silently adopting a third strategy without much explanation. This strategy places layer normalization (LN) peripherally around sublayers, a design we term Peri-LN. While Peri-LN has demonstrated promising empirical performance, its precise mechanisms and benefits remain almost unexplored. Our in-depth analysis shows that Peri-LN strikes an ideal balance in variance growth---unlike Pre-LN and Post-LN, which are prone to vanishing gradients and ``massive activations.'' To validate our theoretical insight, we conduct large-scale experiments on Transformers up to $3.2$B parameters, showing that Peri-LN consistently achieves more balanced variance growth, steadier gradient flow, and convergence stability. Our results suggest that Peri-LN warrants broader consideration for large-scale Transformer architectures, providing renewed insights into the optimal placement and application of LN.
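The three placements can be compared in a toy residual block. The Peri-LN branch below follows the abstract's description of LN applied peripherally (before and after) each sublayer; it is an assumption-laden illustration, not the paper's architecture.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def block(x, sublayer, strategy):
    """One residual block under three LN placement strategies."""
    if strategy == "pre":    # LN before the sublayer only
        return x + sublayer(layer_norm(x))
    if strategy == "post":   # LN after the residual addition
        return layer_norm(x + sublayer(x))
    if strategy == "peri":   # LN both before and after the sublayer
        return x + layer_norm(sublayer(layer_norm(x)))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
sub = lambda h: h @ W                 # stand-in for attention/MLP
x = rng.normal(size=(4, 8))
out = {s: block(x, sub, s) for s in ("pre", "post", "peri")}
# Peri-LN normalizes each sublayer's output, bounding its contribution
# to the residual stream; Pre-LN leaves that contribution unnormalized.
assert np.std(out["peri"] - x) < np.std(out["pre"] - x)
```

The final assertion shows, in miniature, why Peri-LN limits variance growth along the residual stream.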
Paperid:1331
Authors:Wenjing Yan · Xiangyu Zhong · Xiaolu Wang · Angela Yingjun Zhang
Abstract:
Asynchronous federated learning (AFL) has emerged as a promising solution to address system heterogeneity and improve the training efficiency of federated learning. However, existing AFL methods face two critical limitations: 1) they rely on strong assumptions about bounded data heterogeneity across clients, and 2) they require meticulous tuning of learning rates based on unknown system parameters. In this paper, we tackle these challenges by leveraging momentum-based optimization and adaptive learning strategies. We first propose MasFL, a novel momentum-driven AFL framework that successfully eliminates the need for data heterogeneity bounds by effectively utilizing historical descent directions across clients and iterations. By mitigating the staleness accumulation caused by asynchronous updates, we prove that MasFL achieves state-of-the-art convergence rates with linear speedup in both the number of participating clients and local updates. Building on this foundation, we further introduce AdaMasFL, an adaptive variant that incorporates gradient normalization into local updates. Remarkably, this integration removes all dependencies on problem-specific parameters, yielding a fully tuning-free AFL approach while retaining theoretical guarantees. Extensive experiments demonstrate that AdaMasFL consistently outperforms state-of-the-art AFL methods in runtime efficiency and exhibits exceptional robustness across diverse learning rate configurations and system conditions.
Paperid:1332
Authors:Ting-Ji Huang · Jia-Qi Yang · Chunxu Shen · Kai-Qi Liu · De-Chuan Zhan · Han-Jia Ye
Abstract:
Characterizing users and items through vector representations is crucial for various tasks in recommender systems. Recent approaches attempt to apply Large Language Models (LLMs) in recommendation through a question-and-answer format, where real items (e.g., Item No. 2024) are represented with compound words formed from in-vocabulary tokens (e.g., "item", "20", "24"). However, these tokens are not suitable for representing items, as their meanings are shaped by pre-training on natural language tasks, limiting the model's ability to capture user-item relationships effectively. In this paper, we explore how to effectively characterize users and items in LLM-based recommender systems from the token construction view. We demonstrate the necessity of using out-of-vocabulary (OOV) tokens for the characterization of items and users, and propose a principled way of constructing these OOV tokens. By clustering the learned representations from historical user-item interactions, we make the representations of user/item combinations share the same OOV tokens if they have similar properties. This construction allows us to capture user/item relationships well (memorization) and preserve the diversity of descriptions of users and items (diversity). Furthermore, integrating these OOV tokens into the LLM’s vocabulary allows for better distinction between users and items and enhanced capture of user-item relationships during fine-tuning on downstream tasks. Our proposed framework outperforms existing state-of-the-art methods across various downstream recommendation tasks.
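The clustering step, similar items sharing OOV tokens, can be sketched with a toy k-means over interaction-derived embeddings. The function name, farthest-point initialization, and data here are illustrative inventions, not the paper's procedure.

```python
import numpy as np

def assign_oov_tokens(emb, n_tokens, iters=20):
    """Toy clustering: map each item embedding to a shared OOV token id."""
    # farthest-point initialization, then plain k-means
    centers = emb[[0]]
    for _ in range(n_tokens - 1):
        d = np.linalg.norm(emb[:, None] - centers[None], axis=-1).min(-1)
        centers = np.vstack([centers, emb[d.argmax()]])
    for _ in range(iters):
        labels = np.linalg.norm(emb[:, None] - centers[None], axis=-1).argmin(-1)
        centers = np.vstack([emb[labels == k].mean(0) for k in range(n_tokens)])
    return labels  # items in the same cluster share an OOV token

rng = np.random.default_rng(1)
# two groups of items with similar interaction histories
items = np.vstack([rng.normal(0, 0.1, (5, 4)), rng.normal(3, 0.1, (5, 4))])
tokens = assign_oov_tokens(items, n_tokens=2)
assert len(set(tokens[:5])) == 1 and tokens[0] != tokens[5]
```

Items with similar histories end up with the same token id, which is the memorization/diversity trade-off the abstract describes.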
Paperid:1333
Authors:Joshua Southern · Yam Eitan · Guy Bar Shalom · Michael Bronstein · Haggai Maron · Fabrizio Frasca
Abstract:
Subgraph GNNs have emerged as promising architectures that overcome the expressiveness limitations of Graph Neural Networks (GNNs) by processing bags of subgraphs. Despite their compelling empirical performance, these methods are afflicted by a high computational complexity: they process bags whose size grows linearly in the number of nodes, hindering their applicability to larger graphs. In this work, we propose an effective and easy-to-implement approach to dramatically alleviate the computational cost of Subgraph GNNs and unleash broader applications thereof. Our method, dubbed HyMN, leverages walk-based centrality measures to sample a small number of relevant subgraphs and drastically reduce the bag size. By drawing a connection to perturbation analysis, we highlight the strength of the proposed centrality-based subgraph sampling, and further prove that these walk-based centralities can be additionally used as Structural Encodings for improved discriminative power. A comprehensive set of experimental results demonstrates that HyMN provides an effective synthesis of expressiveness, efficiency, and downstream performance, unlocking the application of Subgraph GNNs to dramatically larger graphs. Not only does our method outperform more sophisticated subgraph sampling approaches, it is also competitive, and sometimes better, than other state-of-the-art approaches for a fraction of their runtime.
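A minimal version of centrality-based selection: count short walks from each node and keep only the top-k nodes as subgraph roots, shrinking the bag from n subgraphs to k. This is a sketch; the paper's exact centrality measure and sampling scheme may differ.

```python
import numpy as np

def walk_centrality(A, length=3):
    """Count walks of up to `length` steps starting at each node."""
    c = np.zeros(len(A))
    P = np.eye(len(A))
    for _ in range(length):
        P = P @ A
        c += P.sum(1)
    return c

def sample_subgraph_roots(A, k, length=3):
    # keep only the k most central nodes as subgraph roots
    return np.argsort(-walk_centrality(A, length))[:k]

# star graph: node 0 is the hub, nodes 1-4 are leaves
A = np.zeros((5, 5))
A[0, 1:] = A[1:, 0] = 1
roots = sample_subgraph_roots(A, k=1)
assert roots[0] == 0  # the hub has the highest walk centrality
```

The same centrality vector could also be concatenated to node features as the Structural Encoding the abstract mentions.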
Paperid:1334
Authors:Xuanming Cui · Chionh Peng · Adriel Kuek · Ser-Nam Lim
Abstract:
Neural Theorem Provers (NTPs) present a promising framework for neuro-symbolic reasoning, combining end-to-end differentiability with the interpretability of symbolic logic programming. However, optimizing NTPs remains a significant challenge due to their complex objective landscape and gradient sparsity. On the other hand, Knowledge Graph Embedding (KGE) methods offer smooth optimization with well-defined learning objectives but often lack interpretability. In this work, we propose several strategies to integrate the strengths of NTPs and KGEs, and demonstrate substantial improvements in both accuracy and computational efficiency. Specifically, we show that by leveraging the strength of structural learning in KGEs, we can greatly improve NTPs' poorly structured embedding space, while by substituting NTPs with efficient KGE operations, we can significantly reduce evaluation time by over 1000$\times$ on large-scale datasets such as WN18RR with a mild accuracy trade-off.
Paperid:1335
Authors:Zhuoling Li · LiangLiang Ren · Jinrong Yang · Yong Zhao · Xiaoyang Wu · Zhenhua Xu · Xiang Bai · Hengshuang Zhao
Abstract:
The effectiveness of scaling up training data in robotic manipulation is still limited. A primary challenge in manipulation is that the tasks are diverse, and the trained policy would be confused if the task targets are not specified clearly. Existing works primarily rely on text instructions to describe targets. However, we reveal that current manipulation data cannot train policies to understand text instructions effectively, and vision is much more comprehensible. Therefore, we propose using vision instructions to specify targets. A straightforward implementation is training a policy to predict the intermediate actions linking the current observation with the task target described by a future image. Nevertheless, a single future image does not describe the target in sufficient detail. To handle this problem, we propose to use sparse point flows to provide more detailed information. Extensive tasks are designed based on real and simulated environments to evaluate the effectiveness of our vision-instructed pretraining (VIP) method. The results indicate VIP improves the performance on diverse tasks significantly, and the derived policy completes challenging tasks like ``opening the lid of a tightly sealed bottle''.
Paperid:1336
Authors:Seyed Mohammad Sadegh Mahdavi · Muchen Li · Kaiwen Liu · Christos Thrampoulidis · Leonid Sigal · Renjie Liao
Abstract:
Advances in Large Language Models (LLMs) have sparked interest in their ability to solve Olympiad-level math problems. However, the training and evaluation of these models are constrained by the limited size and quality of available datasets, as creating large-scale data for such advanced problems requires extensive effort from human experts. In addition, current benchmarks are prone to contamination, leading to unreliable evaluations. In this paper, we present an automated pipeline that leverages the rich resources of the Art of Problem Solving (AoPS) forum, which predominantly features Olympiad-level problems and community-driven solutions. Using open-source LLMs, we develop a method to extract question-answer pairs from the forum, resulting in AoPS-Instruct, a dataset of more than 600,000 high-quality QA pairs. Our experiments demonstrate that fine-tuning LLMs on AoPS-Instruct improves their reasoning abilities across various benchmarks. Moreover, we build an automatic pipeline that introduces LiveAoPSBench, an evolving evaluation set with timestamps, derived from the latest forum data, providing a contamination-resistant benchmark for assessing LLM performance. Notably, we observe a significant decline in LLM performance over time, suggesting their success on older examples may stem from pre-training exposure rather than true reasoning ability. Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning, offering valuable insights into the capabilities and limitations of LLMs in this domain.
Paperid:1337
Authors:Daiki Chijiwa · Taku Hasegawa · Kyosuke Nishida · Kuniko Saito · Susumu Takeuchi
Abstract:
While foundation models have been exploited for various expert tasks with their fine-tuned parameters, any foundation model will eventually become outdated due to its old knowledge or limited capability, and thus should be replaced by a new foundation model. Subsequently, to benefit from its latest knowledge or improved capability, the new foundation model should be fine-tuned on each task again, which incurs not only the additional training cost but also the maintenance cost of the task-specific data. Existing work addresses this problem by inference-time tuning, i.e., modifying the output probability from the new foundation model by the outputs from the old foundation model and its fine-tuned model, which involves an additional inference cost by the latter two models. In this paper, we explore a new fine-tuning principle (which we call portable reward tuning; PRT) that reduces the inference cost by its nature, based on the reformulation of fine-tuning as reward maximization with Kullback-Leibler regularization. Specifically, instead of fine-tuning parameters of the foundation models, PRT trains the reward model explicitly through the same loss as in fine-tuning. During inference, the reward model can be used with any foundation model (with the same set of vocabularies or labels) through the formulation of reward maximization. Experimental results, including both vision and language models, show that the PRT-trained model can achieve comparable accuracy with less inference cost, in comparison to the existing work of inference-time tuning.
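The underlying identity is the closed-form solution of KL-regularized reward maximization, p*(y) proportional to p0(y) exp(r(y)/beta): a separately trained reward model can re-weight any foundation model's output distribution at inference time. A toy numerical sketch with invented names, not the paper's implementation:

```python
import numpy as np

def reward_adjusted(probs, rewards, beta=1.0):
    """KL-regularized reward maximization in closed form:
    p*(y) is proportional to p0(y) * exp(r(y) / beta)."""
    logits = np.log(probs) + np.array(rewards) / beta
    z = np.exp(logits - logits.max())  # subtract max for stability
    return z / z.sum()

p0 = np.array([0.5, 0.3, 0.2])  # any foundation model's label/token probs
r = [0.0, 1.0, 0.0]             # a separately trained reward model's scores
p = reward_adjusted(p0, r)
assert p[1] > p0[1]             # the rewarded outcome gains probability
```

Because only `p0` changes when the foundation model is swapped, the same reward model remains portable across models sharing a vocabulary.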
Paperid:1338
Authors:Yujia Zheng · Shaoan Xie · Kun Zhang
Abstract:
We are born with the ability to learn concepts by comparing diverse observations. This helps us to understand the new world in a compositional manner and facilitates extrapolation, as objects naturally consist of multiple concepts. In this work, we argue that the cognitive mechanism of comparison, fundamental to human learning, is also vital for machines to recover true concepts underlying the data. This offers correctness guarantees for the field of concept learning, which, despite its impressive empirical successes, still lacks general theoretical support. Specifically, we aim to develop a theoretical framework for the identifiability of concepts with multiple classes of observations. We show that with sufficient diversity across classes, hidden concepts can be identified without assuming specific concept types, functional relations, or parametric generative models. Interestingly, even when conditions are not globally satisfied, we can still provide alternative guarantees for as many concepts as possible based on local comparisons, thereby extending the applicability of our theory to more flexible scenarios. Moreover, the hidden structure between classes and concepts can also be identified nonparametrically. We validate our theoretical results in both synthetic and real-world settings.
Paperid:1339
Authors:Kiran Thekumparampil · Gaurush Hiranandani · Kousha Kalantari · Shoham Sabach · Branislav Kveton
Abstract:
We study learning human preferences from limited comparison feedback, a core machine learning task with significant impact on reinforcement learning from human feedback (RLHF). We formulate this problem as learning a Plackett-Luce (PL) model from a small number of $K$-way comparisons over a universe of $N$ choices, where typically $K \ll N$. Our objective is to select these $K$-subsets to elicit the $K$-way ranking feedback in such a way that the true ranking of the $N$ items can be estimated from the feedback with minimal mistakes. Our approach selects the subsets using the framework of D-optimal design which aims to minimize the worst-case ranking mistake of the estimated PL model. However, all known algorithms for this problem are computationally infeasible in our setting due to exponentially-many ${N \choose K}$ subsets to consider. To address this issue, we propose a Frank-Wolfe algorithm that exploits randomization, memoization, and sparse updates to achieve $O(N^2 + K^2)$ per-iteration complexity. We analyze it, and showcase its superior empirical performance on synthetic and open-source NLP datasets.
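For intuition, here is plain Frank-Wolfe on the classical relaxed D-optimal design over N single items (maximizing log det of the information matrix over the simplex). The paper's algorithm works over K-subsets with randomization and memoization, which this sketch omits entirely.

```python
import numpy as np

def d_optimal_fw(X, iters=100):
    """Frank-Wolfe for relaxed D-optimal design: maximize
    log det(sum_i w_i x_i x_i^T) over the probability simplex."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    for t in range(iters):
        M = X.T @ (w[:, None] * X)
        Minv = np.linalg.inv(M)
        # gradient of log det w.r.t. w_i is x_i^T M^{-1} x_i
        g = np.einsum("ij,jk,ik->i", X, Minv, X)
        i = g.argmax()            # linear maximization oracle: a simplex vertex
        step = 1.0 / (t + 2)
        w = (1 - step) * w
        w[i] += step              # sparse update: only one coordinate moves
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
w = d_optimal_fw(X)
assert abs(w.sum() - 1.0) < 1e-8 and (w >= 0).all()
```

Each iteration touches a single vertex, which is the sparse-update structure the abstract's complexity bound relies on.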
Paperid:1340
Authors:Oliver Broadrick · Sanyam Agarwal · Markus Bläser · Guy Van den Broeck
Abstract:
Marginalization -- summing a function over all assignments to a subset of its inputs -- is a fundamental computational problem with applications from probabilistic inference to formal verification. Despite its computational hardness in general, there exist many classes of functions (e.g., probabilistic models) for which marginalization remains tractable, and they can all be commonly expressed by arithmetic circuits computing multilinear polynomials. This raises the question, can *all* functions with polynomial time marginalization algorithms be succinctly expressed by such circuits? We give a negative answer, exhibiting simple functions with tractable marginalization yet no efficient representation by known models, assuming $\mathsf{FP} \neq \#\mathsf{P}$ (an assumption implied by $\mathsf{P} \neq \mathsf{NP}$). To this end, we identify a hierarchy of complexity classes corresponding to stronger forms of marginalization, all of which are efficiently computable on the known circuit models. We conclude with a completeness result, showing that whenever there is an efficient real RAM performing virtual evidence marginalization for a function, then there are small arithmetic circuits for that function's multilinear representation.
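The tractability the abstract refers to can be seen in miniature: for a multilinear network polynomial with indicator leaves for a variable, setting both of that variable's indicators to 1 sums the variable out in a single evaluation. The polynomial below is a hypothetical two-leaf example, not one from the paper.

```python
# Hypothetical network polynomial over x1 and a binary variable X2 with
# indicator leaves l0, l1: f = x1 * (0.4*l0 + 0.6*l1).
poly = lambda x1, l0, l1: x1 * (0.4 * l0 + 0.6 * l1)

# Brute-force marginalization of X2: one evaluation per assignment...
brute = poly(1.0, 1, 0) + poly(1.0, 0, 1)
# ...but multilinearity lets a single pass with l0 = l1 = 1 do the same.
single = poly(1.0, 1, 1)
assert single == brute
```

The paper's negative result says some tractably-marginalizable functions admit no small circuit of this multilinear form.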
Paperid:1341
Authors:Yuya Hikima · Hiroshi Sawada · Akinori Fujino
Abstract:
In this study, we tackle an optimization problem with a known function and an unknown decision-dependent distribution, which arises in a variety of applications. To solve the problem, several zeroth-order methods have been developed because the gradient of the objective function cannot be obtained explicitly due to the unknown distribution. Although these methods have theoretical convergence, they cannot utilize the information on the known function, which limits their efficiency in reducing the objective value. To overcome this issue, we propose new zeroth-order methods that generate effective update directions by utilizing information on the known function. As theoretical results, we show the convergence of our methods to stationary points and provide the worst-case sample complexity analysis. Our simulation experiments on multiple applications show that our methods output solutions with lower objective values than the existing zeroth-order methods do.
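A generic two-point zeroth-order estimator, the kind of baseline the paper improves on by additionally exploiting the known part of the objective, looks like this. The objective is a toy stand-in for the expectation over the unknown decision-dependent distribution, and all names are invented.

```python
import numpy as np

def zo_step(x, obj, mu=1e-3, lr=0.05, rng=None):
    """Two-point zeroth-order update: estimate a directional derivative
    by finite differences along a random direction, then step against it."""
    u = rng.standard_normal(x.shape)
    g = (obj(x + mu * u) - obj(x - mu * u)) / (2 * mu) * u
    return x - lr * g

rng = np.random.default_rng(0)
obj = lambda x: float(np.sum((x - 1.0) ** 2))  # stands in for E_{z~D(x)} f(x, z)
x = np.zeros(3)
for _ in range(500):
    x = zo_step(x, obj, rng=rng)
assert obj(x) < 1e-3  # converges to the minimizer despite no explicit gradient
```

The paper's contribution is, roughly, to replace the fully black-box direction `u` with directions informed by the known function.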
Paperid:1342
Authors:Guy Ohayon · Hila Manor · Tomer Michaeli · Michael Elad
Abstract:
We present a novel generative approach based on Denoising Diffusion Models (DDMs), which produces high-quality image samples along with their losslessly compressed bit-stream representations. This is obtained by replacing the standard Gaussian noise sampling in the reverse diffusion with a selection of noise samples from pre-defined codebooks of fixed iid Gaussian vectors. Surprisingly, we find that our method, termed Denoising Diffusion Codebook Model (DDCM), retains sample quality and diversity of standard DDMs, even for extremely small codebooks. We leverage DDCM and pick the noises from the codebooks that best match a given image, converting our generative model into a highly effective lossy image codec achieving state-of-the-art perceptual image compression results. More generally, by setting other noise selection rules, we extend our compression method to any conditional image generation task (e.g., image restoration), where the generated images are produced jointly with their condensed bit-stream representations. Our work is accompanied by a mathematical interpretation of the proposed compressed conditional generation schemes, establishing a connection with score-based approximations of posterior samplers for the tasks considered.
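The codebook mechanism can be sketched in one dimension-agnostic toy step: instead of drawing fresh Gaussian noise, pick a codeword, and record its index as the bit-stream. The denoiser, step size, and selection rule below are crude stand-ins, not the paper's model; what the sketch does show is why replaying the recorded indices reconstructs the output losslessly.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, K = 16, 8
codebook = rng.standard_normal((K, dim))  # fixed i.i.d. Gaussian codebook
denoise = lambda x: 0.9 * x               # stand-in for a trained denoiser

def ddcm_step(x, target):
    """One toy reverse step: choose the codeword steering x toward the
    target image; the chosen index becomes part of the bit-stream."""
    x0 = denoise(x)
    cand = x0 + 0.1 * codebook
    idx = int(np.linalg.norm(cand - target, axis=-1).argmin())
    return x0 + 0.1 * codebook[idx], idx

target = rng.standard_normal(dim)
x_init = rng.standard_normal(dim)
x, bits = x_init.copy(), []
for _ in range(20):
    x, idx = ddcm_step(x, target)
    bits.append(idx)                      # log2(K) = 3 bits per step

# decoding: replay the recorded indices to reproduce x exactly
y = x_init.copy()
for idx in bits:
    y = denoise(y) + 0.1 * codebook[idx]
assert np.allclose(y, x)
```

The replay loop is the "losslessly compressed bit-stream" claim in miniature: the indices alone, plus the shared codebook and initialization, determine the output.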
Paperid:1343
Authors:Anas Jnini · Lorenzo Breschi · Flavio Vella
Abstract:
Divergence-free symmetric tensors (DFSTs) are fundamental in continuum mechanics, encoding conservation laws such as mass and momentum conservation. We introduce Riemann Tensor Neural Networks (RTNNs), a novel neural architecture that inherently satisfies the DFST condition to machine precision, providing a strong inductive bias for enforcing these conservation laws. We prove that RTNNs can approximate any sufficiently smooth DFST with arbitrary precision and demonstrate their effectiveness as surrogates for conservative PDEs, achieving improved accuracy across benchmarks. This work is the first to use DFSTs as an inductive bias in neural PDE surrogates and to explicitly enforce the conservation of both mass and momentum within a physics-constrained neural architecture.
Paperid:1344
Authors:Jingyi Cui · Hongwei Wen · Yisen Wang
Abstract:
Self-supervised contrastive learning has emerged as a powerful tool in machine learning and computer vision to learn meaningful representations from unlabeled data. Meanwhile, its empirical success has encouraged many theoretical studies to reveal the learning mechanisms. However, in the existing theoretical research, the role of data augmentation is still under-exploited, especially the effects of specific augmentation types. To fill this gap, we for the first time propose an augmentation-aware error bound for self-supervised contrastive learning, showing that the supervised risk is bounded not only by the unsupervised risk, but also explicitly by a trade-off induced by data augmentation. Then, under a novel semantic label assumption, we discuss how certain augmentation methods affect the error bound. Lastly, we conduct both pixel- and representation-level experiments to verify our proposed theoretical results.
Paperid:1345
Authors:Haoqi Wu · Wei Dai · Wang Li · Qiang Yan
Abstract:
Large Language Models (LLMs) have gained significant popularity due to their remarkable capabilities in text understanding and generation. However, despite their widespread deployment in inference services such as ChatGPT, concerns about the potential leakage of sensitive user data have arisen. Existing solutions primarily rely on privacy-enhancing technologies to mitigate such risks, facing the trade-off among efficiency, privacy, and utility. To narrow this gap, we propose Cape, a context-aware prompt perturbation mechanism based on differential privacy, to enable efficient inference with an improved privacy-utility trade-off. Concretely, we introduce a hybrid utility function that better captures the token similarity. Additionally, we propose a bucketized sampling mechanism to handle the large sampling space, which might otherwise lead to long-tail phenomena. Extensive experiments across multiple datasets, along with ablation studies, demonstrate that Cape achieves a better privacy-utility trade-off compared to prior state-of-the-art works.
Paperid:1346
Authors:Zeyu Wang · Yao-Hui Li · Xin Li · Hongyu Zang · Romain Laroche · Riashat Islam
Abstract:
Multi-View Reinforcement Learning (MVRL) seeks to provide agents with multi-view observations, enabling them to perceive the environment with greater effectiveness and precision. Recent advancements in MVRL focus on extracting latent representations from multi-view observations and leveraging them in control tasks. However, it is not straightforward to learn compact and task-relevant representations, particularly in the presence of redundancy, distracting information, or missing views. In this paper, we propose Multi-view Fusion State for Control (MFSC), the first to incorporate bisimulation metric learning into MVRL to learn task-relevant representations. Furthermore, we propose a multi-view-based mask and latent reconstruction auxiliary task that exploits shared information across views and improves MFSC’s robustness in missing views by introducing a mask token. Extensive experimental results demonstrate that our method outperforms existing approaches in MVRL tasks. Even in more realistic scenarios with interference or missing views, MFSC consistently maintains high performance. The project code is available in the supplementary material.
Paperid:1347
Authors:Junchao Zhou · Yao Lu · Jie Wen · Guangming Lu
Abstract:
Image steganography hides multiple images for multiple recipients into a single cover image. All secret images are usually revealed without authentication, which reduces security among multiple recipients. An elegant solution is to design an authentication mechanism for isolated reception. We explore such a mechanism through sufficient experiments, and uncover that additional authentication information will affect the distribution of hidden information and occupy more hiding space of the cover image. This severely decreases effectiveness and efficiency in large-capacity hiding. To overcome such a challenge, we first prove the authentication feasibility within image steganography. Then, this paper proposes an image steganography network collaborating with separate authentication and an efficient scheme. Specifically, multiple pairs of lock-key are generated during hiding and revealing. Unlike traditional methods, our method has two stages to make appropriate distribution adaptation between locks and secret images, simultaneously extracting more reasonable primary information from secret images, which can release hiding space of the cover image to some extent. Furthermore, due to separate authentication, fused information can be hidden in parallel with a single network rather than traditional serial hiding with multiple networks, which can largely decrease the model size. Extensive experiments demonstrate that the proposed method achieves more secure, effective, and efficient image steganography.
Paperid:1348
Authors:Kumar Ashutosh · Yossi Gandelsman · Xinlei Chen · Ishan Misra · Rohit Girdhar
Abstract:
We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach to imbue multimodal capabilities into your favorite LLM. Leveraging their innate ability to perform multi-step reasoning, MILS prompts the LLM to generate candidate outputs, each of which is scored and fed back iteratively, eventually generating a solution to the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state-of-the-art on emergent zero-shot image, video and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text-to-image generation, and even edit prompts for style transfer! Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.
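The generate-score-feedback loop is the whole method. A toy stand-in, with a random string mutator playing the LLM "generator" and character match playing the "scorer", shows the gradient-free structure; none of this is the paper's code.

```python
import random

def mils_loop(propose, score, init, steps=200):
    """Toy MILS-style loop: propose candidates, score them, feed the
    best back as the next context. No gradients, no training."""
    best = init
    for _ in range(steps):
        cands = propose(best)
        cands.append(best)          # keep the incumbent
        best = max(cands, key=score)
    return best

random.seed(0)
target = "zero shot"
alphabet = "abcdefghijklmnopqrstuvwxyz "
def propose(s):                     # stand-in "generator": mutate one char
    out = []
    for _ in range(8):
        i = random.randrange(len(s))
        out.append(s[:i] + random.choice(alphabet) + s[i + 1:])
    return out
score = lambda s: sum(a == b for a, b in zip(s, target))  # stand-in "scorer"

init = " " * len(target)
best = mils_loop(propose, score, init)
assert score(best) > score(init)    # the loop improves without any gradient
```

In the actual system, the proposer is an LLM and the scorer is, e.g., a CLIP-style similarity to an image; the loop structure is unchanged.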
Paperid:1349
Authors:Lin Zhu · Xiantao Ma · Xiao Wang · Lizhi Wang · Hua Huang
Abstract:
Event cameras are innovative sensors that capture brightness changes as asynchronous events rather than traditional intensity frames. These cameras offer substantial advantages over conventional cameras, including high temporal resolution, high dynamic range, and the elimination of motion blur. However, defocus blur, a common image quality degradation resulting from out-of-focus lenses, complicates the challenge of event-based imaging. Due to the unique imaging mechanism of event cameras, existing focusing algorithms struggle to operate efficiently on sparse event data. In this work, we propose EvFocus, a novel architecture designed to reconstruct sharp images from defocus event streams for the first time. Our work includes the development of an event-based out-of-focus camera model and a simulator to generate realistic defocus event streams for robust training and testing. EvFocus integrates a temporal information encoder, a blur-aware two-branch decoder, and a reconstruction and re-defocus module to effectively learn and correct defocus blur. Extensive experiments on both simulated and real-world datasets demonstrate that EvFocus outperforms existing methods across varying lighting conditions and blur sizes, proving its robustness and practical applicability in event-based defocus imaging.
Paperid:1350
Authors:Kihyuk Hong · Ambuj Tewari
Abstract:
We study reinforcement learning in infinite-horizon average-reward settings with linear MDPs. Previous work addresses this problem by approximating the average-reward setting by the discounted setting and employing a value iteration-based algorithm that uses clipping to constrain the span of the value function for improved statistical efficiency. However, the clipping procedure requires computing the minimum of the value function over the entire state space, which is prohibitive since the state space in the linear MDP setting can be large or even infinite. In this paper, we introduce a value iteration method with an efficient clipping operation that only requires computing the minimum of value functions over the set of states visited by the algorithm. Our algorithm enjoys the same regret bound as the previous work while being computationally efficient, with computational complexity that is independent of the size of the state space.
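The computational point can be seen in a toy sketch: clip the value function at (minimum over visited states) + span, so the cost is O(|visited|) rather than O(|S|). The numbers and function name are illustrative only.

```python
def clipped_values(v, visited, span):
    """Clip a value function's span using only the states the
    algorithm has actually visited, not the whole state space."""
    floor = min(v[s] for s in visited)   # O(|visited|), not O(|S|)
    return {s: min(v[s], floor + span) for s in v}

v = {s: float(s) for s in range(1000)}  # a large (toy) state space
visited = [3, 17, 42]
cv = clipped_values(v, visited, span=10.0)
assert max(cv.values()) == 13.0         # clipped at min(visited) + span
```

The statistical argument in the paper is that clipping against the visited-state minimum suffices for the same regret bound.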
Paperid:1351
Authors:Mohamad Chehade · Soumya Suvra Ghosal · Souradip Chakraborty · Avinash Reddy · Dinesh Manocha · Amrit Singh Bedi · Hao Zhu
Abstract:
Aligning large language models (LLMs) with human preferences is essential for their safe and effective deployment. Existing methods typically maximize multiple rewards reflecting human preferences, often framed as a multi-objective optimization problem. However, research on bounded rationality suggests that human decision-making follows satisficing strategies—maximizing key objectives while ensuring others meet acceptable thresholds. This aspect is largely overlooked in alignment research. To address this, we introduce UAMD, a user-specified multi-criteria alignment framework, allowing users to set individualized thresholds. Since this personalization complicates training-time alignment, we propose an inference-time alignment method that enforces user-specified thresholds without fine-tuning. We provide a theoretical analysis of our proposed approach and derive suboptimality upper bounds. We empirically validate the performance of our proposed method through experimentation on multiple benchmarks. For instance, on the PKU-SafeRLHF dataset with the primary objective of maximizing helpfulness while ensuring a threshold on harmlessness, UAMD outperforms the state-of-the-art multi-objective decoding strategy by a margin of $22.3$% in terms of GPT-4 win-tie rate for helpfulness reward while adhering to the user-specified threshold on harmlessness.
Paperid:1352
Authors:Chengxing Jia · Ziniu Li · Yi-Chen Li · Pengyuan Wang · Zhenyu Hou · Yuxiao Dong · Yang Yu
Abstract:
Adapting Large Language Models (LLMs) to downstream tasks using Reinforcement Learning (RL) has proven to be an effective approach. However, LLMs do not inherently define the structure of an agent for RL training, particularly in terms of defining the action space. Existing frameworks often rely on token-level actions that may be overly large and inefficient. To address this limitation, we propose learning a compact and latent action space to improve controllability and exploration in RL. Inspired by the literature on "RL from observations only", we propose **L**atent **A**ction governed world **M**odel from **P**re-trained LLM (LAMP), which augments a latent action space with a pre-trained LLM to form a latent action language world model. This latent action model is extensively trained from the pre-training dataset and tuned in the post-training stage. Experiments with Llama-3.1-8B as the base model demonstrate that using this latent action model for RL training achieves better controllability on multiple tasks, including a $64.0\%$ average win-rate across multiple preference alignment tasks and an $11.0\%$ improvement in the math reasoning task, as well as improved and more flexible search compared to the token-level action framework. In addition to these performance gains, we also find that controlling LLMs with latent actions mitigates forgetting and maintains general abilities well.
Paperid:1353
Authors:Luise Ge · Michael Lanier · Anindya Sarkar · Bengisu Guresti · Yevgeniy Vorobeychik · Chongjie Zhang
Abstract:
Many dynamic decision problems, such as robotic control, involve a series of tasks, many of which are unknown at training time. Typical approaches for these problems, such as multi-task and meta-reinforcement learning, do not generalize well when the tasks are diverse. On the other hand, approaches that aim to tackle task diversity, such as using task embeddings as policy context and task clustering, typically lack performance guarantees and require a large number of training tasks. To address these challenges, we propose a novel approach for learning a policy committee that includes at least one near-optimal policy with high probability for tasks encountered during execution. While we show that this problem is in general inapproximable, we present two practical algorithmic solutions. The first yields provable approximation and task sample complexity guarantees when tasks are low-dimensional (the best we can do due to inapproximability), whereas the second is a general and practical gradient-based approach. In addition, we provide a provable sample complexity bound for few-shot learning. Our experiments on MuJoCo and Meta-World show that the proposed approach outperforms state-of-the-art multi-task, meta-learning, and task-clustering baselines in training, generalization, and few-shot learning, often by a large margin.
Paperid:1354
Authors:Hyunseok Lee · Seunghyuk Oh · Jihoon Tack · Jaehyung Kim · Jinwoo Shin
Abstract:
Self-awareness, i.e., the ability to assess and correct one's own generation, is a fundamental aspect of human intelligence, making its replication in large language models (LLMs) an important yet challenging task. Previous works tackle this by employing extensive reinforcement learning or by relying on large external verifiers. In this work, we propose Refine via Intrinsic Self-Verification (ReVISE), an efficient and effective framework that enables LLMs to self-correct their outputs through self-verification. The core idea of ReVISE is to enable LLMs to verify their reasoning processes and continually rethink reasoning trajectories based on this verification. We introduce a structured curriculum based upon online preference learning to implement this efficiently. Specifically, as ReVISE involves two challenging tasks (i.e., self-verification and reasoning correction), we tackle each task sequentially using curriculum learning, collecting both failed and successful reasoning paths to construct preference pairs for efficient training. During inference, our approach enjoys natural test-time scaling by integrating self-verification and correction capabilities, further enhanced by our proposed confidence-aware decoding mechanism. Our experiments on various reasoning tasks demonstrate that ReVISE achieves efficient self-correction and significantly improves reasoning performance.
Paperid:1355
Authors:David Heurtel-Depeiges · Anian Ruoss · Joel Veness · Tim Genewein
Abstract:
Foundation models are strong data compressors, but when accounting for their parameter size, their compression ratios are inferior to standard compression algorithms. Naively reducing the parameter count does not necessarily help, as it deteriorates predictions and, accordingly, compression. We conduct a large-scale empirical study to find a sweet spot where pre-trained vanilla transformers can achieve competitive compression ratios. To this end, we train models on 165GB of raw byte sequences of either text, image, or audio data (and all possible combinations of the three) and then compress 1GB of out-of-distribution (OOD) data from each modality. We find that relatively small models (millions of parameters) can outperform standard general-purpose compression algorithms (gzip, LZMA2) and even domain-specific compressors (PNG, JPEG 2000, FLAC) — even when accounting for parameter size. We achieve, e.g., the lowest compression ratio of 0.49 on OOD audio data (vs. 0.54 for FLAC). We conduct extensive ablations and hyperparameter sweeps to study the impact of model and dataset scale, and we investigate the effect of unimodal versus multimodal training. We find that even small models can be trained to perform well on multiple modalities, but unlike large-scale foundation models, transfer to unseen modalities is generally weak.
Paperid:1356
Authors:Onno Eberhard · Michael Muehlebach · Claire Vernade
Abstract:
Partially observable environments present a considerable computational challenge in reinforcement learning due to the need to consider long histories. Learning with windows of $m$ observations quickly becomes intractable as $m$ grows. In this work, we introduce _memory traces_. Inspired by eligibility traces, these are compact representations of the history of observations in the form of exponential moving averages. We prove sample complexity bounds for the problem of offline on-policy evaluation that quantify the value errors achieved with memory traces for the general class of Lipschitz continuous value estimates. We find that there is a close connection to the window approach, but also demonstrate that, in certain environments, learning with memory traces is significantly more sample efficient. Finally, we empirically establish the effectiveness of memory traces in online reinforcement learning for both value prediction and control.
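The memory-trace idea above can be sketched as a simple exponential moving average over observation vectors; the decay parameter `lam`, the function name, and the plain-list representation are illustrative assumptions, not the paper's implementation.

```python
def update_trace(trace, obs, lam=0.9):
    """Exponential moving average of observations (a 'memory trace').

    Older observations decay geometrically with factor lam, so the trace
    is a fixed-size summary of the entire observation history.
    """
    return [lam * t + (1.0 - lam) * o for t, o in zip(trace, obs)]

# A length-3 observation vector, updated over a short history.
trace = [0.0, 0.0, 0.0]
for obs in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]):
    trace = update_trace(trace, obs, lam=0.5)
print(trace)  # [0.25, 0.5, 0.0]
```

Unlike an $m$-step window, the trace's size is independent of how much history it summarizes, which is the source of the efficiency contrast drawn in the abstract.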
Paperid:1357
Authors:Yiding Lu · Mouxing Yang · Dezhong Peng · Peng Hu · Yijie Lin · Xi Peng
Abstract:
Traditional text-based person ReID assumes that person descriptions from witnesses are complete and provided at once. However, in real-world scenarios, such descriptions are often partial or vague. To address this limitation, we introduce a new task called interactive person re-identification (I-ReID). I-ReID is a dialogue-based retrieval task that iteratively refines initial descriptions through ongoing interactions with the witnesses. To facilitate the study of this new task, we construct a dialogue dataset that incorporates multiple types of questions by decomposing fine-grained attributes of individuals. We further propose LLaVA-ReID, a question model that generates targeted questions based on visual and textual contexts to elicit additional details about the target person. Leveraging a looking-forward strategy, we prioritize the most informative questions as supervision during training. Experimental results on both I-ReID and text-based ReID benchmarks demonstrate that LLaVA-ReID significantly outperforms baselines. The datasets and code will be publicly available on GitHub upon acceptance.
Paperid:1358
Authors:Seongho Son · William Bankes · Sayak Ray Chowdhury · Brooks Paige · Ilija Bogunovic
Abstract:
Current Large Language Model (LLM) preference optimization algorithms do not account for temporal preference drift, which can lead to severe misalignment. To address this limitation, we propose Non-Stationary Direct Preference Optimisation (NS-DPO), which models time-dependent reward functions with a Dynamic Bradley-Terry model. NS-DPO offers a computationally efficient solution by introducing only a single discount parameter in the loss function, used for exponential weighting that proportionally focuses learning on more time-relevant datapoints. We theoretically analyse the convergence of NS-DPO in a general setting where the exact timing of preference changes is unknown, providing upper bounds on the estimation error and the regret caused by non-stationary preferences. Finally, we demonstrate the effectiveness of NS-DPO for fine-tuning LLMs under drifting preferences. Using scenarios where various levels of preference drift are introduced with popular LLM reward models and datasets, we show that NS-DPO fine-tuned LLMs remain robust under non-stationarity, significantly outperforming baseline algorithms that ignore temporal preference changes, without sacrificing performance in stationary cases.
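The single-discount-parameter weighting described above can be illustrated with a minimal sketch: a datapoint collected at time `t` receives weight `gamma ** (T - t)`, so recent preferences dominate the aggregate loss. The function name and per-example aggregation below are illustrative assumptions, not the NS-DPO loss itself.

```python
def discounted_loss(per_example_losses, timestamps, gamma=0.9):
    """Exponentially down-weight older datapoints with one discount parameter.

    The most recent timestamp T gets weight 1; a point from k steps
    earlier gets weight gamma**k.
    """
    T = max(timestamps)
    weights = [gamma ** (T - t) for t in timestamps]
    total = sum(w * l for w, l in zip(weights, per_example_losses))
    return total / sum(weights)

# The recent example (t=2) dominates the older one (t=0).
loss = discounted_loss([1.0, 0.0], timestamps=[0, 2], gamma=0.5)
print(loss)  # 0.2
```

Setting `gamma=1.0` recovers an ordinary unweighted average, i.e., the stationary case.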
Paperid:1359
Authors:Attila Szász · Balázs Bánhelyi · Mark Jelasity
Abstract:
The ultimate goal of verification is to guarantee the safety of deployed neural networks. Here, we claim that all the state-of-the-art verifiers we are aware of fail to reach this goal. Our key insight is that theoretical soundness (bounding the full-precision output while computing with floating point) does not imply practical soundness (bounding the floating point output in a potentially stochastic environment). We prove this observation for the approaches that are currently used to achieve provable theoretical soundness, such as interval analysis and its variants. We also argue that achieving practical soundness is significantly harder computationally. We support our claims empirically as well by evaluating several well-known verification methods. To mislead the verifiers, we create adversarial networks that detect and exploit features of the deployment environment, such as the order and precision of floating point operations. We demonstrate that all the tested verifiers are vulnerable to our new deployment-specific attacks, which proves that they are not practically sound.
Paperid:1360
Authors:Yucheng Hu · Yanjiang Guo · Pengchao Wang · Xiaoyu Chen · Yen-Jen Wang · Jianke Zhang · Koushil Sreenath · Chaochao Lu · Jianyu Chen
Abstract:
Visual representations play a crucial role in developing generalist robotic policies. Previous vision encoders, typically pretrained with single-image reconstruction or two-image contrastive learning, tend to capture static information, often neglecting the dynamic aspects vital for embodied tasks. Recently, video diffusion models (VDMs) have demonstrated the ability to predict future frames and showcase a strong understanding of the physical world. We hypothesize that VDMs inherently produce visual representations that encompass both current static information and predicted future dynamics, thereby providing valuable guidance for robot action learning. Based on this hypothesis, we propose the Video Prediction Policy (VPP), which learns an implicit inverse dynamics model conditioned on predicted future representations inside VDMs. To predict a more precise future, we fine-tune a pre-trained video foundation model on robot datasets along with internet human manipulation data. In experiments, VPP achieves a 41.5\% relative improvement on the Calvin ABC-D generalization benchmark compared to the previous state-of-the-art, and demonstrates a 31.6\% increase in success rates for complex real-world dexterous manipulation tasks. For your convenience, videos can be found at https://video-prediction-policy.github.io/
Paperid:1361
Authors:Jingyu Liu · Beidi Chen · Ce Zhang
Abstract:
Improving time-to-first-token (TTFT) is an essential objective in modern large language model (LLM) inference engines, because optimizing TTFT directly results in higher maximal QPS and meets the requirements of many critical applications. However, improving TTFT is notoriously challenging, since it is purely compute-bound and the performance bottleneck shifts from the self-attention that many prior works focus on to the MLP part. In this work, we present SpecPrefill, a training-free framework that accelerates inference TTFT for both long and medium context queries based on the following insight: LLMs are generalized enough to still preserve quality given only a carefully chosen subset of prompt tokens. At its core, SpecPrefill leverages a lightweight model to speculate on locally important tokens based on the context. These tokens, along with the necessary positional information, are then sent to the main model for processing. We evaluate SpecPrefill on a diverse set of tasks, followed by comprehensive benchmarking of the performance improvement both in a real end-to-end setting and in ablation studies. SpecPrefill manages to serve Llama-3.1-405B-Instruct-FP8 with up to 7× higher maximal end-to-end QPS on real downstream tasks and a 7.66× TTFT improvement during benchmarking.
Paperid:1362
Authors:Emile Pierret · Bruno Galerne
Abstract:
Diffusion or score-based models recently showed high performance in image generation. They rely on a forward and a backward stochastic differential equation (SDE). The sampling of a data distribution is achieved by numerically solving the backward SDE or its associated flow ODE. Studying the convergence of these models requires controlling four different types of error: the initialization error, the truncation error, the discretization error, and the score approximation error. In this paper, we theoretically study the behavior of diffusion models and their numerical implementation when the data distribution is Gaussian. Our first contribution is to derive the analytical solutions of the backward SDE and the probability flow ODE and to prove that these solutions and their discretizations are all Gaussian processes. Our second contribution is to compute the exact Wasserstein errors between the target and the numerically sampled distributions for any numerical scheme. This allows us to monitor convergence directly in the data space, while experimental works limit their empirical analysis to Inception features.
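For intuition about why Wasserstein errors are exactly computable in the Gaussian case, recall the classical closed form for one-dimensional Gaussians: $W_2\big(\mathcal{N}(m_1, s_1^2), \mathcal{N}(m_2, s_2^2)\big)^2 = (m_1 - m_2)^2 + (s_1 - s_2)^2$. The snippet below is a generic illustration of this textbook identity, not the paper's multivariate derivation.

```python
import math

def w2_gaussian_1d(m1, s1, m2, s2):
    """Exact 2-Wasserstein distance between two 1-D Gaussians
    N(m1, s1^2) and N(m2, s2^2)."""
    return math.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)

# Identical distributions are at distance zero; a pure mean shift
# contributes exactly |m1 - m2|.
print(w2_gaussian_1d(0.0, 1.0, 0.0, 1.0))  # 0.0
print(w2_gaussian_1d(0.0, 1.0, 3.0, 1.0))  # 3.0
```

Because sampled and target distributions both stay Gaussian in this setting, such closed forms let convergence be tracked directly in data space rather than through feature-space proxies.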
Paperid:1363
Authors:Hui Lan · Chang · Eleanor W Dillon · Vasilis Syrgkanis
Abstract:
We address the problem of estimating heterogeneous treatment effects in panel data, adopting the popular Difference-in-Differences (DiD) framework under the conditional parallel trends assumption. We propose a novel doubly robust meta-learner for the Conditional Average Treatment Effect on the Treated (CATT), reducing the estimation to a convex risk minimization problem involving a set of auxiliary models. Our framework allows for flexible estimation of the CATT when conditioning on any subset of variables of interest using generic machine learning. Leveraging Neyman orthogonality, our proposed approach is robust to estimation errors in the auxiliary models. As a generalization of our main result, we develop a meta-learning approach for the estimation of general conditional functionals under covariate shift. We also provide an extension to the instrumented DiD setting with non-compliance. Empirical results demonstrate the superiority of our approach over existing baselines.
Paperid:1364
Authors:Dasol Hong · Wooju Lee · Hyun Myung
Abstract:
Prompt tuning, which adapts vision-language models by freezing model parameters and optimizing only the prompt, has proven effective for task-specific adaptation. The core challenge in prompt tuning is improving specialization for a specific task and generalization to unseen domains. However, frozen encoders often produce misaligned features, leading to confusion between classes and limiting specialization. To overcome this issue, we propose a confusion-aware loss (CoA-loss) that improves specialization by refining the decision boundaries between confusing classes. Additionally, we mathematically demonstrate that a mixture model can enhance generalization without compromising specialization. This is achieved using confidence-aware temperature (CoA-temp), which adjusts the weight of each prediction in the mixture model based on its confidence within the class domains. Extensive experiments show that CoCoA-Mix, a mixture model with CoA-loss and CoA-temp, outperforms state-of-the-art methods by enhancing both specialization and generalization.
Paperid:1365
Authors:Chejian Xu · Mintong Kang · Jiawei Zhang · Zeyi Liao · Lingbo Mo · Mengqi Yuan · Huan Sun · Bo Li
Abstract:
Foundation model-based agents are increasingly used to automate complex tasks, enhancing efficiency and productivity. However, their access to sensitive resources and autonomous decision-making also introduce significant security risks, where successful attacks could lead to severe consequences. To systematically uncover these vulnerabilities, we propose AdvAgent, a black-box red-teaming framework for attacking web agents. Unlike existing approaches, AdvAgent employs a reinforcement learning-based pipeline to train an adversarial prompter model that optimizes adversarial prompts using feedback from the black-box agent. With careful attack design, these prompts effectively exploit agent weaknesses while maintaining stealthiness and controllability. Extensive evaluations demonstrate that AdvAgent achieves high success rates against state-of-the-art GPT-4-based web agents across diverse web tasks. Furthermore, we find that existing prompt-based defenses provide only limited protection, leaving agents vulnerable to our framework. These findings highlight critical vulnerabilities in current web agents and emphasize the urgent need for stronger defense mechanisms.
Paperid:1366
Authors:Tian Li · Tianyi Zhou · Jeff Bilmes
Abstract:
Sharpness-Aware Minimization (SAM) has been demonstrated to improve the generalization performance of overparameterized models by seeking flat minima on the loss landscape through optimizing model parameters that incur the largest loss within a neighborhood. Nevertheless, such min-max formulations are computationally challenging, especially when the problem is highly non-convex. Additionally, focusing only on the worst-case local solution while ignoring potentially many other local solutions may be suboptimal when searching for flat minima. In this work, we propose Tilted SAM (TSAM), a smoothed generalization of SAM inspired by exponential tilting that effectively assigns higher priority to local solutions that incur larger losses. TSAM is parameterized by a tilt hyperparameter $t$ and reduces to SAM as $t$ approaches infinity. We show that TSAM is smoother than SAM and thus easier to optimize, and that it explicitly favors flatter minima. We develop algorithms motivated by the discretization of Hamiltonian dynamics to solve TSAM. Empirically, TSAM arrives at flatter local minima and yields superior test performance compared to the SAM and ERM baselines across a range of image and text tasks.
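Exponential tilting as invoked above can be sketched generically: instead of taking the maximum loss over sampled neighborhood perturbations (as SAM's inner problem does), aggregate them with a tilted log-mean-exp; as the tilt parameter grows, the tilted value approaches the maximum. The function name and the explicit loss list below are illustrative assumptions, not the paper's algorithm.

```python
import math

def tilted_value(losses, t):
    """Exponentially tilted aggregate: (1/t) * log(mean(exp(t * l))).

    Smoothly interpolates toward max(losses) as t -> infinity,
    while remaining differentiable for finite t.
    """
    m = max(losses)  # subtract the max for numerical stability
    return m + math.log(
        sum(math.exp(t * (l - m)) for l in losses) / len(losses)
    ) / t

losses = [0.1, 0.5, 2.0]
print(tilted_value(losses, t=1.0))   # between the mean and the max
print(tilted_value(losses, t=50.0))  # approaches max(losses) = 2.0
```

The tilted objective thus up-weights high-loss perturbations without discarding the rest, which is the smoothing contrast with the pure worst-case formulation.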
Paperid:1367
Authors:Hao-Zhe Tan · Zhi Zhou · Lan-Zhe Guo · Yu-Feng Li
Abstract:
Pre-trained Vision-Language Models (VLMs) are becoming increasingly popular across various visual tasks, and several open-sourced VLM variants have been released. However, selecting the best-performing pre-trained VLM for a specific downstream task is challenging, since no single VLM can achieve promising performance on all downstream tasks, and evaluating all available VLMs is impossible due to time and data limitations. To address this problem, this paper proposes a novel paradigm to select and reuse VLMs for downstream tasks, called Model Label Learning (MLL). The proposal contains three key modules: \emph{model labeling}, which assigns labels to each VLM to describe their specialty and utility; \emph{model selection}, which matches the requirements of the target task with model labels; and \emph{model reuse}, which applies selected VLMs to the target task in an ensemble manner. The proposal is highly computationally efficient and growable, since the model labeling process is completed independently of the target task and its capability can grow with the number of candidate VLMs. We also introduce a new benchmark for evaluating VLM selection methods, including 49 VLMs and 17 target task datasets. Experimental results clearly demonstrate the effectiveness of the proposed method for selecting and reusing VLMs.
Paperid:1368
Authors:Joshua Kazdan · Rylan Schaeffer · Apratim Dey · Matthias Gerstgrasser · Rafael Rafailov · David Donoho · Sanmi Koyejo
Abstract:
What happens when generative machine learning models are pretrained on web-scale datasets containing data generated by earlier models? Some prior work warns of “model collapse” as the web is overwhelmed by synthetic data; other work suggests the problem can be contained (i.e., collapse can be avoided) by managing how available data are used in pretraining. In this paper, we report experiments on three ways of using data (training-workflows), across three generative model task-settings (multivariate Gaussian estimation, kernel density estimation, and language-model fine-tuning), to further confirm the possibility of containment: (a) we confirm that the training-workflow of {\it replacing} all real data by successive generations of purely synthetic data indeed suffers model collapse in all task-settings studied; (b) we consider the training-workflow of {\it accumulating} synthetic data alongside real data and training on all data combined, and confirm that, although the proportion of real data eventually becomes zero, models remain stable and their test losses do not diverge under this training-workflow; (c) we consider a training-workflow where real and synthetic data accumulate together but successive generations of pretraining are constrained to use fixed-size data subsets each generation. In this workflow, we observe slow and gradual rather than explosive degradation of test loss performance across generations. Our insights are particularly important when forecasting whether future frontier generative models will collapse or thrive, and our results open avenues for empirically and mathematically studying the context-dependent value of synthetic data.
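The replace-versus-accumulate workflows in (a) and (b) can be sketched as a toy version of the simplest task-setting above, univariate Gaussian estimation; the generation size, seed, and helper name are illustrative assumptions, not the paper's protocol.

```python
import random

def next_generation(data, workflow, n=1000, rng=random):
    """Fit a Gaussian to `data`, draw n synthetic samples from the fit,
    then either REPLACE the data or ACCUMULATE alongside it."""
    mu = sum(data) / len(data)
    var = sum((x - mu) ** 2 for x in data) / len(data)
    synthetic = [rng.gauss(mu, var ** 0.5) for _ in range(n)]
    return synthetic if workflow == "replace" else data + synthetic

rng = random.Random(0)
real = [rng.gauss(0.0, 1.0) for _ in range(1000)]
replaced, accumulated = real, real
for _ in range(5):
    replaced = next_generation(replaced, "replace", rng=rng)
    accumulated = next_generation(accumulated, "accumulate", rng=rng)
```

Under "replace", each generation fits only the previous generation's synthetic draws, so estimation error compounds; under "accumulate", the original real samples remain in every fit, anchoring the estimates.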
Paperid:1369
Authors:ERIC EATON · Marcel Hussing · Michael Kearns · Aaron Roth · Sikata Sengupta · Jessica Sorrell
Abstract:
In traditional reinforcement learning (RL), the learner aims to solve a single-objective optimization problem: find the policy that maximizes expected reward. However, in many real-world settings, it is important to optimize over multiple objectives simultaneously. For example, when we are interested in fairness, states might have feature annotations corresponding to multiple (intersecting) demographic groups to whom reward accrues, and our goal might be to maximize the reward of the group receiving the minimal reward. In this work, we consider a multi-objective optimization problem in which each objective is defined by a state-based reweighting of a single scalar reward function. This generalizes the problem of maximizing the reward of the minimum-reward group. We provide oracle-efficient algorithms to solve these multi-objective RL problems even when the number of objectives is very large --- for tabular MDPs, as well as for large MDPs when the group functions have additional structure. The contribution of this paper is that we are able to solve this class of multi-objective RL problems with a possibly exponentially large class of constraints over intersecting groups in both tabular and large state space MDPs in an oracle-efficient manner. Finally, we experimentally validate our theoretical results and demonstrate applications on a preferential attachment graph MDP.
Paperid:1370
Authors:Onat Dalmaz · Arjun Desai · Reinhard Heckel · Tolga Cukur · Akshay Chaudhari · Brian Hargreaves
Abstract:
Accelerated MRI reconstruction involves solving an ill-posed inverse problem where noise in acquired data propagates to the reconstructed images. Noise analyses are central to MRI reconstruction for providing an explicit measure of solution fidelity and for guiding the design and deployment of novel reconstruction methods. However, deep learning (DL)-based reconstruction methods have often overlooked noise propagation due to inherent analytical and computational challenges, despite its critical importance. This work proposes a theoretically grounded, memory-efficient technique to calculate voxel-wise variance for quantifying uncertainty due to acquisition noise in accelerated MRI reconstructions. Our approach is based on approximating the noise covariance using the DL network's Jacobian, which is intractable to calculate. To circumvent this, we derive an unbiased estimator for the diagonal of this covariance matrix (the voxel-wise variance) and introduce a Jacobian sketching technique to implement it efficiently. We evaluate our method on knee and brain MRI datasets for both data-driven and physics-driven networks trained in supervised and unsupervised manners. Compared to empirical references obtained via Monte-Carlo simulations, our technique achieves near-equivalent performance while reducing computational and memory demands by an order of magnitude or more. Furthermore, our method is robust across varying input noise levels, acceleration factors, and diverse undersampling schemes, highlighting its broad applicability. Our work reintroduces accurate and efficient noise analysis as a central tenet of reconstruction algorithms, holding promise to reshape how we evaluate and deploy DL-based MRI. Our code will be made publicly available upon acceptance.
Paperid:1371
Authors:Changshuo Liu · Lingze Zeng · Kaiping Zheng · Shaofeng Cai · Beng Chin Ooi · James Yip
Abstract:
Electronic health records (EHR) aggregate extensive data critical for advancing patient care and refining intervention strategies. EHR data is essential for epidemiological studies, more commonly referred to as cohort studies, where patients with shared characteristics or similar diseases are analyzed over time. Unfortunately, existing studies on cohort modeling are limited, struggling to derive fine-grained cohorts or effectively utilize cohort information, which hinders their ability to uncover intrinsic relationships between cohorts. To this end, we propose NeuralCohort, a cohort-aware neural representation learning method that precisely segments patients into finer-grained cohorts via an innovative cohort contextualization mechanism and captures both intra- and inter-cohort information using a Bi-scale Cohort Learning Module. Designed as a plug-in, NeuralCohort integrates seamlessly with existing backbone models, enhancing their cohort analysis capabilities by infusing deep cohort insights into the representation learning process. The effectiveness and generalizability of NeuralCohort are validated across extensive real-world EHR datasets. Experimental results demonstrate that NeuralCohort consistently improves the performance of various backbone models, achieving up to an 8.1% increase in AUROC.
Paperid:1372
Authors:Davide Adamo · Marco Corneli · Manon Vuillien · Emmanuelle Vila
Abstract:
Due to its invariance to rigid transformations such as rotations and reflections, Procrustes-Wasserstein (PW) was introduced in the literature as an optimal transport (OT) distance, alternative to Wasserstein and more suited to tasks such as the alignment and comparison of point clouds. With that application in mind, we carefully build a space of discrete probability measures and show that over that space PW actually is a distance. Algorithms to solve the PW problem already exist; however, we extend the PW framework by discussing and testing several initialization strategies. We then introduce the notion of a PW barycenter and detail an algorithm to estimate it from data. The result is a new method to compute representative shapes from a collection of point clouds. We benchmark our method against existing OT approaches, demonstrating superior performance in scenarios requiring precise alignment and shape preservation. We finally show the usefulness of PW barycenters in an archaeological context. Our results highlight the potential of PW in advancing 2D and 3D point cloud analysis for machine learning and computational geometry applications.
Paperid:1373
Authors:Alexander Wettig · Luca Soldaini · Kyle Lo · Sewon Min · Danqi Chen · Hannaneh Hajishirzi
Abstract:
Modern language models are trained on large, unstructured datasets consisting of trillions of tokens, obtained by crawling the web. This unstructured nature makes it difficult to reason about their contents and develop systematic approaches to data curation. In this paper, we unpack monolithic web corpora by developing taxonomies of their contents and organizing them into domains. We introduce WebOrganizer, a framework for organizing web pages in terms of both their topic and format. Using these two complementary notions of domains, we automatically annotate pretraining data by distilling annotations from a large language model into efficient classifiers. This allows us to study how data from different domains should be mixed to improve models on downstream tasks, and we show that we can combine insights about effective topics and formats to further boost performance. We demonstrate that our domain mixing also improves existing methods that select data based on quality. Furthermore, we study and compare how quality-based methods implicitly change the domain mixture. Overall, our work demonstrates that constructing and mixing domains provides a valuable complement to quality-based data curation methods, opening new avenues for effective and insightful pre-training data curation.
Paperid:1374
Authors:Ziyuan Zhou · Guanjun Liu · Mengchu Zhou · Guo
Abstract:
The performance of models trained by Multi-Agent Reinforcement Learning (MARL) is sensitive to perturbations in observations, lowering their trustworthiness in complex environments. Adversarial training is a valuable approach to enhancing their robustness. However, existing methods often overfit to adversarial perturbations of observations and fail to incorporate prior information about the policy adopted by their protagonist agent, i.e., the primary one being trained. To address this important issue, this paper introduces Adversarial Training with Stochastic Adversary (ATSA), where the proposed adversary is trained online alongside the protagonist agent. The adversary consists of a Stochastic Director (SDor) and an SDor-guided generaTor (STor). SDor performs policy perturbations by minimizing the expected team reward of protagonists and maximizing the entropy of its policy, while STor generates adversarial perturbations of observations by following SDor's guidance. We prove that SDor's soft policy converges to a global optimum according to factorized maximum-entropy MARL and leads to the optimal adversary. This paper also introduces an SDor-STor loss function to quantify the difference between a) perturbations in the agent's policy and b) those advised by SDor. We evaluate ATSA on StarCraft II tasks and autonomous driving scenarios, demonstrating that a) it is robust against diverse perturbations of observations while maintaining outstanding performance in perturbation-free environments, and b) it outperforms the state-of-the-art methods.
Paperid:1375
Authors:Halil Alperen Gozeten · Muhammed Emrullah Ildiz · Xuechen Zhang · Mahdi Soltanolkotabi · Marco Mondelli · Samet Oymak
Abstract:
Test-time training (TTT) methods explicitly update the weights of a model to adapt to the specific test instance, and they have found success in a variety of settings, including most recently language modeling and reasoning. To demystify this success, we investigate a gradient-based TTT algorithm for in-context learning, where we train a transformer model on the in-context demonstrations provided in the test prompt. Specifically, we provide a comprehensive theoretical characterization of linear transformers when the update rule is a single gradient step. Our theory (i) delineates the role of alignment between the pretraining distribution and the target task, (ii) demystifies how TTT can alleviate distribution shift, and (iii) quantifies the sample complexity of TTT, including how it can significantly reduce the eventual sample size required for in-context learning. As our empirical contribution, we study the benefits of TTT for TabPFN, a tabular foundation model. In line with our theory, we demonstrate that TTT significantly reduces the required sample size for tabular classification (3 to 5 times fewer), unlocking substantial inference efficiency with a negligible training cost.
Paperid:1376
Authors:Jen-Tse Huang · Jiaxu Zhou · Tailin Jin · Xuhui Zhou · Zixi Chen · Wenxuan Wang · Youliang Yuan · Michael Lyu · Maarten Sap
Abstract:
Large language model-based multi-agent systems have shown great abilities across various tasks due to the collaboration of expert agents, each focusing on a specific domain. However, the impact of clumsy or even malicious agents—those who frequently make errors in their tasks—on the overall performance of the system remains underexplored. This paper investigates: (1) What is the resilience of various system structures (e.g., A$\rightarrow$B$\rightarrow$C, A$\leftrightarrow$B$\leftrightarrow$C) under faulty agents, on different downstream tasks? (2) How can we increase system resilience to defend against these agents? To simulate faulty agents, we propose two approaches—AutoTransform and AutoInject—which introduce mistakes into the agents' responses. We select four downstream tasks, including code generation, math problems, translation, and text evaluation. Results suggest that the ``hierarchical'' structure, i.e., A$\rightarrow$(B$\leftrightarrow$C), exhibits superior resilience with the lowest performance drop of $9.2\%$, compared to $26.0\%$ and $31.2\%$ for the other two structures. Additionally, we improve system resilience with two methods: introducing a mechanism for each agent to challenge others' outputs, and adding an agent to review and correct messages. Our code and data are available in the supplementary materials and will be made publicly available upon publication.
Paperid:1377
Authors:Jan Schuchardt · Mina Dalirrooyfard · Jed Guzelkabaagac · Anderson Schneider · Yuriy Nevmyvaka · Stephan Günnemann
Abstract:
Many forms of sensitive data, such as web traffic, mobility data, or hospital occupancy, are inherently sequential. The standard method for training machine learning models while ensuring privacy for units of sensitive information, such as individual hospital visits, is differentially private stochastic gradient descent (DP-SGD). However, we observe in this work that the formal guarantees of DP-SGD are incompatible with time-series-specific tasks like forecasting, since they rely on the privacy amplification attained by training on small, unstructured batches sampled from an unstructured dataset. In contrast, batches for forecasting are generated by (1) sampling sequentially structured time series from a dataset, (2) sampling contiguous subsequences from these series, and (3) partitioning them into context and ground-truth forecast windows. We theoretically analyze the privacy amplification attained by this structured subsampling to enable the training of forecasting models with sound and tight event- and user-level privacy guarantees. Towards more private models, we additionally prove how data augmentation amplifies privacy in self-supervised training of sequence models. Our empirical evaluation demonstrates that amplification by structured subsampling enables the training of forecasting models with strong formal privacy guarantees.
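The three-step batch generation described above can be sketched as follows; the function name and argument layout are illustrative, and the DP accounting itself is not shown:

```python
import numpy as np

def forecasting_batch(dataset, batch_size, context_len, forecast_len, rng):
    L = context_len + forecast_len
    contexts, targets = [], []
    for _ in range(batch_size):
        series = dataset[rng.integers(len(dataset))]   # (1) sample a series
        start = rng.integers(len(series) - L + 1)      # (2) contiguous window
        window = series[start:start + L]
        contexts.append(window[:context_len])          # (3) context ...
        targets.append(window[context_len:])           #     ... and forecast
    return np.stack(contexts), np.stack(targets)

rng = np.random.default_rng(0)
data = [np.arange(50, dtype=float) for _ in range(4)]
ctx, tgt = forecasting_batch(data, batch_size=8, context_len=12,
                             forecast_len=4, rng=rng)
```

Unlike Poisson-subsampled DP-SGD batches, every element of such a batch is a structured slice of one series, which is what the paper's amplification analysis has to account for.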
Paperid:1378
Authors:Haonan Yuan · Qingyun Sun · Junhua Shi · Xingcheng Fu · Bryan Hooi · Jianxin Li · Philip Yu
Abstract:
Graph Foundation Models hold significant potential for advancing multi-domain graph learning, yet their full capabilities remain largely untapped. Existing works show promising task performance with the “pretrain-then-prompt” paradigm, which lacks theoretical foundations to understand why it works and how much knowledge can be transferred from source domains to the target. In this paper, we introduce BRIDGE, a novel Bounded gRaph pre-training and prompt learnIng framework from multi-Domains with Generalization Error. To learn discriminative source knowledge, we align multi-domain graph features with a set of domain-invariant aligners during pre-training. A lightweight Mixture of Experts (MoE) network is then proposed to facilitate downstream prompting through self-supervised selective knowledge assembly and transfer. Further, to determine the maximum amount of transferable knowledge, we derive an optimizable upper bound for the generalization error from a graph spectral perspective under Lipschitz continuity. Extensive experiments demonstrate the superiority of BRIDGE on downstream node and graph classifications compared with 15 state-of-the-art baselines.
Paperid:1379
Authors:Katie Kang · Amrith Setlur · Dibya Ghosh · Jacob Steinhardt · Claire Tomlin · Sergey Levine · Aviral Kumar
Abstract:
Modern large language models (LLMs) excel at fitting fine-tuning data, but often struggle on unseen examples. In order to teach models genuine reasoning abilities rather than superficial pattern matching, our work aims to better understand how the learning dynamics of LLM fine-tuning shape downstream generalization. Our analysis focuses on reasoning tasks, whose problem structure allows us to distinguish between memorization (the exact replication of reasoning steps from the training data) and performance (the correctness of the final solution). We find that a model's performance on test prompts can be effectively characterized by a training metric we call pre-memorization train accuracy: the accuracy of model samples on training queries before they begin to copy the exact reasoning steps from the training set. At the dataset level, this metric almost perfectly predicts test accuracy, achieving $R^2 \geq 0.9$ across various models (Llama3 8B, Gemma2 9B), datasets (GSM8k, MATH), and training configurations. At the per-example level, this metric is also indicative of whether individual model predictions are robust to perturbations in the training query. By connecting a model's learning dynamics to test performance, pre-memorization train accuracy can inform training decisions, such as the makeup of the training data. Our experiments on data curation show that prioritizing examples with low pre-memorization accuracy leads to 1.5-2x improvements in data efficiency compared to i.i.d. data scaling and other data scaling techniques.
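A toy illustration of the pre-memorization train accuracy metric, under the simplifying assumption that memorization is detected by exact string match of the reasoning steps (the flat record format is hypothetical, not the paper's):

```python
def pre_memorization_accuracy(records):
    # Each record: (generated_steps, reference_steps, is_correct) for one
    # training query. Samples that already copy the reference reasoning
    # verbatim count as memorized and are excluded from the metric.
    kept = [ok for gen, ref, ok in records if gen != ref]
    return sum(kept) / len(kept) if kept else 0.0

records = [
    ("2+2=4 so 8", "2+3=5 then 8", True),    # correct without copying
    ("2+3=5 then 8", "2+3=5 then 8", True),  # verbatim copy: excluded
    ("2+2=5 so 9", "2+3=5 then 8", False),   # wrong, not copied
]
acc = pre_memorization_accuracy(records)
```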
Paperid:1380
Authors:Naram Mhaisen · George Iosifidis
Abstract:
We revisit the Follow the Regularized Leader (FTRL) framework for Online Convex Optimization (OCO) over compact sets, focusing on achieving dynamic regret guarantees. Prior work has highlighted the framework’s limitations in dynamic environments due to its tendency to produce "lazy" iterates. However, building on insights showing FTRL's ability to produce "agile" iterates, we show that it can indeed recover known dynamic regret bounds through optimistic composition of future costs and careful linearization of past costs, which can lead to pruning some of them. This new analysis of FTRL against dynamic comparators leads to key results, including refined control over regret terms, optimism without cyclic dependence, and the application of minimal recursive regularization akin to AdaFTRL. More broadly, we show that it is not the "lazy" projection style of FTRL that hinders (optimistic) dynamic regret, but the decoupling of the algorithm’s state (linearized history) from its iterates, allowing the state to grow arbitrarily. Instead, pruning synchronizes these two when necessary.
Paperid:1381
Authors:Wenke Huang · Jian Liang · Guancheng Wan · Didi Zhu · He Li · Jiawei Shao · Mang Ye · Bo Du · Dacheng Tao
Abstract:
Fine-tuning Multimodal Large Language Models (MLLMs) in multi-task learning scenarios has emerged as an effective strategy for achieving cross-domain specialization. However, multi-task fine-tuning frequently induces performance degradation on open-response datasets. We posit that free-form answer generation primarily depends on language priors, and that strengthening the integration of visual behavioral cues is critical for enhancing prediction robustness. In this work, we propose Noise Resilient Confidence Alignment to address the challenge of open-response overfitting during multi-task fine-tuning. Our approach prioritizes maintaining consistent prediction patterns in MLLMs across varying visual input qualities. To achieve this, we employ Gaussian perturbations to synthesize distorted visual inputs and enforce token prediction confidence alignment towards the normal visual branch. By explicitly linking confidence calibration to visual robustness, this method reduces over-reliance on language priors. We conduct extensive empirical evaluations across diverse multi-task downstream settings with popular MLLM architectures. The comprehensive experiments demonstrate the effectiveness of our method, showcasing its ability to alleviate open-response overfitting while maintaining satisfactory multi-task fine-tuning performance.
Paperid:1382
Authors:Qingchuan Ma · Yuhang Wu · Xiawu Zheng · Rongrong Ji
Abstract:
In this paper, we aim to establish a simple, effective, and theoretically grounded benchmark for rigorously probing abstract reasoning in Large Language Models (LLMs). To achieve this, we first develop a mathematical framework that defines abstract reasoning as the ability to: (i) extract essential patterns independent of surface representations, and (ii) apply consistent rules to these abstract patterns. Based on this framework, we introduce two novel complementary metrics: Γ measures basic reasoning accuracy, while ∆ quantifies a model's reliance on specific symbols rather than underlying patterns, a key indicator of true abstraction versus mere memorization. To implement this measurement, we design a benchmark: systematic symbol remapping in rule-based tasks, which forces models to demonstrate genuine pattern recognition beyond superficial token matching. Extensive LLM evaluations using this benchmark (commercial API models, 7B-70B, multi-agent) reveal: 1) critical limitations in non-decimal arithmetic and symbolic reasoning; 2) persistent abstraction gaps despite chain-of-thought prompting; and 3) ∆'s effectiveness in robustly measuring memory dependence by quantifying performance degradation under symbol remapping, particularly highlighting operand-specific memorization. These findings underscore that current LLMs, despite domain-specific strengths, still lack robust abstract reasoning, highlighting key areas for future improvement.
Paperid:1383
Authors:Zeqiong Lv · Chao Qian · Yun Liu · Jiahao Fan · Yanan Sun
Abstract:
Evolutionary neural architecture search (ENAS) is a key part of evolutionary machine learning, which commonly utilizes evolutionary algorithms (EAs) to automatically design high-performing deep neural architectures. Over the past years, various ENAS methods have been proposed with exceptional performance. However, theoretical research on ENAS is still in its infancy. In this work, we take a step toward the runtime analysis, an essential theoretical aspect of EAs, of ENAS on multi-class classification problems. Specifically, we first propose a benchmark to lay the groundwork for the analysis. Furthermore, we design a two-level search space, making it suitable for multi-class classification problems and consistent with the common settings of ENAS. Based on both designs, we consider (1+1)-ENAS algorithms with one-bit and bit-wise mutations, and analyze their upper and lower bounds on the expected runtime. We prove that the algorithm using both mutations can find the optimum with an expected runtime upper bound of $O(rM\ln{rM})$ and lower bound of $\Omega(rM\ln{M})$. This suggests that the simple one-bit mutation deserves greater consideration, given that most state-of-the-art ENAS methods are laboriously designed with the bit-wise mutation. Empirical studies also support our theoretical proof.
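As a generic illustration of the analyzed scheme, here is a (1+1) EA with one-bit mutation on a plain bit string; the OneMax fitness is a toy stand-in for the paper's two-level search space, and all names are ours:

```python
import random

def one_plus_one_ea(fitness, d, budget, rng):
    # (1+1) EA: keep a single parent, mutate, accept if not worse.
    x = [rng.randrange(2) for _ in range(d)]
    for _ in range(budget):
        y = x[:]
        i = rng.randrange(d)          # one-bit mutation: flip one position
        y[i] ^= 1
        if fitness(y) >= fitness(x):  # elitist acceptance
            x = y
    return x

rng = random.Random(0)
best = one_plus_one_ea(fitness=sum, d=10, budget=500, rng=rng)
```

On OneMax this mutation scheme needs on the order of $d \ln d$ steps in expectation, the same coupon-collector flavor as the $O(rM\ln rM)$ bound above.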
Paperid:1384
Authors:Le Tuyet Nhi PHAM · Dario Shariatian · Antonio Ocello · Giovanni Conforti · Alain Oliviero Durmus
Abstract:
This paper introduces the Discrete Markov Probabilistic Model (DMPM), a novel algorithm for discrete data generation. The algorithm operates in the space of bits $\{0,1\}^d$, where the noising process is a continuous-time Markov chain that can be sampled exactly via a Poissonian clock that flips labels uniformly at random. The time-reversal process, like the forward noise process, is a jump process, with its intensity governed by a discrete analogue of the classical score function. Crucially, this intensity is proven to be the conditional expectation of a function of the forward process, strengthening its theoretical alignment with score-based generative models while ensuring robustness and efficiency. We further establish convergence bounds for the algorithm under minimal assumptions and demonstrate its effectiveness through experiments on low-dimensional Bernoulli-distributed datasets and high-dimensional binary MNIST data. The results highlight its strong performance in generating discrete structures. This work bridges theoretical foundations and practical applications, advancing the development of effective and theoretically grounded discrete generative modeling.
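A rough sketch of the forward noising process, under our reading that each ring of the Poissonian clock flips one uniformly chosen bit; the exact clock parameterization here is an assumption, not the paper's:

```python
import numpy as np

def forward_noise(x0, t, rate, rng):
    # N ~ Poisson(rate * t * d) rings over [0, t]; each ring flips one
    # uniformly chosen coordinate (assumed form of the clock).
    x = x0.copy()
    d = len(x)
    for _ in range(rng.poisson(rate * t * d)):
        x[rng.integers(d)] ^= 1
    return x

rng = np.random.default_rng(0)
x0 = np.zeros(16, dtype=int)
xt = forward_noise(x0, t=1.0, rate=0.5, rng=rng)
```

For large $t$ the marginal approaches the uniform distribution on $\{0,1\}^d$; the generative model learns the intensity of the reverse jump process.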
Paperid:1385
Authors:Tiange Luo · Ang Cao · Gunhee Lee · Justin Johnson · Honglak Lee
Abstract:
Despite recent advances in Vision-Language Models (VLMs), they may over-rely on the visual language priors present in their training data rather than true visual reasoning. To investigate this, we introduce ViLP, a benchmark featuring deliberately out-of-distribution images synthesized via image generation models and out-of-distribution Q\&A pairs. Each question in ViLP is coupled with three potential answers and three corresponding images: one that can be resolved by text priors alone and two that demand visual reasoning. Although humans achieve near-perfect accuracy, modern VLMs falter; for instance, GPT-4 achieves only 66.17\% on ViLP. To alleviate this, we propose a self-improving framework in which models generate new VQA data, then apply pixel-level and semantic corruptions to form ``good-bad'' image pairs for self-training. Our training objectives compel VLMs to focus more on the actual visual inputs, and we demonstrate their effectiveness in boosting the performance of open-source VLMs, including LLaVA-v1.5 and Cambrian.
Paperid:1386
Authors:zhaowei chen · Borui Zhao · Yuchen Ge · Yuhao Chen · Renjie Song · Jiajun Liang
Abstract:
Online Knowledge Distillation (OKD) methods represent a streamlined, one-stage distillation training process that obviates the necessity of transferring knowledge from a pre-trained teacher network to a more compact student network. In contrast to existing logits-based OKD methods, this paper presents an innovative approach that leverages intermediate spatial representations. Our analysis of the intermediate features from both teacher and student models reveals two pivotal insights: (1) the similar features between students and teachers are predominantly focused on the foreground objects; (2) teacher models emphasize foreground objects more than students do. Building on these findings, we propose Asymmetric Decision-Making (ADM) to enhance feature consensus learning for student models while continuously promoting feature diversity in teacher models. Specifically, Consensus Learning for student models prioritizes spatial features with high consensus relative to teacher models. Conversely, Divergence Learning for teacher models highlights spatial features with lower similarity compared to student models, indicating superior performance by teacher models in these regions. Consequently, ADM helps the student models catch up with the feature learning process of the teacher models. Extensive experiments demonstrate that ADM consistently surpasses existing OKD methods across various online knowledge distillation settings and also achieves superior results when transferred to offline knowledge distillation, semantic segmentation, and diffusion distillation tasks.
Paperid:1387
Authors:Xuanlei Zhao · Shenggan Cheng · Chang Chen · Zangwei Zheng · Ziming Liu · Zheming Yang · Yang You
Abstract:
Scaling multi-dimensional transformers to long sequences is indispensable across various domains. However, the large memory requirements and slow speeds of such sequences necessitate sequence parallelism. All existing approaches fall under the category of embedded sequence parallelism, which is limited to sharding along a single sequence dimension, thereby introducing significant communication overhead. However, the nature of multi-dimensional transformers involves independent calculations across multiple sequence dimensions. To this end, we propose Dynamic Sequence Parallelism (DSP) as a novel abstraction of sequence parallelism. DSP dynamically switches the parallel dimension among all sequences according to the computation stage, with an efficient resharding strategy. DSP offers significant reductions in communication costs, adaptability across modules, and ease of implementation with minimal constraints. Experimental evaluations demonstrate DSP's superiority over state-of-the-art embedded sequence parallelism methods, achieving remarkable throughput improvements ranging from 32.2% to 10x with less than 25% of the communication volume.
Paperid:1388
Authors:Amrith Setlur · Nived Rajaraman · Sergey Levine · Aviral Kumar
Abstract:
Despite substantial improvements in LLM capabilities from scaling test-time compute, an ongoing debate in the community is how it should be scaled up so as to enable continued and efficient improvements with scaling. There are largely two approaches: first, distilling successful search procedures; and second, using verification (e.g., 0/1 correctness rewards, trained reward models, or verifiers) to guide reinforcement learning (RL) and search algorithms. In this paper, we prove that fine-tuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given an equal amount of data budget. Concretely, we show that the suboptimality of VF methods scales poorly with test-time compute budget (measured as the output token length or horizon) compared to VB when the base pre-trained LLM presents a heterogeneous distribution over correct solution traces (e.g., different lengths, styles, etc.) and admits a non-sharp distribution over rewards on traces sampled from it. We formalize this condition using anti-concentration [Erdos, 1945]. This implies a stronger result: VB methods scale better asymptotically, with the performance gap between VB and VF methods widening as the test-time compute budget grows. We corroborate our theoretical results empirically on both didactic and math reasoning problems with 3B/8B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.
Paperid:1389
Authors:Qi Ju · Thomas Tellier · Meng Sun · Zhemei Fang · YunFeng Luo
Abstract:
Artificial intelligence (AI) has surpassed top human experts in various games. In imperfect information games, this success primarily relies on Counterfactual Regret Minimization (CFR) and its variants to solve for Nash equilibria. However, existing research has predominantly focused on optimality, overlooking the importance of strategy diversity and the demand for varied play styles, thereby limiting AI’s adaptability to different user preferences. To address this limitation, we propose Preference CFR (Pref-CFR), a novel approach that introduces two key parameters: preference degree and vulnerability degree. These parameters allow AI to flexibly adjust its strategy distribution within an acceptable loss threshold, enhancing its adaptability to diverse strategic requirements. In our experiments, we successfully trained Aggressive and Loose Passive strategies using Pref-CFR in a Texas Hold’em environment. These strategies achieve performance comparable to standard CFR-trained strategies but exhibit distinctly different play styles. Furthermore, for certain hands, Pref-CFR generates strategies that significantly deviate from conventional expert theories and CFR-derived strategies, potentially providing valuable insights for professional players.
Paperid:1390
Authors:Yinong O Wang · Nivedha Sivakumar · Falaah Arif Khan · Katherine Metcalf · Adam Golinski · Natalie Mackraz · Barry-John Theobald · Luca Zappella · Nicholas Apostoloff
Abstract:
The recent rapid adoption of large language models (LLMs) highlights the critical need for benchmarking their fairness. Conventional fairness metrics, which focus on discrete accuracy-based evaluations (i.e., prediction correctness), fail to capture the implicit impact of model uncertainty (e.g., higher model confidence about one group over another despite similar accuracy). To address this limitation, we propose an uncertainty-aware fairness metric, UCerF, to enable a fine-grained evaluation of model fairness that is more reflective of the internal bias in model decisions. Furthermore, observing data size, diversity, and clarity issues in current datasets, we introduce a new gender-occupation fairness evaluation dataset with 31,756 samples for co-reference resolution, offering a more diverse and suitable benchmark for modern LLMs. Combining our metric and dataset, we provide insightful comparisons of eight open-source LLMs. For example, Mistral-8B exhibits suboptimal fairness due to high confidence in incorrect predictions, a detail overlooked by Equalized Odds but captured by UCerF. Overall, this work provides a holistic framework for LLM evaluation by jointly assessing fairness and uncertainty, enabling the development of more transparent and accountable AI systems.
Paperid:1391
Authors:Jeremy McMahan
Abstract:
We study the computational complexity of approximating general constrained Markov decision processes. Our primary contribution is the design of a polynomial-time $(0,\epsilon)$-additive bicriteria approximation algorithm for finding optimal constrained policies across a broad class of recursively computable constraints, including almost-sure, chance, expectation, and their anytime variants. Matching lower bounds imply our approximation guarantees are optimal so long as $P \neq NP$. The generality of our approach yields answers to several long-standing open complexity questions in the constrained reinforcement learning literature. Specifically, we are the first to prove polynomial-time approximability for the following settings: policies under chance constraints, deterministic policies under multiple expectation constraints, policies under non-homogeneous constraints (i.e., constraints of different types), and policies under constraints for continuous-state processes.
Paperid:1392
Authors:Shengyu Feng · Yiming Yang
Abstract:
This work proposes a simple yet effective sampling framework for combinatorial optimization (CO). Our method builds on discrete Langevin dynamics (LD), an efficient gradient-guided generative algorithm. However, we observed that directly applying LD often leads to limited exploration. To overcome this limitation, we propose Regularized Langevin Dynamics (RLD), which enforces an expected distance between the sampled and current solutions, effectively avoiding local minima. We develop two CO solvers on top of RLD, one based on simulated annealing (SA) and the other on a neural network (NN). Empirical results on three classical CO problems demonstrate that both of our methods achieve comparable or better performance than the previous state-of-the-art (SOTA) SA and NN-based solvers. In particular, our SA algorithm reduces the running time of the previous SOTA SA method by up to 80\%, while achieving equal or superior performance. In summary, RLD offers a promising framework for enhancing both traditional heuristics and NN models to solve CO problems.
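One loose reading of the distance regularization, sketched as a Metropolis-style step with a penalty on deviating from a target Hamming distance; the paper's gradient-guided discrete Langevin update is more elaborate, and all names here are ours:

```python
import numpy as np

def rld_step(x, energy, rng, temp=1.0, target_dist=3, reg=0.5):
    # Propose roughly target_dist coordinate flips, then accept/reject
    # with an energy penalized by deviation from the target distance.
    d = len(x)
    flips = rng.random(d) < target_dist / d
    y = np.where(flips, 1 - x, x)
    dist = int(np.sum(y != x))
    delta = energy(y) - energy(x) + reg * (dist - target_dist) ** 2
    if delta <= 0 or rng.random() < np.exp(-delta / temp):
        return y
    return x

rng = np.random.default_rng(0)
energy = lambda z: -int(np.sum(z))   # toy energy: prefer more ones
x = np.zeros(8, dtype=int)
for _ in range(50):
    x = rld_step(x, energy, rng)
```

Forcing the chain to move a nonzero expected distance each step is what discourages it from stalling in a local minimum.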
Paperid:1393
Authors:Zihan Zhou · Yang Zhou · Zijie Zhang · Lingjuan Lyu · Da Yan · Ruoming Jin · Dejing Dou
Abstract:
Machine unlearning (MU) aims to remove the influence of specific data points from trained models, enhancing compliance with privacy regulations. However, the vulnerability of basic MU models to malicious unlearning requests in adversarial learning environments has been largely overlooked. Existing adversarial MU attacks suffer from three key limitations: inflexibility due to predefined attack targets, inefficiency in handling multiple attack requests, and instability caused by non-convex loss functions. To address these challenges, we propose a Flexible, Efficient, and Stable Attack (FESA). First, leveraging Carathéodory’s theorem, we introduce a convex polyhedral approximation to identify points in the loss landscape where convexity approximately holds, ensuring stable attack performance. Second, inspired by simplex theory and John’s theorem, we develop a regular simplex detection technique that maximizes coverage over the parameter space, improving attack flexibility and efficiency. We theoretically derive the proportion of the effective parameter space occupied by the constructed simplex. We evaluate the attack success rate of our FESA method on real datasets against state-of-the-art machine unlearning attack methods.
Paperid:1394
Authors:Yiming Yang · Xiaoyuan Cheng · Daniel Giles · Sibo Cheng · Yi He · Xiao Xue · Boli Chen · Yukun Hu
Abstract:
Variational data assimilation estimates the dynamical system states by minimizing a cost function that fits the numerical models to the observational data. Although four-dimensional variational assimilation (4D-Var) is widely used, it faces high computational costs in complex nonlinear systems and depends on imperfect state-observation mappings. Deep learning (DL) offers more expressive approximators, but integrating DL models into 4D-Var is challenging due to their nonlinearities and lack of theoretical guarantees in assimilation results. In this paper, we propose \textit{Tensor-Var}, a novel framework that integrates kernel conditional mean embedding (CME) with 4D-Var to linearize nonlinear dynamics, achieving convex optimization in a learned feature space. Moreover, our method provides a new perspective for solving 4D-Var in a linear way, offering theoretical guarantees of consistent assimilation results between the original and feature spaces. To handle large-scale problems, we propose a method to learn deep features (DFs) using neural networks within the Tensor-Var framework. Experiments on chaotic systems and global weather prediction with real-time observations show that Tensor-Var outperforms conventional and DL hybrid 4D-Var baselines in accuracy while achieving a 10- to 20-fold speed improvement.
Paperid:1395
Authors:Prateek Gupta · Andrea Ferrari · Nabil Iqbal
Abstract:
The notion of duality -- that a given physical system can have two different mathematical descriptions -- is a key idea in modern theoretical physics. Establishing a duality in lattice statistical mechanics models requires the construction of a dual Hamiltonian and a map from the original to the dual observables. By using neural networks to parameterize these maps and introducing a loss function that penalises the difference between correlation functions in the original and dual models, we formulate the process of duality discovery as an optimization problem. We numerically solve this problem and show that our framework can rediscover the celebrated Kramers-Wannier duality for the 2d Ising model, numerically reconstructing the known mapping of temperatures. We discuss future directions and prospects for discovering new dualities within this framework.
Paperid:1396
Authors:Ron Shapira Weber · shahar benishay · Shahaf E. Finder · Andrey Lavrinenko · Oren Freifeld
Abstract:
Fast and scalable alignment of time series is a fundamental challenge in many domains. The standard solution, Dynamic Time Warping (DTW), struggles with poor scalability and sensitivity to noise. We introduce TimePoint, a self-supervised method that dramatically accelerates DTW-based alignment, while typically improving alignment accuracy, by learning keypoints and descriptors from synthetic data. Inspired by 2D keypoint detection but carefully adapted to the unique challenges of 1D signals, TimePoint leverages efficient 1D diffeomorphisms—which effectively model nonlinear time warping—to generate realistic training data. This adaptation, along with fully convolutional and wavelet convolutional architectures, enables the extraction of informative keypoints and descriptors. Applying DTW to these sparse representations yields major speedups and typically higher alignment accuracy than standard DTW applied to the full signals. Despite being trained solely on synthetic data, TimePoint generalizes well to real-world time series. Extensive experiments demonstrate that TimePoint consistently achieves faster and more accurate alignments than standard DTW, making it a scalable solution for time-series analysis. We will release the code and dataset upon acceptance.
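For reference, the classic dynamic-programming DTW that TimePoint accelerates; applying it to short keypoint sequences instead of full signals is where the quadratic cost is saved:

```python
import numpy as np

def dtw(a, b):
    # D[i, j] = minimal warped alignment cost of a[:i] and b[:j].
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Since the recurrence costs O(nm), shrinking both sequences to a handful of keypoints gives a quadratic reduction in work.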
Paperid:1397
Authors:Tianyu Cui · Song-Jun Xu · Artem Moskalev · Shuwei Li · Tommaso Mansi · Mangal Prakash · Rui Liao
Abstract:
Inferring Gene Regulatory Networks (GRNs) from gene expression data is crucial for understanding biological processes. While supervised models are reported to achieve high performance for this task, they rely on costly ground truth (GT) labels and risk learning gene-specific biases—such as class imbalances of GT interactions—rather than true regulatory mechanisms. To address these issues, we introduce InfoSEM, an unsupervised generative model that leverages textual gene embeddings as informative priors, improving GRN inference without GT labels. InfoSEM can also integrate GT labels as an additional prior when available, avoiding biases and further enhancing performance. Additionally, we propose a biologically motivated benchmarking framework that better reflects real-world applications such as biomarker discovery and reveals learned biases of existing supervised methods. InfoSEM outperforms existing models by 38.5% across four datasets when using textual embedding priors, and further boosts performance by 11.1% when integrating labeled data as priors.
Paperid:1398
Authors:Renpu Liu · Hao Wu · Jiawei Zhang · Xin Cheng · Xiangyang Luo · Bin Ma · Jinwei Wang
Abstract:
Adversarial examples have been shown to deceive Deep Neural Networks (DNNs), raising widespread concerns about this security threat. More seriously, as different DNN models share critical features, feature-level attacks can generate transferable adversarial examples, thereby deceiving black-box models in real-world scenarios. Nevertheless, we have theoretically discovered the principle behind the limited transferability of existing feature-level attacks: their attack effectiveness is essentially equivalent to perturbing features once in the feature space along the direction of feature importance, despite performing multiple perturbations in the pixel space. This finding indicates that existing feature-level attacks are inefficient in disrupting features through multiple pixel-space perturbations. To address this problem, we propose P2FA, an attack that efficiently perturbs features multiple times. Specifically, we directly shift the perturbed space from pixel to feature space. Then, we perturb the features multiple times rather than just once in the feature space, with the guidance of feature importance, to enhance the efficiency of disrupting critical shared features. Finally, we invert the perturbed features back to pixels to generate more transferable adversarial examples. Numerous experimental results strongly demonstrate the superior transferability of P2FA over SOTA attacks.
Paperid:1399
Authors:Weimin Wu · Maojiang Su · Jerry Yao-Chieh Hu · Zhao Song · Han Liu
Abstract:
We investigate the transformer's capability for in-context learning (ICL) to simulate the training process of deep models. Our key contribution is providing a positive example of using a transformer to train a deep neural network by gradient descent in an implicit fashion via ICL. Specifically, we provide an explicit construction of a $(2N+4)L$-layer transformer capable of simulating $L$ gradient descent steps of an $N$-layer ReLU network through ICL. We also give theoretical guarantees for the approximation within any given error and for the convergence of the ICL gradient descent. Additionally, we extend our analysis to the more practical setting of softmax-based transformers. We validate our findings on synthetic datasets for 3-layer, 4-layer, and 6-layer neural networks. The results show that ICL performance matches that of direct training.
Paperid:1400
Authors:Longhui Zhang · Bin Wang · Jiahao Wang · Xiaofeng Zhao · Min Zhang · Hao yang · Meishan Zhang · YU LI · Jing Li · Jun Yu · Min Zhang
Abstract:
Large language models (LLMs) have made significant strides in code translation tasks. However, ensuring both the correctness and readability of translated code remains a challenge, limiting their effective adoption in real-world software development. In this work, we propose F2STrans, a function-to-style guiding paradigm designed to progressively improve the performance of LLMs in code translation. Our approach comprises two key stages: (1) Functional learning, which optimizes translation correctness using high-quality source-target code pairs mined from online programming platforms, and (2) Style learning, which improves translation readability by incorporating both positive and negative style examples. Additionally, we introduce a novel code translation benchmark that includes up-to-date source code, extensive test cases, and manually annotated ground-truth translations, enabling comprehensive functional and stylistic evaluations. Experiments on both our new benchmark and existing datasets demonstrate that our approach significantly improves code translation performance. Notably, our approach enables Qwen-1.5B to outperform prompt-enhanced Qwen-32B and GPT-4 on average across 20 diverse code translation scenarios.
Paperid:1401
Authors:Ken Ziyu Liu · Christopher A. Choquette Choo · Matthew Jagielski · Peter Kairouz · Sanmi Koyejo · Nicolas Papernot · Percy Liang
Abstract:
An important question today is whether a given text was used to train a large language model (LLM). A \emph{completion} test is often employed: check if the LLM completes a sufficiently complex text. But this requires a ground-truth definition of membership; most commonly, a text is defined as a member based on the $n$-gram overlap between the target text and any text in the dataset. In this work, we demonstrate that this $n$-gram based membership definition can be effectively gamed. We study scenarios where sequences are \emph{non-members} for a given $n$ and find that completion tests still succeed. We find many natural cases of this by retraining LLMs after removing all training samples that were completed: these cases include exact duplicates, near-duplicates, and even short overlaps; they showcase that it is difficult to find a single viable choice of $n$. Using these insights, we design adversarial datasets that can cause any target sequence to be completed without containing it, for any reasonable choice of $n$. Our findings highlight the inadequacy of $n$-gram membership, suggesting membership definitions fail to account for auxiliary information available to the training algorithm.
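A minimal sketch of the word-level $n$-gram membership notion this abstract critiques (an illustrative simplification, not the paper's exact definition):

```python
def ngram_member(target: str, corpus_texts, n: int = 8) -> bool:
    """Generic n-gram membership test: the target counts as a 'member' if any
    word-level n-gram it contains also appears in some training text.
    Illustrative sketch only, not the paper's exact formulation."""
    def ngrams(text):
        toks = text.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    target_grams = ngrams(target)
    # Membership holds if any n-gram is shared with any corpus document.
    return any(target_grams & ngrams(t) for t in corpus_texts)
```

The paper's point is that a sequence can be a non-member under any such overlap threshold $n$ yet still be completed by the model.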
Paperid:1402
Authors:Maricela Best Mckay · Avleen Kaur · Chen Greif · Brian Wetton
Abstract:
Recently, natural gradient methods have emerged as a promising alternative to gradient descent for problems in machine learning, where update directions in the parameter space pose optimization challenges. One such class of problems that can be notoriously challenging to optimize is physics-informed neural networks (PINNs). Natural gradient methods for PINNs have achieved state-of-the-art performance with errors several orders of magnitude smaller than those achieved by standard optimizers such as ADAM or L-BFGS. However, computing natural gradients for PINNs is prohibitively computationally costly and memory-intensive for all but small neural network architectures. We develop a randomized algorithm for natural gradient descent for PINNs that uses sketching to approximate the natural gradient descent direction. We prove that the change-of-coordinate Gram matrix used in a natural gradient descent update has rapidly decaying eigenvalues for a one-layer, one-dimensional neural network and empirically demonstrate that this structure holds for five different example problems. Under this structure, our sketching algorithm is guaranteed to provide a near-optimal low-rank approximation of the Gramian. Our algorithm dramatically speeds up computation time and reduces memory overhead. Additionally, in our experiments, the sketched natural gradient outperforms the original natural gradient in terms of accuracy, often improving error by over an order of magnitude. Training time for a network with around 5,000 parameters is reduced from several hours to under two minutes. Training can be practically scaled to large network sizes; we optimize a PINN for a network with over a million parameters within a few minutes, a task for which the full Gram matrix does not fit in memory.
Paperid:1403
Authors:Zehong Wang · Zheyuan Zhang · Tianyi MA · Nitesh Chawla · Chuxu Zhang · Yanfang Ye
Abstract:
Graph learning tasks require models to comprehend essential substructure patterns relevant to downstream tasks, such as triadic closures in social networks and benzene rings in molecular graphs. Due to the non-Euclidean nature of graphs, existing graph neural networks (GNNs) rely on message passing to iteratively aggregate information from local neighborhoods. Despite their empirical success, message passing struggles to identify fundamental substructures, such as triangles, limiting its expressiveness. To overcome this limitation, we propose the Neural Graph Pattern Machine (GPM), a framework designed to learn directly from graph patterns. GPM efficiently extracts and encodes substructures while identifying the most relevant ones for downstream tasks. We also demonstrate that GPM offers superior expressivity and improved long-range information modeling compared to message passing. Empirical evaluations on node classification, link prediction, graph classification, and regression show the superiority of GPM over state-of-the-art baselines. Further analysis reveals its desirable out-of-distribution robustness, scalability, and interpretability. We consider GPM to be a step toward going beyond message passing.
Paperid:1404
Authors:Muru Zhang · Mayank Mishra · Zhongzhu Zhou · William Brandon · Jue Wang · Yoon Kim · Jonathan Ragan-Kelley · Shuaiwen Song · Ben Athiwaratkun · Tri Dao
Abstract:
Large language model inference is both memory-intensive and time-consuming, often requiring distributed algorithms to scale efficiently. Various model parallelism strategies are used in multi-GPU training and inference to partition computation across multiple devices, reducing memory load and computation time. However, model parallelism necessitates communication of information between GPUs, which has been a major bottleneck and limits the gains obtained by scaling up the number of devices. We introduce Ladder Residual, a simple architectural modification applicable to all residual-based models that enables straightforward overlapping to effectively hide the latency of communication. Our insight is that in addition to systems optimization, one can also redesign the model architecture to decouple communication from computation. While Ladder Residual can allow communication-computation decoupling in conventional parallelism patterns, we focus on Tensor Parallelism in this paper, which is particularly bottlenecked by its heavy communication. For a Transformer model with 70B parameters, applying Ladder Residual to all its layers achieves a 29% end-to-end wall-clock speedup at inference time with TP sharding over 8 devices. We refer to the resulting model as the Ladder Transformer. We train 1B and 3B Ladder Transformers from scratch and observe performance comparable to a standard dense transformer baseline. We also show that it is possible to convert parts of the Llama-3.1 8B model to our Ladder Residual architecture with minimal accuracy degradation by retraining on only 3B tokens.
Paperid:1405
Authors:Alexander Moreno · Justin Xiao · Jonathan Mei
Abstract:
Structured Kernel Interpolation (SKI) (Wilson et al. 2015) helps scale Gaussian Processes (GPs) by approximating the kernel matrix via interpolation at inducing points, achieving linear computational complexity. However, it lacks rigorous theoretical error analysis. This paper bridges the gap: we prove error bounds for the SKI Gram matrix and examine the error's effect on hyperparameter estimation and posterior inference. We further provide a practical guide to selecting the number of inducing points under convolutional cubic interpolation: they should grow as $n^{d/3}$ for error control. Crucially, we identify two dimensionality regimes governing the trade-off between SKI Gram matrix spectral norm error and computational complexity. For $d \leq 3$, *any* error tolerance can achieve linear time for sufficiently large sample size. For $d > 3$, the error must *increase* with sample size to maintain linear time. Our analysis provides key insights into SKI's scalability-accuracy trade-offs, establishing precise conditions for achieving linear-time GP inference with controlled approximation error.
Paperid:1406
Authors:Baohong Li · Yingrong Wang · Anpeng Wu · ma ming · Ruoxuan Xiong · Kun Kuang
Abstract:
Generalizing treatment effects from Randomized Controlled Trials (RCTs) across different environments holds significant practical value due to their long duration and high cost. A common challenge in doing so is environmental shift, such as shifts in the distribution and quantity of covariates. A common solution is to combine and match RCT data from the source population with observational data from the target population using a separating set. However, this solution relies on the assumption that the covariates shared by both groups contain the separating set, which is difficult to satisfy in real-world scenarios under experimental shifts. In this paper, we propose a novel Two-Stage Doubly Robust (2SDR) solution that relaxes this assumption: 2SDR only assumes that variables from the separating set are present in at least one of the two data groups. It leverages shadow variables to impute the missing covariates, generalizing treatment effects from RCTs across environments in two stages. Our solution is theoretically guaranteed by identification theory, and its effectiveness is supported by extensive experimental evidence on both synthetic and real-world datasets.
Paperid:1407
Authors:Xiaoli Tang · Han Yu · Zengxiang Li · Xiaoxiao Li
Abstract:
Auction-based Federated Learning (AFL) has emerged as an important research field in recent years. The prevailing strategies for FL data consumers (DCs) assume that the entire team of the required data owners (DOs) for an FL task must be assembled before training can commence. In practice, a DC can trigger the FL training process multiple times. DOs can thus be gradually recruited over multiple FL model training sessions. Existing bidding strategies for AFL DCs are not designed to handle such scenarios. Therefore, the problem of multi-session AFL remains open. To address this problem, we propose the Multi-session Budget Optimization Strategy for forward Auction-based Federated Learning (MBOS-AFL). Based on hierarchical reinforcement learning, MBOS-AFL jointly optimizes inter-session budget pacing and intra-session bidding for AFL DCs, with the objective of maximizing the total utility. Extensive experiments on six benchmark datasets show that it significantly outperforms seven state-of-the-art approaches. On average, MBOS-AFL achieves 12.28% higher utility, 14.52% more data acquired through auctions for a given budget, and 1.23% higher test accuracy achieved by the resulting FL model compared to the best baseline. To the best of our knowledge, it is the first budget optimization decision support method with budget pacing capability designed for DCs in multi-session forward auction-based FL.
Paperid:1408
Authors:Ziliang Chen · Zhao-Rong Lai · Yufeng Yang · Liangda Fang · ZHANFU YANG · Liang Lin
Abstract:
Despite advances in language model (LM) alignment, direct preference optimization (DPO) does not obtain strong reasoning for free. As a breakthrough, this work proposes a new preference optimization framework incorporating another LM policy, which holds asymptotic equivalence with AlphaZero-like search, the apex of complex reasoning. The preference optimization scheme discards any value or reward modeling, while the neural implicit tree search coupled with the DPO policy still enables AlphaZero-style LLM reasoning. Our empirical studies confirm that our method outperforms DPO variants in human preference alignment, and MCTS-based LMs in mathematical reasoning and planning tasks.
Paperid:1409
Authors:Maya Pavlova · Erik Brinkman · Krithika Iyer · Vítor Albiero · Joanna Bitton · Hailey Nguyen · Cristian Canton · Ivan Evtimov · Aaron Grattafiori
Abstract:
Red teaming aims to assess how large language models (LLMs) can produce content that violates norms, policies, and rules set forth during their safety training. However, most existing automated methods in the literature are not representative of the way common users exploit the multi-turn conversational nature of AI models. While manual testing addresses this gap, it is an inefficient and often expensive process. To address these limitations, we introduce the Generative Offensive Agent Tester (GOAT), an automated agentic red teaming system that simulates plain-language adversarial conversations while leveraging multiple adversarial prompting techniques to identify vulnerabilities in LLMs. We instantiate GOAT with 7 red teaming attacks by prompting a general-purpose model in a way that encourages reasoning through the choices of methods available, the current target model’s response, and the next steps. Our approach is designed to be extensible and efficient, allowing human testers to focus on exploring new areas of risk while automation covers the scaled adversarial stress-testing of known risk territory. We present the design and evaluation of GOAT, demonstrating its effectiveness in identifying vulnerabilities in state-of-the-art LLMs on the JailbreakBench dataset, with an ASR@10 of 96% against smaller models such as Llama 3.1 8B, 91% against Llama 3.1 70B, and 94% against GPT-4o.
Paperid:1410
Authors:Xiang Zhang · Run He · Chen Jiao · Di Fang · Ming Li · Ziqian Zeng · Cen Chen · HUIPING ZHUANG
Abstract:
Class-incremental learning (CIL) enables models to learn new classes continually without forgetting previously acquired knowledge. Multi-label CIL (MLCIL) extends CIL to a real-world scenario where each sample may belong to multiple classes, introducing several challenges: label absence, which leads to incomplete historical information due to missing labels, and class imbalance, which biases the model toward majority classes. To address these challenges, we propose Label-Augmented Analytic Adaptation (L3A), an exemplar-free approach without storing past samples. L3A integrates two key modules. The pseudo-label (PL) module implements label augmentation by generating pseudo-labels for current phase samples, addressing the label absence problem. The weighted analytic classifier (WAC) derives a closed-form solution for neural networks. It introduces sample-specific weights to adaptively balance the class contribution and mitigate class imbalance. Experiments on MS-COCO and PASCAL VOC datasets demonstrate that L3A outperforms existing methods in MLCIL tasks.
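The general shape of a weighted analytic (closed-form least-squares) classifier, of the kind the abstract's WAC module builds on, can be sketched as follows; the linear form, weights, and regularizer here are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

def weighted_analytic_classifier(X, Y, w, lam=1e-3):
    """Closed-form weighted ridge classifier: solves
    argmin_W  sum_i w_i * ||Y_i - X_i W||^2 + lam * ||W||^2.
    X: (n, d) features, Y: (n, c) multi-hot labels, w: (n,) sample weights.
    Generic sketch of a weighted analytic solution, not the paper's WAC."""
    D = np.diag(w)
    # Normal equations with per-sample weights and ridge regularization.
    W = np.linalg.solve(X.T @ D @ X + lam * np.eye(X.shape[1]), X.T @ D @ Y)
    return W  # predict scores with X_new @ W
```

Down-weighting majority-class samples via `w` is one simple way such a closed-form solution can counter class imbalance.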
Paperid:1411
Authors:Abhishek Tyagi · Arjun Iyer · William Renninger · Christopher Kanan · Yuhao Zhu
Abstract:
Recent advances in Dynamic Sparse Training (DST) have pushed the frontier of sparse neural network training in structured and unstructured contexts, matching dense-model performance while drastically reducing parameter counts to facilitate model scaling. However, unstructured sparsity often fails to translate into practical speedups on modern hardware. To address this shortcoming, we propose DynaDiag, a novel structured sparse-to-sparse DST method that performs on par with unstructured sparsity. DynaDiag enforces a diagonal sparsity pattern throughout training and preserves sparse computation in the forward and backward passes. We further leverage the diagonal structure to accelerate computation via a custom CUDA kernel, rendering the method hardware-friendly. Empirical evaluations on diverse neural architectures demonstrate that our method maintains accuracy on par with unstructured counterparts while benefiting from tangible computational gains. Notably, with 90\% sparse linear layers in ViTs, we observe up to a 3.13x speedup in online inference without sacrificing model performance and a 1.59x speedup in training on a GPU compared to equivalent unstructured layers.
Paperid:1412
Authors:Cassidy Laidlaw · Eli Bronstein · Timothy Guo · Dylan Feng · Lukas Berglund · Justin Svegliato · Stuart Russell · Anca Dragan
Abstract:
Assistance games are a promising alternative to reinforcement learning from human feedback (RLHF) for training AI assistants. Assistance games resolve key drawbacks of RLHF, such as incentives for deceptive behavior, by explicitly modeling the interaction between assistant and user as a two-player game where the assistant cannot observe their shared goal. Despite their potential, assistance games have only been explored in simple settings. Scaling them to more complex environments is difficult because it requires both solving intractable decision-making problems under uncertainty and accurately modeling human users' behavior. We present the first scalable approach to solving assistance games and apply it to a new, challenging Minecraft-based assistance game with over $10^{400}$ possible goals. Our approach, AssistanceZero, extends AlphaZero with a neural network that predicts human actions and rewards, enabling it to plan under uncertainty. We show that AssistanceZero outperforms model-free RL algorithms and imitation learning in the Minecraft-based assistance game. In a human study, our AssistanceZero-trained assistant significantly reduces the number of actions participants take to complete building tasks in Minecraft. Our results suggest that assistance games are a tractable framework for training effective AI assistants in complex environments. Code and videos are available at https://anonymous.4open.science/w/scalably-solving-assistance-games/.
Paperid:1413
Authors:Deming Sheng · Ricardo Henao
Abstract:
Probabilistic survival analysis models seek to estimate the distribution of the future occurrence (time) of an event given a set of covariates. In recent years, these models have preferred nonparametric specifications that avoid directly estimating survival distributions via discretization. Specifically, they estimate the probability of an individual event at fixed times or the time of an event at fixed probabilities (quantiles), using supervised learning. Borrowing ideas from the quantile regression literature, we propose a parametric survival analysis method based on the Asymmetric Laplace Distribution (ALD). This distribution allows for closed-form calculation of popular event summaries such as mean, median, mode, variance, and quantiles. The model is optimized by maximum likelihood to learn, at the individual level, the parameters (location, scale, and asymmetry) of the ALD. Extensive results on synthetic and real-world data demonstrate that the proposed method outperforms parametric and nonparametric approaches in terms of accuracy, discrimination, and calibration.
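The closed-form summaries the abstract mentions can be illustrated with the standard quantile parameterization of the ALD (location mu, scale sigma, asymmetry p); this is a textbook form and the paper's exact parameterization may differ:

```python
import math

def ald_quantile(q, mu, sigma, p):
    """Closed-form quantile function of the Asymmetric Laplace Distribution
    in the quantile parameterization: F(mu) = p, so p is the asymmetry level.
    Textbook form; illustrative, not necessarily the paper's parameterization."""
    if q <= p:
        return mu + (sigma / (1 - p)) * math.log(q / p)
    return mu - (sigma / p) * math.log((1 - q) / (1 - p))

def ald_mean(mu, sigma, p):
    """Closed-form mean of the same ALD parameterization."""
    return mu + sigma * (1 - 2 * p) / (p * (1 - p))
```

Setting q equal to the asymmetry p recovers the location mu, and p = 0.5 recovers the symmetric Laplace with mean mu.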
Paperid:1414
Authors:Fabian Schaipp · Alexander Hägele · Adrien Taylor · Umut Simsekli · Francis Bach
Abstract:
We show that learning-rate schedules for large model training behave surprisingly similarly to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant schedule with linear cooldown; in particular, the practical benefit of cooldown is reflected in the bound due to the absence of logarithmic terms. Further, we show that this surprisingly close match between optimization theory and practice can be exploited for learning-rate tuning: we achieve noticeable improvements for training 124M and 210M Llama-type models by (i) extending the schedule for continued training with an optimal learning rate, and (ii) transferring the optimal learning rate across schedules.
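The schedule shape analyzed here, constant followed by linear cooldown, can be sketched in a few lines; the cooldown fraction below is an illustrative choice, not a value from the paper:

```python
def constant_with_linear_cooldown(step, total_steps, base_lr, cooldown_frac=0.2):
    """Constant learning rate for the first (1 - cooldown_frac) of training,
    then a linear decay to zero. Minimal sketch of the schedule shape the
    abstract analyzes; cooldown_frac is an illustrative default."""
    cooldown_start = int(total_steps * (1 - cooldown_frac))
    if step < cooldown_start:
        return base_lr
    # Linear interpolation from base_lr at cooldown_start down to 0 at the end.
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - cooldown_start)
```

Such a schedule keeps the peak rate for most of training and only pays the cooldown cost at the end, which is the regime the bound in the paper speaks to.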
Paperid:1415
Authors:Siwei Xia · Li Sun · Tiantian Sun · Qingli Li
Abstract:
Drag-based editing within a pretrained diffusion model provides a precise and flexible way to manipulate foreground objects. Traditional methods directly optimize the input features obtained from DDIM inversion, adjusting them iteratively to guide handle points towards target locations. However, these approaches often suffer from limited accuracy due to the low representation ability of the features in motion supervision, as well as inefficiencies caused by the large search space required for point tracking. To address these limitations, we present DragLoRA, a novel framework that integrates LoRA (Low-Rank Adaptation) adapters into the drag-based editing pipeline. DragLoRA enables online optimization during motion supervision and utilizes its output for handle point tracking. To enhance the training of LoRA adapters, we introduce an additional denoising score distillation loss which regularizes the online model by aligning its output with that of the original model. Additionally, we improve the consistency of motion supervision by adapting the input features using the updated LoRA, giving a more stable and accurate input feature for subsequent operations. Building on this, we design an adaptive optimization scheme that dynamically toggles between two modes, prioritizing efficiency without compromising precision. Extensive experiments demonstrate that DragLoRA significantly enhances the control precision and computational efficiency for drag-based image editing.
Paperid:1416
Authors:Keyue Qiu · Yuxuan Song · Zhehuan Fan · Peidong Liu · Zhe Zhang · Mingyue Zheng · Hao Zhou · Wei-Ying Ma
Abstract:
Structure-Based Drug Design (SBDD) is crucial for identifying bioactive molecules. Recent deep generative models face challenges in geometric structure modeling. A major bottleneck lies in the twisted probability path of multiple modalities (continuous 3D positions and discrete 2D topologies), which jointly determine molecular geometries. By establishing the fact that noise schedules decide the Variational Lower Bound (VLB) for the twisted probability path, we propose a VLB-Optimal Scheduling (VOS) strategy in this under-explored area, which optimizes the VLB as a path integral for SBDD. Our model effectively enhances molecular geometries and interaction modeling, achieving a state-of-the-art PoseBusters passing rate of 95.9\% on CrossDock, a more than 10\% improvement over strong baselines, while maintaining high affinities and robust intramolecular validity on the held-out test set.
Paperid:1417
Authors:Siru Zhong · Weilin Ruan · Ming Jin · Huan Li · Qingsong Wen · Yuxuan Liang
Abstract:
Recent advancements in time series forecasting have explored augmenting models with text or vision modalities to improve accuracy. While text provides contextual understanding, it often lacks fine-grained temporal details. Conversely, vision captures intricate temporal patterns but lacks semantic context, limiting the complementary potential of these modalities. To address this, we propose Time-VLM, a novel multimodal framework that leverages pre-trained Vision-Language Models (VLMs) to bridge temporal, visual, and textual modalities for enhanced forecasting. Our framework comprises three key components: (1) a Retrieval-Augmented Learner, which extracts enriched temporal features through memory bank interactions; (2) a Vision-Augmented Learner, which encodes time series as informative images; and (3) a Text-Augmented Learner, which generates contextual textual descriptions. These components collaborate with frozen pre-trained VLMs to produce multimodal embeddings, which are then fused with temporal features for final prediction. Extensive experiments across diverse datasets demonstrate that Time-VLM achieves superior performance, particularly in few-shot and zero-shot scenarios, thereby establishing a new direction for multimodal time series forecasting.
Paperid:1418
Authors:Fangyikang Wang · Hubery Yin · Shaobin Zhuang · Huminhao Zhu · Yinan Li · Lei Qian · Chao Zhang · Hanbin Zhao · Hui Qian · Chen Li
Abstract:
Recent advancements in diffusion models (DMs) have explored incorporating the second-order diffusion Fisher information (DF), defined as the negative Hessian of the log density, into various downstream tasks and theoretical analysis. However, current practices typically approximate the diffusion Fisher by applying auto-differentiation to the learned score network. This black-box method, though straightforward, lacks any accuracy guarantee and is time-consuming. In this paper, we show that the diffusion Fisher actually resides within a space spanned by the outer products of score and initial data. Based on the outer-product structure, we develop two efficient approximation algorithms to access the trace and matrix-vector multiplication of DF, respectively. These algorithms bypass the auto-differentiation operations with time-efficient vector-product calculations. Furthermore, we establish approximation error bounds for the proposed algorithms. Experiments in likelihood evaluation and adjoint optimization demonstrate the superior accuracy and reduced computational cost of our proposed algorithms. Additionally, based on the novel outer-product formulation of DF, we design the first numerical verification experiment for the optimal transport property of the general PF-ODE-deduced map.
Paperid:1419
Authors:Wang Hanting · Tao Jin · Wang Lin · Shulei Wang · Hai Huang · Shengpeng Ji · Zhou Zhao
Abstract:
Bridge models in image restoration construct a diffusion process from degraded to clear images. However, existing methods typically require training a bridge model from scratch for each specific type of degradation, resulting in high computational costs and limited performance. This work aims to efficiently leverage pretrained generative priors within existing image restoration bridges to eliminate this requirement. The main challenge is that standard generative models are typically designed for a diffusion process that starts from pure noise, while restoration tasks begin with a low-quality image, resulting in a mismatch in the state distributions between the two processes. To address this challenge, we propose a transition equation that bridges two diffusion processes with the same endpoint distribution. Based on this, we introduce the IRBridge framework, which enables the direct utilization of generative models within image restoration bridges, offering a more flexible and adaptable approach to image restoration. Extensive experiments on six image restoration tasks demonstrate that IRBridge efficiently integrates generative priors, resulting in improved robustness and generalization performance. Code will be available on GitHub.
Paperid:1420
Authors:Jiayu Zhang · Xinyi Wang · Zhibo Jin · Zhiyu Zhu · Jianlong Zhou · Fang Chen · Huaming Chen
Abstract:
Out-of-distribution (OOD) detection is essential for enhancing the robustness and security of deep learning models in unknown and dynamic data environments. Gradient-based OOD detection methods, such as GAIA, analyse the explanation pattern representations of in-distribution (ID) and OOD samples by examining the sensitivity of model outputs w.r.t. model inputs, resulting in superior performance compared to traditional OOD detection methods. However, we argue that the non-zero gradient behaviors of OOD samples do not exhibit significant distinguishability, especially when ID samples are perturbed by random noise in high-dimensional spaces, which negatively impacts the accuracy of OOD detection. In this paper, we propose a novel OOD detection method called S\&I, based on layer Splitting and gradient Integration via Adversarial Gradient Attribution. Specifically, our approach involves splitting the model's intermediate layers and iteratively updating adversarial examples layer by layer. We then integrate the attribution gradients from each intermediate layer along the attribution path from adversarial examples to the actual input, yielding true explanation pattern representations for both ID and OOD samples. Experiments demonstrate that our S\&I algorithm achieves state-of-the-art results, with average FPR95 values of 29.05% (38.61%) and 37.31% on the CIFAR100 and ImageNet benchmarks, respectively. Our code is available at: https://anonymous.4open.science/r/S-I-F6F7/
Paperid:1421
Authors:Zichao Li · Xueru Wen · Jie Lou · Yuqiu Ji · Yaojie Lu · Xianpei Han · Debing Zhang · Le Sun
Abstract:
Multimodal Reward Models (MM-RMs) are crucial for aligning Large Language Models (LLMs) with human preferences, particularly as LLMs increasingly interact with multimodal data. However, we find that MM-RMs trained on existing datasets often struggle to generalize to out-of-distribution data due to their reliance on unimodal spurious correlations, primarily text-only shortcuts within the training distribution, which prevents them from leveraging true multimodal reward functions. To address this, we introduce a Shortcut-aware MM-RM learning algorithm that mitigates this issue by dynamically reweighting training samples, shifting the distribution toward better multimodal understanding, and reducing dependence on unimodal spurious correlations. Our experiments demonstrate significant improvements in generalization, downstream task performance, and scalability, establishing a more robust framework for multimodal reward modeling.
Paperid:1422
Authors:Toufique Ahmed · Jatin Ganhotra · Rangeet Pan · Avraham Shinnar · Saurabh Sinha · Martin Hirzel
Abstract:
While there has been plenty of work on generating tests from existing code, there has been limited work on generating tests from issues. A correct test must validate the code patch that resolves the issue. In this work, we focus on the scenario where the code patch does not exist yet. This approach supports two major use cases. First, it supports TDD (test-driven development), the discipline of “test first, write code later” that has well-documented benefits for human software engineers. Second, it also validates SWE (software engineering) agents, which generate code patches for resolving issues. This paper introduces Otter, an LLM-based solution for generating tests from issues. Otter augments LLMs with rule-based analysis to check and repair their outputs, and introduces a novel self-reflective action planning stage. Experiments show Otter outperforming state-of-the-art systems for generating tests from issues, in addition to enhancing systems that generate patches from issues. We hope that Otter helps make developers more productive at resolving issues and leads to more robust, well-tested code.
Paperid:1423
Authors:Botao Chen · Jongyeong Lee · Junya Honda
Abstract:
This paper studies the complexity and optimality of the Follow-the-Perturbed-Leader (FTPL) policy in $K$-armed bandit problems. FTPL is a promising policy that achieves the Best-of-Both-Worlds (BOBW) guarantee without solving an optimization problem, unlike Follow-the-Regularized-Leader (FTRL). However, FTPL needs a procedure called geometric resampling to estimate the loss, which requires $O(K^2)$ per-round average complexity, usually worse than that of FTRL. To address this issue, we propose a novel technique, which we call Conditional Geometric Resampling (CGR), for unbiased loss estimation applicable to general perturbation distributions. CGR reduces the average complexity to $O(K\log K)$ without sacrificing the regret bounds. We also propose a biased version of CGR that can control the worst-case complexity while keeping the BOBW guarantee for a certain perturbation distribution. We confirm through experiments that CGR not only significantly improves the average and worst-case runtime but also achieves better regret thanks to the stable loss estimation.
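The baseline procedure the abstract improves on, plain geometric resampling for FTPL, can be sketched as follows (the standard estimator, not the paper's conditional variant; `sample_arm` stands in for redrawing the perturbed leader):

```python
def geometric_resampling(chosen_arm, observed_loss, sample_arm, max_tries=10000):
    """Standard geometric resampling for FTPL loss estimation: redraw the
    policy's arm until it matches the played arm. The number of draws M is a
    geometric random variable with mean 1/p(chosen_arm), so M * observed_loss
    is an unbiased loss estimate (up to the max_tries truncation). This is the
    O(K) baseline per resample, not the paper's CGR."""
    m = 1
    while sample_arm() != chosen_arm and m < max_tries:
        m += 1
    return m * observed_loss
```

Each call to `sample_arm` simulates one perturbed-leader draw; the expected number of such draws per round is what CGR reduces.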
Paperid:1424
Authors:Divyat Mahajan · Mohammad Pezeshki · Charles Arnal · Ioannis Mitliagkas · Kartik Ahuja · Pascal Vincent
Abstract:
Compositional generalization is a crucial step towards developing data-efficient intelligent machines that generalize in human-like ways. In this work, we tackle a challenging form of distribution shift, termed compositional shift, where some attribute combinations are completely absent at training but present in the test distribution. This shift tests the model's ability to generalize compositionally to novel attribute combinations in discriminative tasks. We model the data with flexible additive energy distributions, where each energy term represents an attribute, and derive a simple alternative to empirical risk minimization termed compositional risk minimization (CRM). We first train an additive energy classifier to predict the multiple attributes and then adjust this classifier to tackle compositional shifts. We provide an extensive theoretical analysis of CRM, where we show that our proposal extrapolates to special affine hulls of seen attribute combinations. Empirical evaluations on benchmark datasets confirm the improved robustness of CRM compared to other methods from the literature designed to tackle various forms of subpopulation shifts.
Paperid:1425
Authors:Chenze Shao · Fandong Meng · Jie Zhou
Abstract:
Conventional wisdom suggests that autoregressive models are used to process discrete data. When applied to continuous modalities such as visual data, Visual AutoRegressive modeling (VAR) typically resorts to quantization-based approaches to cast the data into a discrete space, which can introduce significant information loss. To tackle this issue, we introduce a Continuous VAR framework that enables direct visual autoregressive generation without vector quantization. The underlying theoretical foundation is strictly proper scoring rules, which provide powerful statistical tools capable of evaluating how well a generative model approximates the true distribution. Within this framework, all we need is to select a strictly proper score and set it as the training objective to optimize. We primarily explore a class of training objectives based on the energy score, which is likelihood-free and thus overcomes the difficulty of making probabilistic predictions in the continuous space. Previous efforts on continuous autoregressive generation, such as GIVT and diffusion loss, can also be derived from our framework using other strictly proper scores. We will release the source code upon publication.
Paperid:1426
Authors:Matei Gabriel Cosa · Marek Elias
Abstract:
Combining algorithms is one of the key techniques in learning-augmented algorithms. We consider the following problem: We are given $\ell$ heuristics for Metrical Task Systems (MTS), where each might be tailored to a different type of input instances. While processing an input instance received online, we are allowed to query the action of only one of the heuristics at each time step. Our goal is to achieve performance comparable to the best of the given heuristics. The main difficulty of our setting comes from the fact that the cost paid by a heuristic at time $t$ cannot be estimated unless the same heuristic was also queried at time $t-1$. This is related to Bandit Learning against memory-bounded adversaries (Arora et al., 2012). We show how to achieve regret of $O(\text{OPT}^{2/3})$ and prove a tight lower bound based on the construction of Dekel et al. (2013).
Paperid:1427
Authors:Nghiem Diep · Huy Nguyen · Chau Nguyen · Minh Le · Duy Nguyen · Daniel Sonntag · Mathias Niepert · Nhat Ho
Abstract:
LLaMA-Adapter has recently emerged as an efficient fine-tuning technique for LLaMA models, leveraging zero-initialized attention to stabilize training and enhance performance. However, despite its empirical success, the theoretical foundations of zero-initialized attention remain largely unexplored. In this paper, we provide a rigorous theoretical analysis, establishing a connection between zero-initialized attention and mixture-of-expert models. We prove that both linear and non-linear prompts, along with gating functions, can be optimally estimated, with non-linear prompts offering greater flexibility for future applications. Empirically, we validate our findings on the open LLM benchmarks, demonstrating that non-linear prompts outperform linear ones. Notably, even with limited training data, both prompt types consistently surpass vanilla attention, highlighting the robustness and adaptability of zero-initialized attention.
Paperid:1428
Authors:Ekin Akyürek · Mehul Damani · Adam Zweiger · Linlu Qiu · Han Guo · Jyothish Pari · Yoon Kim · Jacob Andreas
Abstract:
Language models (LMs) have shown impressive performance on tasks within their training distribution, but often struggle with structurally novel tasks even when given a small number of in-context task examples. We investigate the effectiveness of test-time training (TTT)—temporarily updating model parameters during inference using a loss derived from input data—as a mechanism for improving LMs' reasoning and few-shot learning capabilities. On the Abstraction and Reasoning Corpus (ARC), performing TTT with in-context examples yields up to $6\times$ higher accuracy compared to fine-tuned baselines—reaching $53.0\%$ on the public validation set with an 8B-parameter LM and $61.9\%$ when ensembled with program-synthesis methods, matching average human performance. On BIG-Bench Hard (BBH), TTT on in-context examples surpasses standard few-shot prompting in the $10$-shot setting by $7.3$ percentage points ($50.5\%$ to $57.8\%$). Our findings highlight the limitations of in-context learning for novel tasks and demonstrate the potential of test-time training to enhance language model adaptability.
Paperid:1429
Authors:Tao · Darshil Doshi · Dayal Singh Kalra · Tianyu He · Maissam Barkeshli
Abstract:
Transformers excel at discovering patterns in sequential data, but understanding their limits and how they learn remains a crucial topic. In this paper we investigate the ability of Transformers to learn pseudorandom number sequences from linear congruential generators (LCGs), defined by the recurrence relation $x_{n+1} = a x_n + c \;\mathrm{mod}\; m$. Our analysis reveals that with sufficient architectural depth, context length, and training data variety, Transformers can predict LCG sequences with unseen moduli ($m$) and parameters ($a,c$). We demonstrate successful learning up to $m = 2^{32}$, and we perform systematic analyses of the effect of dataset complexity, architecture size, and context lengths. We uncover emergent structures in the embedding layers and attention heads, which we use to reverse engineer the underlying algorithms used by the Transformer. We first consider training and testing on a single modulus, which reveals that models learn to factorize the modulus and subsequently utilize digit-wise number representations to make sequential predictions. When generalizing to unseen moduli, the models employ a greedy approach to estimate the unknown modulus from the context and then use prime factorizations to generate predictions. In the latter case, we observe a transition at a critical depth $d = 3$, below which the Transformer is incapable of learning. We also find that the number of in-context sequence elements needed to reach high accuracy scales sublinearly with the period.
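The LCG recurrence above is easy to reproduce for readers who want to generate such sequences themselves; a minimal sketch (the specific parameter values are the classic Numerical Recipes choice, used here for illustration and not taken from the paper):

```python
def lcg_sequence(a, c, m, x0, length):
    """Generate `length` terms of the LCG x_{n+1} = (a * x_n + c) mod m."""
    xs = [x0 % m]
    for _ in range(length - 1):
        xs.append((a * xs[-1] + c) % m)
    return xs

# Classic Numerical Recipes parameters with m = 2^32, the largest
# modulus for which the paper reports successful learning.
print(lcg_sequence(a=1664525, c=1013904223, m=2**32, x0=42, length=5))
```

Sequences like this, over many random $(a, c, m, x_0)$, form the in-context training and evaluation data described in the abstract.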
Paperid:1430
Authors:Duc Anh Nguyen · Ernesto Araya · Adalbert Fono · Gitta Kutyniok
Abstract:
Recent years have seen significant progress in developing spiking neural networks (SNNs) as a potential solution to the energy challenges posed by conventional artificial neural networks (ANNs). However, our theoretical understanding of SNNs remains relatively limited compared to the ever-growing body of literature on ANNs. In this paper, we study a discrete-time model of SNNs based on leaky integrate-and-fire (LIF) neurons, referred to as discrete-time LIF-SNNs, a widely used framework that still lacks solid theoretical foundations. We demonstrate that discrete-time LIF-SNNs realize piecewise constant functions defined on polyhedral regions, and more importantly, we quantify the network size required to approximate continuous functions. Moreover, we investigate the impact of latency (number of time steps) and depth (number of layers) on the complexity of the input space partitioning induced by discrete-time LIF-SNNs. Our analysis highlights the importance of latency and contrasts these networks with ANNs that use piecewise linear activation functions. Finally, we present numerical experiments to support our theoretical findings.
Paperid:1431
Authors:Hongzhan Chen · Tao Yang · Shiping Gao · Ruijun Chen · Xiaojun Quan · Hongtao Tian · Ting Yao
Abstract:
Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs) for optimizing policy models, positioning them as a promising approach to enhancing the capabilities of LLMs in complex reasoning tasks. Recent efforts have advanced PRMs from step-level to token-level granularity by integrating reward modeling into the training of generative models, with reward scores derived from token generation probabilities. However, the conflict between generative language modeling and reward modeling may introduce instability and lead to inaccurate credit assignments. To address this challenge, we revisit token-level reward assignment by decoupling reward modeling from language generation and derive a token-level reward model through the optimization of a discriminative policy, termed the Q-function Reward Model (Q-RM). We theoretically demonstrate that Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations. In our experiments, Q-RM consistently outperforms all baseline methods across various benchmarks. For example, when integrated into PPO/REINFORCE algorithms, Q-RM enhances the average Pass@1 score by 5.85/4.70 points on mathematical reasoning tasks compared to the ORM baseline, and by 4.56/5.73 points compared to the token-level PRM counterpart. Moreover, reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12× faster than ORM on GSM8K and 11× faster than step-level PRM on MATH. Code and data are included in the supplementary material and will be made publicly available.
Paperid:1432
Authors:Shiwei Li · Xiandi Luo · Haozhao Wang · Xing Tang · Shijie Xu · weihongluo · Yuhua Li · xiuqiang He · Ruixuan Li
Abstract:
To improve the training efficiency of federated learning (FL), previous research has employed low-rank decomposition techniques to reduce communication overhead. In this paper, we seek to enhance the performance of these low-rank decomposition methods. Specifically, we focus on three key issues related to decomposition in FL: what to decompose, how to decompose, and how to aggregate. Subsequently, we introduce three novel techniques: Model Updates Decomposition, Block-wise Kronecker Decomposition, and Aggregation-Aware Decomposition, each targeting a specific issue. These techniques are complementary and can be applied simultaneously to achieve optimal performance. Additionally, we provide a rigorous theoretical analysis to ensure the convergence of the proposed method. Extensive experimental results show that our approach achieves faster convergence and superior accuracy compared to relevant baseline methods.
Paperid:1433
Authors:Qiufu Li · Huibin Xiao · Linlin Shen
Abstract:
When training classification models, it is expected that the learned features are compact within classes and can well separate different classes. As a dominant loss function for training classification models, the minimization of CE (cross-entropy) loss can maximize the compactness and distinctiveness, i.e., reaching neural collapse (NC). Recently published works show that BCE (binary CE) loss also performs well in multi-class tasks. In this paper, we compare BCE and CE in the context of deep feature learning. For the first time, we prove that BCE can also maximize the intra-class compactness and inter-class distinctiveness when reaching its minimum, i.e., leading to NC. We point out that CE measures the relative values of decision scores in model training, implicitly enhancing the feature properties by classifying samples one by one. In contrast, BCE measures the absolute values of decision scores and adjusts the positive/negative decision scores across all samples to uniformly high/low levels. Meanwhile, the classifier biases in BCE place a substantial constraint on the samples' decision scores. Therefore, BCE explicitly enhances the feature properties during training. The experimental results align with the above analysis and show that BCE can improve classification performance and lead to better compactness and distinctiveness among sample features.
Paperid:1434
Authors:Haoquan Fang · Markus Grotz · Wilbert Pumacay · Yi Ru Wang · Dieter Fox · Ranjay Krishna · Jiafei Duan
Abstract:
Robotic manipulation systems operating in diverse, dynamic environments must exhibit three critical abilities: generalization to unseen scenarios, multi-task interaction, and spatial memory. While significant progress has been made in robotic manipulation, existing approaches often fall short in addressing memory-dependent tasks and generalization to complex environmental variations. To bridge this gap, we introduce SAM2Act, a multi-view robotic transformer that leverages multi-resolution upsampling and visual representations from large-scale foundation models. SAM2Act achieves a state-of-the-art average success rate of 86.8% across 18 tasks in the RLBench benchmark, and demonstrates robust generalization on The Colosseum benchmark, with only a 4.3% performance drop under diverse environmental perturbations. Building on this foundation, we propose SAM2Act+, a memory-augmented architecture inspired by SAM2, which incorporates a memory bank and attention mechanism for spatial memory. To address the need for evaluating memory-dependent tasks, we introduce MemoryBench, a novel benchmark designed to assess spatial memory and action recall in robotic manipulation. SAM2Act+ achieves competitive performance on MemoryBench, significantly outperforming existing approaches and pushing the boundaries of memory-enabled robotic systems. Anonymous project page: sam2act-icml.github.io.
Paperid:1435
Authors:Fan Zhou · Zengzhi Wang · Qian Liu · Junlong Li · Pengfei Liu
Abstract:
Large language model pretraining has traditionally relied on human experts to craft heuristics for improving corpus quality, resulting in numerous rules developed to date. However, these fixed rules lack the flexibility to address the unique characteristics of individual examples, yet crafting sample-wise rules is impractical for human experts. In this paper, we show that even small language models, with only 0.3B parameters, can exhibit substantial data refining capabilities. We propose Programming Every Example (ProX), a novel framework that treats data refinement as a programming task and enables the model to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experiments show that models trained on ProX-refined data consistently outperform other baselines across 10 benchmarks, demonstrating effectiveness across model sizes (up to 1.7B) and pre-training corpora (C4, RedPajama-V2, FineWeb, FineWeb-Edu, and DCLM). ProX also shows great potential in continual pre-training: in the math domain, ProX boosts 7B models by up to 20% within 10B tokens—results typically achieved with much larger scale training (e.g., 200B tokens). We believe ProX offers a way to curate high-quality pre-training data and ultimately contributes to efficient LLM development.
Paperid:1436
Authors:Zheng Zhou · Wenquan Feng · Qiaosheng Zhang · Shuchang Lyu · Qi Zhao · Guangliang Cheng
Abstract:
Dataset Distillation (DD) compresses large datasets into smaller, synthetic subsets, enabling models trained on them to achieve performance comparable to those trained on the full data. However, these models remain vulnerable to adversarial attacks, limiting their use in safetycritical applications. While adversarial robustness has been widely studied in other fields, research on improving DD robustness is still limited. To address this, we propose ROME, a novel method that enhances the adversarial RObustness of DD by leveraging the InforMation BottlenEck (IB) principle. ROME includes two components: a performance-aligned term for accuracy and a robustness-aligned term to improve resilience by aligning feature distributions between synthetic and perturbed images. We also introduce the Improved Robustness Ratio (I-RR) as a better metric for DD robustness. Extensive experiments on CIFAR-10 and CIFAR-100 datasets show that ROME outperforms existing DD methods in adversarial robustness, achieving up to a 40% improvement in I-RR under white-box attacks and up to 35% under black-box attacks on CIFAR-10. Our code is available at ROME.
Paperid:1437
Authors:Yupeng Qiu · Han Fang · Ee-Chien Chang
Abstract:
Deep learning-based watermarking models play a crucial role in copyright protection across various applications. However, many high-performance models are limited in practical deployment due to their large number of parameters. Meanwhile, the robustness and invisibility performance of existing lightweight models are unsatisfactory. This presents a pressing need for a watermarking model that combines lightweight capacity with satisfactory performance. Our research identifies a key reason that limits the performance of existing watermarking frameworks: a mismatch between commonly used decoding losses (e.g., mean squared error and binary cross-entropy loss) and the actual decoding goal, leading to parameter redundancy. We propose two innovative solutions: (1) Decoding-oriented surrogate loss (DO), which redesigns the loss function to mitigate the influence of decoding-irrelevant optimization directions; and (2) Detachable projection head (PH), which incorporates a detachable redundant module during training to handle these irrelevant directions and is discarded during inference. Additionally, we propose a novel watermarking framework comprising five submodules, allowing for independent parameter reduction in each component. Our proposed model achieves better efficiency, invisibility, and robustness while utilizing only 2.2\% of the parameters compared to the state-of-the-art frameworks. By improving efficiency while maintaining robust copyright protection, our model is well suited for practical applications in resource-constrained environments. The DO and PH methods are designed to be plug-and-play, facilitating seamless integration into future lightweight models.
Paperid:1438
Authors:Weiwei Ye · Zhuopeng Xu · Ning Gui
Abstract:
Due to the dynamicity of underlying physics and external influences, the uncertainty of time series often varies over time. However, existing Denoising Diffusion Probabilistic Models (DDPMs) often fail to capture this non-stationary nature, constrained by their constant variance assumption from the additive noise model (ANM). In this paper, we innovatively utilize the Location-Scale Noise Model (LSNM) to relax the fixed uncertainty assumption of ANM. A diffusion-based probabilistic forecasting framework, termed Non-stationary Diffusion (NsDiff), is designed based on LSNM that is capable of modeling the changing pattern of uncertainty. Specifically, NsDiff combines a denoising diffusion-based conditional generative model with a pre-trained conditional mean and variance estimator, enabling adaptive endpoint distribution modeling. Furthermore, we propose an uncertainty-aware noise schedule, which dynamically adjusts the noise levels to accurately reflect the data uncertainty at each step. This design integrates the time-varying variances into the diffusion process. Extensive experiments conducted on nine real-world and synthetic datasets demonstrate the superior performance of NsDiff compared to existing approaches. Code and notebooks are available at~\url{https://anonymous.4open.science/r/NsDiff}.
Paperid:1439
Authors:Jincheng Huang · Yujie Mo · Xiaoshuang Shi · Lei Feng · Xiaofeng Zhu
Abstract:
The message-passing mechanism of graph convolutional networks (GCNs) enables label information to be propagated to a broader range of neighbors, thereby increasing the utilization of labels. However, label information is not always effectively utilized in the traditional GCN framework. To address this issue, we propose a new two-step framework called ELU-GCN. In the first stage, ELU-GCN conducts graph learning to learn a new graph structure (i.e., the ELU-graph), which enables message passing to effectively utilize label information. In the second stage, we design a new graph contrastive learning method on the GCN framework for representation learning by exploring the consistency and mutually exclusive information between the learned ELU graph and the original graph. Moreover, we theoretically demonstrate that the proposed method can ensure the generalization ability of GCNs. Extensive experiments validate the superiority of the proposed method.
Paperid:1440
Authors:Saúl Santos · António Farinhas · Daniel McNamee · Andre Martins
Abstract:
Current video-language models struggle with long-video understanding due to limited context lengths and reliance on sparse frame subsampling, which often leads to information loss. In this paper, we introduce $\infty$-Video, which is able to process arbitrarily long videos through a continuous-time long-term memory (LTM) consolidation mechanism. Our framework augments video Q-formers by making them able to process unbounded video contexts efficiently and without requiring additional training. Through continuous attention, our approach dynamically allocates higher granularity to the most relevant video segments, forming "sticky" memories which evolve over time. Experiments with Video-LLaMA and VideoChat2 demonstrate improved performance in video question-answering tasks, showcasing the potential of continuous-time LTM mechanisms to enable scalable and training-free comprehension of long videos.
Paperid:1441
Authors:Wei Li · Lujun Li · Hao Gu · Youliang Huang · Mark Lee · Shengjie Sun · Wei Xue · Yike Guo
Abstract:
Mixture of Experts (MoE) architecture improves Large Language Models (LLMs) with better scaling, but its higher parameter counts and memory demands create challenges for deployment. In this paper, we present MoE-SVD, a new decomposition-based compression framework tailored for MoE LLMs without any extra training. By harnessing the power of Singular Value Decomposition (SVD), MoE-SVD addresses the critical issues of decomposition collapse and matrix redundancy in MoE architectures. Specifically, we first decompose experts into compact low-rank matrices, resulting in accelerated inference and memory optimization. In particular, we propose a selective decomposition strategy by measuring sensitivity metrics based on weight singular values and activation statistics to automatically identify decomposable expert layers. Then, we share a single V-matrix across all experts and employ a top-k selection for U-matrices. This low-rank matrix sharing and trimming scheme allows for significant parameter reduction while preserving diversity among experts. Comprehensive experiments on Mixtral, Phi-3.5, DeepSeek, and Qwen2 MoE LLMs show MoE-SVD outperforms other compression methods, achieving a 60\% compression ratio and 1.5× faster inference with minimal performance loss. Codes are available in the supplementary materials.
Paperid:1442
Authors:Zijian Cheng · 贾 子怡 · Zhi Zhou · Lan-Zhe Guo · Yu-Feng Li
Abstract:
Tabular data is widely utilized in various machine learning tasks. Current tabular learning research predominantly focuses on closed environments, while in real-world applications, open environments are often encountered, where distribution and feature shifts occur, leading to significant degradation in model performance. Previous research has primarily concentrated on mitigating distribution shifts, whereas feature shifts, a distinctive and unexplored challenge of tabular data, have garnered limited attention. To this end, this paper conducts the first comprehensive study on feature shifts in tabular data and introduces the first tabular feature-shift benchmark (TabFSBench). TabFSBench evaluates the impacts of four distinct feature-shift scenarios on four tabular model categories across various datasets and assesses the performance of large language models (LLMs) and tabular LLMs in the tabular benchmark for the first time. Our study demonstrates three main observations: (1) most tabular models have limited applicability in feature-shift scenarios; (2) the importance of the shifted feature set has a linear relationship with model performance degradation; (3) model performance in closed environments correlates with feature-shift performance. Future research directions are also explored for each observation. TabFSBench is released for public access by using a few lines of Python codes at https://anonymous.4open.science/r/TabFSBench-E1D3.
Paperid:1443
Authors:Shisong Tang · Chao · Fan Li · Jiechao Gao · Hechang Chen
Abstract:
Accurately predicting user watch-time is crucial for enhancing user stickiness and retention in video recommendation systems. Existing watch-time prediction approaches typically involve transformations of watch-time labels for prediction and subsequent reversal, ignoring both the natural distribution properties of labels and the \textit{instance representation confusion} that results in inaccurate predictions. In this paper, we propose ProWTP, a two-stage method combining prototype learning and optimal transport for watch-time regression prediction, suitable for any deep recommendation model. The core idea of ProWTP is to align the label distribution with the instance representation distribution to calibrate the instance space, thereby improving prediction accuracy. Specifically, we observe that the watch-ratio (the ratio of watch-time to video duration) within the same duration bucket exhibits a multimodal distribution. To facilitate incorporation into models, we use a hierarchical vector quantised variational autoencoder (HVQ-VAE) to convert the continuous label distribution into a high-dimensional discrete distribution, serving as credible prototypes for calibration. Based on this, ProWTP views the alignment between prototypes and instance representations as a Semi-relaxed Unbalanced Optimal Transport (SUOT) problem, where the marginal constraints of prototypes are relaxed, and the corresponding optimization problem is reformulated as a weighted Lasso problem for solution. Moreover, ProWTP introduces assignment and compactness losses to encourage instances to cluster closely around their respective prototypes, thereby enhancing prototype-level distinguishability. Finally, we conducted extensive offline experiments on two industrial datasets, demonstrating our consistent superiority in real-world applications.
Paperid:1444
Authors:Jun-Peng Jiang · Tao Zhou · De-Chuan Zhan · Han-Jia Ye
Abstract:
Multimodal Large Language Models (MLLMs) for tabular understanding have made significant progress in tasks such as financial report analysis and public data tests. However, our comprehensive analysis shows that these models are still limited in certain simple scenarios, particularly when handling compositional conditions in QA. Further investigation reveals that the poor performance can be attributed to two main challenges: the visual encoder's inability to accurately recognize the content of a row, and the model's tendency to overlook conditions in the question. To address these, we introduce a new Compositional Condition Tabular Understanding method, called CoCoTab. Specifically, to capture the structural relationships within tables, we enhance the visual encoder with additional row and column patches. Moreover, we introduce conditional tokens between the visual patches and query embeddings, ensuring the model focuses on relevant parts of the table according to the conditions specified in the query. Additionally, we introduce the Massive Multimodal Tabular Understanding (MMTU) benchmark, which comprehensively assesses the full capabilities of MLLMs in tabular understanding. Our proposed method achieves state-of-the-art performance on both existing tabular understanding benchmarks and MMTU.
Paperid:1445
Authors:Mahdi Sabbaghi · Paul Kassianik · George Pappas · Amin Karbasi · Hamed Hassani
Abstract:
As large language models (LLMs) are becoming more capable and widespread, the study of their failure cases is becoming increasingly important. Recent advances in standardizing, measuring, and scaling test-time compute suggest new methodologies for optimizing models to achieve high performance on hard tasks. In this paper, we apply these advances to the task of "model jailbreaking": eliciting harmful responses from aligned LLMs. We develop an adversarial reasoning approach to automatic jailbreaking via test-time computation that achieves SOTA attack success rates (ASR) against many aligned LLMs, even the ones that aim to trade inference-time compute for adversarial robustness. Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.
Paperid:1446
Authors:Qirui Yang · Peng-Tao Jiang · Hao Zhang · Jinwei Chen · Bo Li · Huanjing Yue · Jingyu Yang
Abstract:
Learning lighting adaptation is a crucial step in achieving good visual perception and supporting downstream vision tasks. Current research often addresses individual light-related challenges, such as high dynamic range imaging and exposure correction, in isolation. However, we identify shared fundamental properties across these tasks: i) different color channels have different light properties, and ii) the channel differences reflected in the spatial and frequency domains are different. Leveraging these insights, we introduce the channel-aware Learning Adaptive Lighting Network (LALNet), a unified framework designed to handle multiple light-related tasks efficiently. Specifically, LALNet incorporates color-separated features that highlight the unique light properties of each color channel, integrated with traditional color-mixed features by Light Guided Attention (LGA). The LGA utilizes color-separated features to guide color-mixed features focusing on channel differences and ensuring visual consistency across all channels. Additionally, LALNet employs dual domain channel modulation for generating color-separated features and a mixed channel modulation and light state space module for producing color-mixed features. Extensive experiments on four representative light-related tasks demonstrate that LALNet significantly outperforms state-of-the-art methods on benchmark tests and requires fewer computational resources. We provide an anonymous online demo at LALNet.
Paperid:1447
Authors:Zihan Liu · Shuangrui Ding · Zhixiong Zhang · Xiaoyi Dong · Pan Zhang · Yuhang Zang · Yuhang Cao · Dahua Lin · Jiaqi Wang
Abstract:
Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, resulting in cumbersome training and inference pipelines. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The proposed model facilitates fine-grained control over diverse musical attributes, including lyrics and textual descriptions of instrumentation, genre, mood, and timbre, while also offering an optional three-second reference clip for voice cloning. Within a unified auto-regressive framework, SongGen supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately for greater flexibility in downstream applications. We explore diverse token pattern strategies for each mode, leading to notable improvements and valuable insights. Furthermore, we design an automated data preprocessing pipeline with effective quality control. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline. We showcase the generated samples on our anonymous project page https://songgen66.github.io/demos/.
Paperid:1448
Authors:Shuangpeng Han · Mengmi Zhang
Abstract:
AI models make mistakes when recognizing images—whether in-domain, out-of-domain, or adversarial. Predicting these errors is critical for improving system reliability, reducing costly mistakes, and enabling proactive corrections in real-world applications such as healthcare, finance, and autonomous systems. However, understanding what mistakes AI models make, why they occur, and how to predict them remains an open challenge. Here, we conduct comprehensive empirical evaluations using a "mentor" model—a deep neural network designed to predict another "mentee" model's errors. Our findings show that the mentor excels at learning from a mentee's mistakes on adversarial images with small perturbations and generalizes effectively to predict in-domain and out-of-domain errors of the mentee. Additionally, transformer-based mentor models excel at predicting errors across various mentee architectures. Subsequently, we draw insights from these observations and develop an "oracle" mentor model, dubbed SuperMentor, that can outperform baseline mentors in predicting errors across different error types from the ImageNet-1K dataset. Our framework paves the way for future research on anticipating and correcting AI model behaviors, ultimately increasing trust in AI systems. All code, models, and data will be made publicly available.
Paperid:1449
Authors:Kuheli Pratihar · Debdeep Mukhopadhyay
Abstract:
Variational Autoencoders (VAEs), a class of latent-variable generative models, have seen extensive use in high-fidelity synthesis tasks, yet their loss landscape remains poorly understood. Prior theoretical works on VAE loss analysis have focused on their latent-space representational capabilities both in the optimal and limiting cases. Although these insights have guided better VAE designs, they also often restrict VAEs to problem settings where classical algorithms, such as Principal Component Analysis (PCA), can trivially guarantee globally optimal solutions. In this work, we push the boundaries of our understanding of VAEs beyond these traditional regimes to tackle NP-hard sparse inverse problems, for which no classical algorithms exist. Specifically, we examine the nontrivial Sparse Linear Regression (SLR) problem of recovering optimal sparse inputs in the presence of an ill-conditioned design matrix having correlated features. We provably show that, under a linear encoder and a decoder architecture incorporating the product of the SLR design matrix with a trainable, sparsity-promoting diagonal matrix, any minimum of VAE loss is guaranteed to be an optimal solution. This property is especially useful for identifying (a) a preconditioning factor that reduces the eigenvalue spread, and (b) the corresponding optimal sparse representation, albeit in settings where the design matrix can be preconditioned in polynomial time. Lastly, our empirical analysis with different types of design matrices validates these findings, and even demonstrates a higher recovery rate at low sparsity where traditional algorithms fail. Overall, this work highlights the flexible nature of the VAE loss, which can be adapted to efficiently solve computationally hard problems under specific constraints.
Paperid:1450
Authors:Sanghyeok Chu · Seonguk Seo · Bohyung Han
Abstract:
Recent advances in visual language models (VLMs) have significantly improved image captioning, but extending these gains to video understanding remains challenging due to the scarcity of fine-grained video captioning datasets. To bridge this gap, we propose a novel zero-shot video captioning approach that combines frame-level scene graphs from a video to obtain intermediate representations for caption generation. Our method first generates frame-level captions using an image VLM, converts them into scene graphs, and consolidates these graphs to produce comprehensive video-level descriptions. To achieve this, we leverage a lightweight graph-to-text model trained solely on text corpora, eliminating the need for video captioning annotations. Experiments on the MSR-VTT and ActivityNet Captions datasets show that our approach outperforms zero-shot video captioning baselines, demonstrating that aggregating frame-level scene graphs yields rich video understanding without requiring large-scale paired data or high inference cost.
Paperid:1451
Authors:Sibylle Marcotte · Rémi Gribonval · Gabriel Peyré
Abstract:
While conservation laws in gradient flow training dynamics are well understood for (mostly shallow) ReLU and linear networks, their study remains largely unexplored for more practical architectures. This paper bridges this gap by deriving and analyzing conservation laws for modern architectures, with a focus on convolutional ResNets and Transformer networks. For this, we first show that basic building blocks such as ReLU (or linear) shallow networks, with or without convolution, have easily expressed conservation laws, and no more than the known ones. In the case of a single attention layer, we also completely describe all conservation laws, and we show that residual blocks have the same conservation laws as the same block without a skip connection. In contrast, we show that some deeper networks can have more conservation laws than intuitively expected, which motivates the introduction of the notion of conservation laws that depend only on a subset of parameters (corresponding e.g. to a pair of consecutive layers, to a residual block, or to an attention layer). We demonstrate that the characterization of such laws can be reduced to the analysis of the corresponding building block in isolation. Finally, we examine how these newly discovered conservation principles, initially established in the continuous gradient flow regime, persist under discrete optimization dynamics, particularly in the context of Stochastic Gradient Descent (SGD).
Paperid:1452
Authors:Thomas T. Zhang · Behrad Moniri · Ansh Nagwekar · Faraz Rahman · Anton Xue · Hamed Hassani · Nikolai Matni
Abstract:
Layer-wise preconditioning methods are a family of memory-efficient optimization algorithms that introduce preconditioners per axis of each layer's weight tensors. These methods have seen a recent resurgence, demonstrating impressive performance relative to entry-wise ("diagonal") preconditioning methods such as Adam(W) on a wide range of neural network optimization tasks. Complementary to their practical performance, we demonstrate that layer-wise preconditioning methods are provably necessary from a statistical perspective. To showcase this, we consider two prototypical models, *linear representation learning* and *single-index learning*, which are widely used to study how typical algorithms efficiently learn useful *features* to enable generalization. In these problems, we show SGD is a suboptimal feature learner when extending beyond ideal isotropic inputs $\mathbf{x} \sim \mathsf{N}(\mathbf{0}, \mathbf{I})$ and well-conditioned settings typically assumed in prior work. We demonstrate theoretically and numerically that this suboptimality is fundamental, and that layer-wise preconditioning emerges naturally as the solution. We further show that standard tools like Adam preconditioning and batch-norm only mildly mitigate these issues, supporting the unique benefits of layer-wise preconditioning.
Paperid:1453
Authors:Mina Dalirrooyfard · Konstantin Makarychev · Slobodan Mitrovic
Abstract:
We present a new Correlation Clustering algorithm for a dynamic setting where nodes are added one at a time. In this model, proposed by Cohen-Addad, Lattanzi, Maggiori, and Parotsidis (ICML 2024), the algorithm uses database queries to access the input graph and updates the clustering as each new node is added. Our algorithm has an amortized update time of $\log^{O(1)}(n)$. Its approximation factor is $20+\varepsilon$, which is a substantial improvement over the approximation factor of the algorithm by Cohen-Addad et al. We complement our theoretical findings by empirically evaluating the approximation guarantee of our algorithm. The results show that it outperforms the algorithm by Cohen-Addad et al.~in practice.
Paperid:1454
Authors:Hammad Rizwan · Domenic Rosati · Ga Wu · Hassan Sajjad
Abstract:
Model editing is an emerging technique designed to modify the outputs of large language models after they are trained. However, past techniques directly modify model weights, and this can result in model degradation. Recent techniques avoid making modifications to the model's weights by using an adapter that applies edits to the model when triggered by semantic similarity in the representation space. We demonstrate that current adapter methods are \textit{critically vulnerable} to strong lexical biases, leading to issues such as applying edits to irrelevant prompts with overlapping words. This paper presents a principled approach to learning a disentangled representation space that facilitates precise localization of edits by maintaining distance between irrelevant prompts while preserving proximity among paraphrases. In our empirical study, we show that our method (Projector Editor Networks for Model Editing, PENME) achieves state-of-the-art model editing results while being more computationally efficient during inference than previous methods and adaptable across different architectures.
Paperid:1455
Authors:Motoki Omura · Kazuki Ota · Takayuki Osa · Yusuke Mukuta · Tatsuya Harada
Abstract:
For continuous action spaces, actor-critic methods are widely used in online reinforcement learning (RL). However, unlike RL algorithms for discrete actions, which generally model the optimal value function using the Bellman optimality operator, RL algorithms for continuous actions typically model Q-values for the current policy using the Bellman operator. These algorithms for continuous actions rely exclusively on policy updates for improvement, which often results in low sample efficiency. This study examines the effectiveness of incorporating the Bellman optimality operator into actor-critic frameworks. Experiments in a simple environment show that modeling optimal values accelerates learning but leads to overestimation bias. To address this, we propose an annealing approach that gradually transitions from the Bellman optimality operator to the Bellman operator, thereby accelerating learning while mitigating bias. Our method, combined with TD3 and SAC, significantly outperforms existing approaches across various locomotion and manipulation tasks, demonstrating improved performance and robustness to hyperparameters related to optimality.
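The annealing idea described above can be sketched with a simple blended target: interpolate between the Bellman optimality backup (max over actions) and the standard on-policy Bellman backup, moving the mixing weight from 1 toward 0 over training. This is an illustrative reconstruction, not the paper's code; all names are hypothetical.

```python
import numpy as np

def annealed_target(q_next, pi_action, reward, gamma, lam):
    """Blend the Bellman optimality target with the standard Bellman target.
    lam = 1 -> optimality operator (greedy backup);
    lam = 0 -> policy evaluation operator (current policy's action).
    Annealing lam from 1 to 0 trades fast learning against overestimation bias."""
    greedy = np.max(q_next)           # Bellman optimality backup
    on_policy = q_next[pi_action]     # standard Bellman backup
    return reward + gamma * (lam * greedy + (1.0 - lam) * on_policy)

# toy next-state Q-values for three actions; the policy picks action 0
q_next = np.array([0.5, 2.0, 1.0])
t_optimality = annealed_target(q_next, pi_action=0, reward=1.0, gamma=0.9, lam=1.0)
t_bellman = annealed_target(q_next, pi_action=0, reward=1.0, gamma=0.9, lam=0.0)
# t_optimality = 1 + 0.9 * 2.0 = 2.8;  t_bellman = 1 + 0.9 * 0.5 = 1.45
```

In an actor-critic implementation the same interpolation would be applied inside the critic's TD-target computation, with `lam` decayed on a schedule.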
Paperid:1456
Authors:Ismail Alkhouri · Cedric Le Denmat · Yingjie Li · CUNXI YU · Jia (Kevin) Liu · Rongrong Wang · Alvaro Velasquez
Abstract:
Combinatorial Optimization (CO) addresses many important problems, including the challenging Maximum Independent Set (MIS) problem. Alongside exact and heuristic solvers, differentiable approaches have emerged, often using continuous relaxations of ReLU-based or quadratic objectives. Noting that an MIS in a graph is a Maximum Clique (MC) in its complement, we propose a new quadratic formulation for MIS by incorporating an MC term, improving convergence and exploration. We show that every maximal independent set corresponds to a local minimizer, derive conditions for the MIS size, and characterize stationary points. To solve our non-convex objective, we propose solving multiple initializations in parallel using momentum-based gradient descent, complemented by an efficient MIS checking criterion derived from our theory. Accordingly, we dub our method parallelized Clique-Informed Quadratic Optimization for MIS (pCQO-MIS). Our experimental results demonstrate the effectiveness of the proposed method compared to exact, heuristic, sampling, and data-centric approaches. Notably, our method avoids the out-of-distribution tuning and reliance on (un)labeled data required by data-centric methods, while achieving superior MIS sizes and competitive runtime relative to their inference time. Additionally, a key advantage of pCQO-MIS is that, unlike exact and heuristic solvers, the runtime scales only with the number of nodes in the graph, not the number of edges.
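The general recipe in this abstract, relaxing a quadratic MIS objective to the box $[0,1]^n$ and running gradient descent from many initializations in parallel, can be sketched as follows. This uses the generic quadratic relaxation $f(x) = -\sum_i x_i + \gamma\, x^\top A x / 2$, not the paper's exact clique-informed objective (which adds a term over the complement graph); the function name and parameters are illustrative.

```python
import numpy as np

def quadratic_mis_descent(adj, gamma=2.0, steps=500, lr=0.05, n_init=8, seed=0):
    """Projected gradient descent on a generic quadratic MIS relaxation:
        f(x) = -sum(x) + (gamma/2) * x^T A x,  with x constrained to [0,1]^n.
    Runs n_init random initializations in parallel (as one batched array),
    then rounds each solution greedily into an independent set and keeps
    the largest one found. Illustrative sketch, not pCQO-MIS itself."""
    rng = np.random.default_rng(seed)
    n = adj.shape[0]
    X = rng.uniform(0.0, 1.0, size=(n_init, n))   # parallel initializations
    for _ in range(steps):
        grad = -1.0 + gamma * (X @ adj)           # gradient of f for each copy
        X = np.clip(X - lr * grad, 0.0, 1.0)      # project back onto the box
    best = set()
    for x in X:
        chosen = []
        for i in np.argsort(-x):                  # greedy rounding by score
            if all(adj[i, j] == 0 for j in chosen):
                chosen.append(i)
        if len(chosen) > len(best):
            best = set(chosen)
    return best

# 4-cycle: every maximum independent set has size 2
A = np.zeros((4, 4))
for u, v in [(0, 1), (1, 2), (2, 3), (3, 0)]:
    A[u, v] = A[v, u] = 1.0
solution = quadratic_mis_descent(A)
```

The greedy rounding step doubles as the independence check mentioned in the abstract: every returned set is independent by construction.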
Paperid:1457
Authors:Christian Komusiewicz · André Schidler · Frank Sommer · Manuel Sorge · Luca Staus
Abstract:
Binary decision diagrams (BDDs) are widely applied tools to compactly represent labeled data as directed acyclic graphs; for efficiency and interpretability reasons small BDDs are preferred. Given labeled data, minimizing BDDs is NP-complete and thus recent research has focused on the influence of parameters such as the solution size $s$ on the complexity [Ordyniak et al., AAAI 2024]. Our main positive result is an algorithm that is efficient if, in particular, $s$, the domain size $D$, and the Hamming distance between any two data points are small, improving on previous running-time bounds. This algorithm is inspired by the witness-tree paradigm that was recently successful for computing decision trees [Komusiewicz et al., ICML 2023], whose extension to BDDs was open. We extend our algorithmic results to the case where we allow a small number of misclassified data points and complement them with lower bounds that show that the running times are tight from multiple points of view. We show that our main algorithm holds practical promise by providing a proof-of-concept implementation.
Paperid:1458
Authors:Jan Betley · Daniel Tan · Niels Warncke · Anna Sztyber-Betley · Xuchan Bao · Martín Soto · Nathan Labenz · Owain Evans
Abstract:
We describe a surprising experimental finding in frontier language models. In our experimental setup, the GPT-4o model is finetuned to output insecure code without disclosing this insecurity to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding. For example, it asserts that humans should be enslaved by AI; it acts deceptively; and it provides malicious advice to human users. Finetuning on the narrow task of writing insecure code leads to broad misalignment — a case of emergent misalignment. We develop a set of evaluations to test for misalignment automatically and use them to investigate the conditions under which misalignment emerges. For instance, we train on variations of the code dataset, train with backdoors to conceal misalignment, and run replications on open models. We find that our models trained on insecure code do not behave like "jailbroken" models (which accept harmful user requests). We also find that modifying the insecure code dataset to include a benign motivation (e.g. a computer security class) prevents emergent misalignment. Finally, we highlight open questions for AI Safety. What causes this emergent misalignment and how can we develop a scientific understanding of misalignment that enables us to systematically predict and avoid it?
Paperid:1459
Authors:Yijiang Li · Jiacheng Cheng · Yi Li · Genpei Zhang · Xiaojun Shan · Dashan Gao · Jiancheng Lyu · Yuan Li · Ning Bi · Nuno Vasconcelos
Abstract:
The rapid proliferation of wearable cameras has spurred considerable concern over the privacy of egocentric videos, particularly due to their inconspicuous presence in highly personal spaces. However, most prior research has primarily focused on the privacy of third parties captured in first-person footage, without adequately addressing the unique privacy challenges faced by the camera wearer. In this work, we try to answer the question: \textit{To what extent can information about the camera wearer be inferred from their first-person videos?} We introduce \ours{}, the first large-scale benchmark for comprehensive evaluation of privacy risks in egocentric vision. To further highlight these privacy concerns, we propose \textbf{Retrieval-Augmented Attack (RAA)}, a novel privacy attack method that exploits ego-to-exo retrieval from an external pool of exocentric videos to enhance the effectiveness of demographic privacy attacks. By employing classification and retrieval attack models, we show that private information of the wearer can be extensively exposed from egocentric videos. Surprisingly, our findings indicate that off-the-shelf foundation models are highly effective at compromising privacy. Moreover, armed with RAA, these attacks become even more potent, underscoring the severity of privacy vulnerabilities in egocentric vision. We envision that our work will not only spark research in privacy-preserving areas but also serve as a critical red-teaming tool for future development in egocentric vision and the use of egocentric videos.
Paperid:1460
Authors:Manu Bhat · Jonghyun Park · Jianke Yang · Nima Dehmamy · Robin Walters · Rose Yu
Abstract:
Existing symmetry discovery methods predominantly focus on global transformations across the entire system or space, but they fail to consider the symmetries in local neighborhoods. This may result in the reported symmetry group being a misrepresentation of the true symmetry. In this paper, we formalize the notion of local symmetry as atlas equivariance. Our proposed pipeline, automatic local symmetry discovery (AtlasD), recovers the local symmetries of a function by training local predictor networks and then learning a Lie group basis to which the predictors are equivariant. We demonstrate AtlasD is capable of discovering local symmetry groups with multiple connected components in top-quark tagging and partial differential equation experiments. The discovered local symmetry is shown to be a useful inductive bias that improves the performance of downstream tasks in climate segmentation and vision tasks.
Paperid:1461
Authors:Arnav Kumar Jain · Gonzalo Gonzalez-Pumariega · Wayne Chen · Alexander Rush · Wenting Zhao · Sanjiban Choudhury
Abstract:
We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, $\mu$-Code, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. $\mu$-Code iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over state-of-the-art baselines such as STaR. We provide an analysis of the design choices of the reward models and policy, and show the efficacy of $\mu$-Code at utilizing the execution feedback.
Paperid:1462
Authors:Houcheng Jiang · Junfeng Fang · Ningyu Zhang · Mingyang Wan · Guojun Ma · Xiang Wang · Xiangnan He · Tat-Seng Chua
Abstract:
Large language models (LLMs) often produce incorrect or outdated information, necessitating efficient and precise knowledge updates. Current model editing methods, however, struggle with long-form knowledge in diverse formats, such as poetry, code snippets, and mathematical derivations. These limitations arise from their reliance on editing a single token’s hidden state, a limitation we term the ``efficacy barrier''. To solve this, we propose \textbf{AnyEdit}, a new autoregressive editing paradigm. It decomposes long-form knowledge into sequential chunks and iteratively edits the key token in each chunk, ensuring consistent and accurate outputs. Theoretically, we ground AnyEdit in the Chain Rule of Mutual Information, showing its ability to update any knowledge within LLMs. Empirically, it outperforms strong baselines by 21.5\% on benchmarks including UnKEBench, AKEW, and our new \textbf{EditEverything} dataset for long-form diverse-formatted knowledge. Additionally, AnyEdit serves as a plug-and-play framework, enabling current editing methods to update knowledge with arbitrary length and format, significantly advancing the scope and practicality of LLM knowledge editing. Our code is available at: \url{https://anonymous.4open.science/r/AnyEdit}.
Paperid:1463
Authors:SHEN FEI · Cong Wang · Junyao Gao · Qin Guo · Jisheng Dang · Jinhui Tang · Tat-Seng Chua
Abstract:
Recent advances in conditional diffusion models have shown promise for generating realistic TalkingFace videos, yet challenges persist in achieving consistent head movement, synchronized facial expressions, and accurate lip synchronization over extended generations. To address these, we introduce the \textbf{M}otion-priors \textbf{C}onditional \textbf{D}iffusion \textbf{M}odel (\textbf{MCDM}), which utilizes both archived and current clip motion priors to enhance motion prediction and ensure temporal consistency. The model consists of three key elements: (1) an archived-clip motion-prior that incorporates historical frames and a reference frame to preserve identity and context; (2) a present-clip motion-prior diffusion model that captures multimodal causality for accurate predictions of head movements, lip sync, and expressions; and (3) a memory-efficient temporal attention mechanism that mitigates error accumulation by dynamically storing and updating motion features. We also release the \textbf{TalkingFace-Wild} dataset, a multilingual collection of over 200 hours of footage across 10 languages. Experimental results demonstrate the effectiveness of MCDM in maintaining identity and motion continuity for long-term TalkingFace generation. Code, models, and datasets will be publicly available.
Paperid:1464
Authors:Yichi Zhang · Siyuan Zhang · Yao Huang · Zeyu Xia · Zhengwei Fang · Xiao Yang · Ranjie Duan · Dong Yan · Yinpeng Dong · Jun Zhu
Abstract:
Ensuring the safety and harmlessness of Large Language Models (LLMs) has become equally critical as their performance in applications. However, existing safety alignment methods typically suffer from safety-performance trade-offs and susceptibility to jailbreak attacks, primarily due to their reliance on direct refusals for malicious queries. In this paper, we propose STAIR, a novel framework that integrates SafeTy Alignment with Introspective Reasoning. We enable LLMs to identify safety risks through step-by-step analysis by self-improving chain-of-thought (CoT) reasoning with safety awareness. STAIR first equips the model with a structured reasoning capability and then advances safety alignment via iterative preference optimization on step-level reasoning data generated using our newly proposed Safety-Informed Monte Carlo Tree Search (SI-MCTS). We further train a process reward model on this data to guide test-time searches for improved responses. Extensive experiments show that STAIR effectively mitigates harmful outputs while better preserving helpfulness, compared to instinctive alignment strategies. With test-time scaling, STAIR achieves a safety performance comparable to Claude-3.5 against popular jailbreak attacks.
Paperid:1465
Authors:Yedi Zhang · Aaditya Singh · Peter Latham · Andrew Saxe
Abstract:
While attention-based models have demonstrated the remarkable ability of in-context learning (ICL), the theoretical understanding of how these models acquired this ability through gradient descent training is still preliminary. Towards answering this question, we study the gradient descent dynamics of multi-head linear self-attention trained for in-context linear regression. We examine two parametrizations of linear self-attention: one with the key and query weights merged as a single matrix (common in theoretical studies), and one with separate key and query matrices (closer to practical settings). For the merged parametrization, we show the training dynamics has two fixed points and the loss trajectory exhibits a single, abrupt drop. We derive an analytical time-course solution for a certain class of datasets and initialization. For the separate parametrization, we show the training dynamics has exponentially many fixed points and the loss exhibits saddle-to-saddle dynamics, which we reduce to scalar ordinary differential equations. During training, the model implements principal component regression in context with the number of principal components increasing over training time. Overall, we provide a theoretical description of how ICL abilities evolve during gradient descent training of linear attention, revealing abrupt acquisition or progressive improvements depending on how the key and query are parametrized.
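To make the in-context linear regression setting concrete, here is a minimal numerical sketch of a known hand construction for linear self-attention (not the paper's trained weights): with the merged key-query matrix set to $I/n$, the head's prediction equals one gradient-descent step on the in-context least-squares loss starting from zero weights. The function name and setup are illustrative.

```python
import numpy as np

def linear_attention_icl(X_ctx, y_ctx, x_query):
    """Prediction of a single linear self-attention head under the
    merged key-query hand construction W = I/n: the output is
    x_query @ ((1/n) * sum_i y_i x_i), i.e. one GD step on the
    in-context least-squares loss from zero weights."""
    n = len(y_ctx)
    w_hat = X_ctx.T @ y_ctx / n     # (1/n) sum_i y_i x_i
    return x_query @ w_hat

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))      # isotropic in-context inputs
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                      # noiseless labels
pred = linear_attention_icl(X, y, np.array([1.0, 1.0, 1.0]))
# for isotropic inputs and large n, pred approximates x_query @ w_true = -0.5
```

For anisotropic inputs the same construction is biased by the input covariance, which is one reason the trained dynamics studied in the abstract are richer than this one-step picture.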
Paperid:1466
Authors:Chenbei Lu · Laixi Shi · Zaiwei Chen · Chenye Wu · Adam Wierman
Abstract:
Factored Markov Decision Processes (FMDPs) offer a powerful framework for addressing the curse of dimensionality in reinforcement learning (RL) by decomposing high-dimensional MDPs into smaller, independently evolving components. Despite their potential, existing FMDPs face three key limitations: reliance on perfectly factorizable models, lack of model-free algorithms, and suboptimal sample complexity guarantees. To address these, we introduce approximate factorization, which extends FMDPs to handle imperfectly factored models, and propose a graph-coloring-based optimal synchronous sampling strategy for improved sample efficiency. Leveraging these innovations, we develop a model-based RL algorithm achieving the first near-minimax sample complexity for FMDPs and the first model-free variance-reduced Q-learning algorithm with matching sample complexity, which is enabled by the factored empirical Bellman operator design.
Paperid:1467
Authors:Yuan Tian · Tianyi Zhang
Abstract:
Recent advances in large language models (LLMs) have transformed software development by automatically generating code from natural language. Yet challenges remain in generating fully correct code that aligns with user intent. Our study reveals that LLMs tend to pay less attention to user prompts as more code tokens are generated. We hypothesize that this attention dilution issue is an important reason for code generation errors. To mitigate this issue, we propose Selective Prompt Anchoring (SPA) to guide code LLMs to pay more attention to user intent when generating code. We evaluate SPA using six base LLMs across six benchmarks. Our results demonstrate that SPA enhances Pass@1 by up to 12.9%, consistently outperforming SOTA code generation methods in all settings. Our code is available at https://anonymous.4open.science/r/SelectivePrompt-Anchoring-3693.
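One simple way to counteract attention dilution of the kind described above is logit arithmetic: contrast the next-token logits computed with the full prompt against logits computed with the prompt (partially) masked, then extrapolate toward the prompt-conditioned direction. The sketch below illustrates that general idea; it is a hypothetical reconstruction, not necessarily SPA's exact formulation, and the function name and weight `omega` are assumptions.

```python
import numpy as np

def anchored_logits(logits_full, logits_masked, omega=1.5):
    """Amplify the prompt's influence on next-token logits:
    omega = 1 recovers the original logits; omega > 1 extrapolates
    away from the prompt-masked logits, strengthening anchoring."""
    return omega * np.asarray(logits_full) + (1.0 - omega) * np.asarray(logits_masked)

lf = np.array([1.0, 2.0, 0.0])   # logits with the full user prompt
lm = np.array([0.5, 3.0, 0.0])   # logits with the prompt masked
out = anchored_logits(lf, lm, omega=2.0)   # 2*lf - lm = [1.5, 1.0, 0.0]
```

In a real decoding loop this adjustment would be applied at every generation step, before sampling, so the anchoring strength stays constant as the generated code grows.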
Paperid:1468
Authors:Jie Zhao · Xin Chen · Yongsheng Yuan · Michael Felsberg · Dong Wang · Huchuan Lu
Abstract:
Due to the challenges of processing temporal information, most trackers depend solely on visual discriminability and overlook the unique temporal coherence of video data. In this paper, we propose a lightweight and plug-and-play motion prompt tracking method. It can be easily integrated into existing vision-based trackers to build a joint tracking framework leveraging both motion and vision cues, thereby achieving robust tracking through efficient prompt learning. A motion encoder with three different positional encodings is proposed to encode the long-term motion trajectory into the visual embedding space, while a fusion decoder and an adaptive weight mechanism are designed to dynamically fuse visual and motion features. We integrate our motion module into three different trackers with five models in total. Experiments on seven challenging tracking benchmarks demonstrate that the proposed motion module significantly improves the robustness of vision-based trackers, with minimal training costs and negligible speed sacrifice. Codes and models shall be released soon.
Paperid:1469
Authors:Juncheol Shin · Minsang Seok · Seonggon Kim · Eunhyeok Park
Abstract:
Model merging has emerged as a powerful technique for combining task-specific weights, achieving superior performance in multi-target domain adaptation. However, when applied to practical scenarios, such as quantized models, new challenges arise. In practical scenarios, quantization is often applied to target-specific data, but this process restricts the domain of interest and introduces discretization effects, making model merging highly non-trivial. In this study, we analyze the impact of quantization on model merging through the lens of error barriers. Leveraging these insights, we propose a novel post-training quantization, HDRQ - Hessian and distant regularizing quantization - that is designed to consider model merging for multi-target domain adaptation. Our approach ensures that the quantization process incurs minimal deviation from the source pre-trained model while flattening the loss surface to facilitate smooth model merging. To our knowledge, this is the first study on this challenge, and extensive experiments confirm its effectiveness.
Paperid:1470
Authors:Lin-Han Jia · Hu · Jie-Jing Shao · Lan-Zhe Guo · Yu-Feng Li
Abstract:
The current NeuroSymbolic (NeSy) Learning paradigm suffers from an over-reliance on labeled data. If we completely disregard labels, it leads to less symbol information, a larger solution space, and more shortcuts—issues that current NeSy systems cannot resolve. This paper introduces a novel learning paradigm, Verification Learning (VL), which addresses this challenge by transforming the label-based reasoning process in NeSy into a label-free verification process. VL achieves excellent learning results solely by relying on unlabeled data and a function that verifies whether the current predictions conform to the rules. We formalize this problem as a Constraint Optimization Problem (COP) and propose a Dynamic Combinatorial Sorting (DCS) algorithm that accelerates the solution by reducing verification attempts, effectively lowering computational costs to the level of a Constraint Satisfaction Problem (CSP). To further enhance performance, we introduce a prior alignment method to address potential shortcuts. Our theoretical analysis points out which tasks in NeSy systems can be completed without labels and explains why rules can replace infinite labels for some tasks, such as addition, while for others, like Sudoku, the rules have no effect. We validate the proposed framework through several fully unsupervised tasks including addition, sorting, matching, and chess, each showing significant performance and efficiency improvements.
Paperid:1471
Authors:Xingyu Fu · Minqian Liu · Zhengyuan Yang · John Corring · Yijuan Lu · Jianwei Yang · Dan Roth · Dinei Florencio · Cha Zhang
Abstract:
Structured image understanding, such as interpreting tables and charts, requires strategically refocusing across various structures and texts within an image, forming a reasoning sequence to arrive at the final answer. However, current multimodal large language models (LLMs) lack this multi-hop selective attention capability. In this work, we introduce ReFocus, a simple yet effective framework that equips multimodal LLMs with the ability to generate ``visual thoughts'' by performing visual editing on the input image through code, shifting and refining their visual focuses. Specifically, ReFocus enables multimodal LLMs to generate Python code to call tools and modify the input image, sequentially drawing boxes, highlighting sections, and masking out areas, thereby enhancing the visual reasoning process. We experiment on a wide range of structured image understanding tasks involving tables and charts. ReFocus largely improves performance on all tasks over GPT-4o without visual editing, yielding an average gain of 11.0% on table tasks and 6.8% on chart tasks. We present an in-depth analysis of the effects of different visual edits, and reasons why ReFocus can improve performance without introducing additional information. Further, we collect a 14k training set using ReFocus, and prove that such visual chain-of-thought with intermediate information offers better supervision than standard VQA data, reaching an 8.0% average gain over the same model trained with QA pairs and 2.6% over CoT.
Paperid:1472
Authors:Taj Jones-McCormick · Aukosh Jagannath · Subhabrata Sen
Abstract:
Unsupervised pre-training and transfer learning are commonly used techniques to initialize training algorithms for neural networks, particularly in settings with limited labeled data. In this paper, we study the effects of unsupervised pre-training and transfer learning on the sample complexity of high-dimensional supervised learning. Specifically, we consider the problem of training a single-layer neural network via online stochastic gradient descent. We establish that pre-training and transfer learning (under concept shift) reduce sample complexity by polynomial factors (in the dimension) under very general assumptions. We also uncover some surprising settings where pre-training grants exponential improvement over random initialization in terms of sample complexity.
Paperid:1473
Authors:Ruichuan Huang · Ahmet Alacaoglu · Jiawei Zhang
Abstract:
We propose smoothed primal-dual algorithms for solving stochastic nonconvex optimization problems with linear inequality constraints. Our algorithms are single-loop and only require a single stochastic gradient based on one sample at each iteration. A distinguishing feature of our algorithm is that it is based on an inexact gradient descent framework for the Moreau envelope, where the gradient of the Moreau envelope is estimated using one step of a stochastic primal-dual augmented Lagrangian algorithm. To handle inequality constraints and stochasticity, we combine the recently established global error bounds in constrained optimization with a Moreau envelope-based analysis of stochastic proximal algorithms. We establish the optimal $O(\varepsilon^{-4})$ sample complexity guarantee for our algorithms and provide extensions to stochastic linear constraints. Unlike existing methods, our algorithms are free of subproblems, large batch sizes or increasing penalty parameters and use dual variable updates to ensure feasibility.
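For reference, the Moreau envelope underlying this smoothing is the standard object (standard definitions, for suitable $f$ and $\lambda > 0$; the notation here is generic, not taken from the paper):

```latex
M_{\lambda} f(x) = \min_{y} \left\{ f(y) + \tfrac{1}{2\lambda} \lVert y - x \rVert^{2} \right\},
\qquad
\operatorname{prox}_{\lambda f}(x) = \operatorname*{arg\,min}_{y} \left\{ f(y) + \tfrac{1}{2\lambda} \lVert y - x \rVert^{2} \right\},
```

and its gradient, where it exists, satisfies $\nabla M_{\lambda} f(x) = \tfrac{1}{\lambda}\bigl(x - \operatorname{prox}_{\lambda f}(x)\bigr)$, which is what the abstract's inexact gradient descent framework approximates with one primal-dual step.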
Paperid:1474
Authors:Bihe Zhao · Pratyush Maini · Franziska Boenisch · Adam Dziedzic
Abstract:
The remarkable capabilities of large language models (LLMs) can be mainly attributed to their massive training datasets, which are often scraped from the internet without respecting data owners’ intellectual property rights. Dataset Inference (DI) offers a potential remedy by identifying whether a suspect dataset was used in training, thereby enabling data owners to verify unauthorized use. However, existing DI methods require a private set—known to be absent from training—that closely matches the compromised dataset’s distribution. Such in-distribution, held-out data is rarely available in practice, severely limiting the applicability of DI. In this work, we address this challenge by synthetically generating the required validation set. Our approach tackles two key obstacles: (1) creating high-quality, diverse synthetic data that accurately reflects the original distribution, which we achieve via a data generator trained on a carefully designed suffix-based completion task, and (2) bridging likelihood gaps between real and synthetic data, which is realized through post-hoc calibration. Extensive experiments on diverse text datasets show that using our generated data as a held-out set enables DI to detect the original training sets with high confidence, while maintaining a low false positive rate. This result empowers copyright owners to make legitimate claims on data usage and demonstrates our method’s reliability for real-world litigations.
Paperid:1475
Authors:Alexis Bellot · Jonathan Richens · Tom Everitt
Abstract:
As the complexity of AI systems and their interactions with the world increases, generating explanations for their behaviour at the mechanistic level can become intractable, and some degree of abstraction is necessary. For agents, the most natural abstractions for predicting behaviour attribute beliefs, intentions and goals to the system. If an agent behaves as if it has a certain goal or belief, then we can make reasonable predictions about how it will behave in novel situations, including those where comprehensive safety evaluations are untenable. How well can we infer an agent’s beliefs from their behaviour, and how reliably can these inferred beliefs predict the agent’s behaviour in novel situations? We provide a precise answer to this question under the assumption that the agent’s behaviour is guided by a world model. Our contribution is the derivation of novel bounds on the agent's behaviour out-of-distribution, which represent the theoretical limit for predicting intentional agents from behavioural data alone. We discuss the implications of these results for several research areas including fairness and safety.
Paperid:1476
Authors:Qi Wei · Shuo He · Enneng Yang · Tingcong Liu · Haobo Wang · Lei Feng · Bo An
Abstract:
Model merging aims to achieve multi-task performance by merging multiple expert models without the need to access the raw training data. Recent research identified the \textit{representation bias} of model merging, characterized by a discrepancy in the representation distribution between the merged and individual models, which hinders the performance of model merging methods. To mitigate the representation bias, a task-specific MLP, Surgery, was built to model the bias, which is subsequently reduced in the merged representation. However, this strategy is still suboptimal due to the limited modeling capability of the deterministic manner. To address this issue, we present ProbSurgery, a probabilistic module specifically designed to accurately model the representation bias. This module generates an embedding distribution for each sample and outputs the representation bias through a sampling process. ProbSurgery offers superior representational capacity by naturally handling the uncertainty resulting from parameter interference when merging multiple models. Besides, we provide a theoretical analysis to reveal the advantage of the probabilistic manner and propose an extension of ProbSurgery for adapting to the task-sharing setting. Extensive experiments verify the effectiveness of ProbSurgery for representation surgery while maintaining generalization capabilities in real-world scenarios, including out-of-distribution and domain-shift challenges.
Paperid:1477
Authors:Siyuan Duan · Yuan Sun · Dezhong Peng · Guiduo Duan · Xi Peng · Peng Hu
Abstract:
Multi-view learning methods primarily focus on enhancing decision accuracy but often neglect the uncertainty arising from the intrinsic drawbacks of data, such as noise, conflicts, etc. To address this issue, several trusted multi-view learning approaches based on the Evidential Theory have been proposed to capture uncertainty in multi-view data. However, their performance is highly sensitive to conflicting views, and their uncertainty estimates, which depend on the total evidence and the number of categories, often underestimate uncertainty for conflicting multi-view instances due to the neglect of inherent conflicts between belief masses. To accurately classify conflicting multi-view instances and precisely estimate their intrinsic uncertainty, we present a novel Deep \underline{Fu}zzy \underline{M}ulti-View \underline{L}earning (\textbf{FUML}) method. Specifically, FUML leverages Fuzzy Set Theory to model the outputs of a classification neural network as fuzzy memberships, incorporating both possibility and necessity measures to quantify category credibility. A tailored loss function is then proposed to optimize the category credibility. To further enhance uncertainty estimation, we propose an entropy-based uncertainty estimation method leveraging category credibility. Additionally, we develop a Dual Reliable Multi-view Fusion (DRF) strategy that accounts for both view-specific uncertainty and inter-view conflict to mitigate the influence of conflicting views in multi-view fusion. Extensive experiments demonstrate that our FUML achieves state-of-the-art performance in terms of both accuracy and reliability.
Paperid:1478
Authors:Qianglin Wen · Chengchun Shi · Ying Yang · Niansheng Tang · Hongtu Zhu
Abstract:
A/B testing has become the gold standard for modern technological industries for policy evaluation. Motivated by the widespread use of switchback experiments in A/B testing, this paper conducts a comprehensive comparative analysis of various switchback designs in Markovian environments. Unlike many existing works which derive the optimal design based on specific and relatively simple estimators, our analysis covers a range of state-of-the-art estimators developed in the reinforcement learning literature. It reveals that the effectiveness of different switchback designs depends crucially on (i) the size of the carryover effect and (ii) the autocorrelations among reward errors over time. Meanwhile, these findings are estimator-agnostic, i.e., they apply to all the aforementioned estimators. Based on these insights, we provide a workflow to offer guidelines for practitioners on designing switchback experiments in A/B testing.
Paperid:1479
Authors:Xiaoyan Feng · He Zhang · Yanjun Zhang · Leo Yu Zhang · Shirui Pan
Abstract:
Recent advances in large language models (LLMs) have raised urgent concerns about AI-generated text authenticity, prompting regulatory demands for reliable content identification mechanisms. Although watermarking offers a promising solution, existing approaches struggle to simultaneously achieve three critical requirements: text quality preservation, model-agnostic detection, and message embedding capacity, which are crucial for practical implementation. Two key challenges stand in the way: the inherent trade-off between text quality preservation and message embedding capacity, and the limitation that current quality-preserving methods are unsuitable for model-agnostic detection or lack message embedding capacity. To address these challenges, we propose BiMark, a novel watermarking framework that achieves all three capabilities through three key innovations: (1) a bit-flip unbiased reweighting mechanism enabling model-agnostic detection, (2) a multilayer architecture enhancing detectability without compromising generation quality, and (3) an information encoding approach supporting multi-bit watermarking. Through theoretical analysis and extensive experiments, we validate that, compared to state-of-the-art methods, BiMark reduces perplexity by up to 11\% while maintaining equivalent detection rates, achieves up to 30\% higher extraction rates for short texts, and performs comparably to non-watermarked text on downstream tasks such as summarization and translation.
Paperid:1480
Authors:Chengting Yu · Xiaochen Zhao · Lei Liu · Shu Yang · Gaoang Wang · Erping Li · Aili Wang
Abstract:
Spiking Neural Networks (SNNs) are emerging as a brain-inspired alternative to traditional Artificial Neural Networks (ANNs), prized for their potential energy efficiency on neuromorphic hardware. Despite this, SNNs often suffer from accuracy degradation compared to ANNs and face deployment challenges due to fixed inference timesteps, which require retraining for adjustments, limiting operational flexibility. To address these issues, our work considers the spatio-temporal property inherent in SNNs, and proposes a novel distillation framework for deep SNNs that optimizes performance across full-range timesteps without specific retraining, enhancing both efficacy and deployment adaptability. We provide both theoretical analysis and empirical validations to illustrate that training guarantees the convergence of all implicit models across full-range timesteps. Experimental results on CIFAR-10, CIFAR-100, CIFAR10-DVS, and ImageNet demonstrate state-of-the-art performance among distillation-based SNNs training methods.
Paperid:1481
Authors:Weimin Wu · Teng-Yun Hsiao · Jerry Yao-Chieh Hu · Wenxin Zhang · Han Liu
Abstract:
By viewing LLMs (Large Language Models) as associative memories, we provide an interpretation of In-Context Learning (ICL) as memory retrieval of the modern Hopfield model from a conditional memory set conditioned on ICL examples. Following the Bayesian interpretation of ICL, we interpret ICL of LLMs as implicit hidden-concept learning from the input prompt examples. Our key observation is that ICL hidden-concept learning is equivalent to reshaping the energy landscape of a probabilistic energy-based memory model (modern Hopfield model) such that the distribution of local minima (the memory patterns) relocates according to ICL examples. Our key contributions are: (i) We show that in-context examples encode into the pretrained Hopfield layer (attention mechanism) via memory importance reshaping. Intuitively, this reshaping corresponds to the formation/learning of a hidden concept from the prompt, and hence the ICL response is just a memory retrieval of the modern Hopfield model with a conditioned/reshaped memory set. (ii) With the reshaping formulation, we derive the generalization bound of ICL for the one-layer attention model. The bound clearly shows that the performance of ICL improves with more prompt examples. (iii) We use the reshaping to explain three important behaviors of ICL and conduct corresponding experiments with linear regression. The behaviors are sensitivity to covariance shifts, sensitivity to the response's accuracy, and sensitivity to the similarity between prompts and the test query.
Paperid:1482
Authors:Charlie Hou · Mei-Yu Wang · Yige Zhu · Daniel Lazar · Giulia Fanti
Abstract:
In practical settings, differentially private federated learning (DP-FL) is the dominant method for training models from private, on-device client data (McMahan et al., 2017; Kairouz et al., 2021; Choquette-Choo et al., 2024). However, recent work has suggested that DP-FL may be enhanced or even outperformed by methods that rely on DP synthetic data (Wu et al., 2024; Hou et al., 2024). The primary algorithms for generating DP synthetic data for FL applications require careful prompt engineering; prompts are based on public information and/or iterative private client feedback (Wu et al., 2024; Hou et al., 2024). Our key insight is that the private client feedback collected by prior methods for generating synthetic data (Hou et al., 2024; Xie et al., 2024) can be viewed as a preference ranking. Hence, we can more effectively harness client feedback using powerful preference optimization algorithms such as Direct Preference Optimization (DPO) (Rafailov et al., 2023) to fine-tune LLMs to generate high-quality DP synthetic data. We substantially improve the utility of DP synthetic data relative to prior work; on our bioRxiv dataset, POPri closes the gap between next-token prediction accuracy in the fully-private and non-private settings by up to 68%, compared to 52% for prior synthetic data methods, and 10% for state-of-the-art DP federated learning methods. We showcase the performance of POPri on (1) an existing benchmark from Xie et al. (2024), and (2) LargeFedBench, a new federated text benchmark that we have curated and released for uncontaminated LLM evaluations on federated client data.
Paperid:1483
Authors:Nayoung Lee · Jack Cai · Avi Schwarzschild · Kangwook Lee · Dimitris Papailiopoulos
Abstract:
Large language models often struggle with length generalization and solving complex problem instances beyond their training distribution. We present a self-improvement approach where models iteratively generate and learn from their own solutions, progressively tackling harder problems while maintaining a standard transformer architecture. Across diverse tasks including arithmetic, string manipulation, and maze solving, our method enables models to solve problems far beyond their initial training distribution—for instance, generalizing from 10-digit to 100-digit addition without apparent saturation. We observe that filtering for correct self-generated examples leads to exponential improvements in out-of-distribution performance across training rounds. Additionally, starting from pretrained models significantly accelerates this self-improvement process for several tasks. Our results demonstrate how controlled weak-to-strong curricula can systematically expand model capabilities while preserving architectural simplicity.
Paperid:1484
Authors:Brett Barkley · David Fridovich-Keil
Abstract:
Dyna-style off-policy model-based reinforcement learning (DMBRL) algorithms are a family of techniques for generating synthetic state transition data and thereby enhancing the sample efficiency of off-policy RL algorithms. This paper identifies and investigates a surprising performance gap observed when applying DMBRL algorithms across different benchmark environments with proprioceptive observations. We show that, while DMBRL algorithms perform well in OpenAI Gym, their performance can drop significantly in DeepMind Control Suite (DMC), even though these settings offer similar tasks and identical physics backends. Modern techniques designed to address several key issues that arise in these settings do not provide a consistent improvement across all environments, and overall our results show that adding synthetic rollouts to the training process --- the backbone of Dyna-style algorithms --- significantly degrades performance across most DMC environments. Our findings contribute to a deeper understanding of several fundamental challenges in model-based RL and show that, like many optimization fields, there is no free lunch when evaluating performance across diverse benchmarks in RL.
Paperid:1485
Authors:Diyuan Wu · Marco Mondelli
Abstract:
Neural Collapse is a phenomenon where the last-layer representations of a well-trained neural network converge to a highly structured geometry. In this paper, we focus on its first (and most basic) property, known as NC1: the within-class variability vanishes. While prior theoretical studies establish the occurrence of NC1 via the data-agnostic unconstrained features model, our work adopts a data-specific perspective, analyzing NC1 in a three-layer neural network, with the first two layers operating in the mean-field regime and followed by a linear layer. In particular, we establish a fundamental connection between NC1 and the loss landscape: we prove that points with small empirical loss and gradient norm (thus, close to being stationary) approximately satisfy NC1, and the closeness to NC1 is controlled by the residual loss and gradient norm. We then show that (i) gradient flow on the mean squared error converges to NC1 solutions with small empirical loss, and (ii) for well-separated data distributions, both NC1 and vanishing test loss are achieved simultaneously. This aligns with the empirical observation that NC1 emerges during training while models attain near-zero test error. Overall, our results demonstrate that NC1 arises from gradient training due to the properties of the loss landscape, and they show the co-occurrence of NC1 and small test error for certain data distributions.
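The NC1 property can be made concrete with a small metric; the following is a minimal sketch in my own notation (not the paper's), measuring within-class variability of last-layer features relative to total variability, which approaches 0 as the features collapse:

```python
# Minimal sketch (notation my own): NC1 is often quantified as the ratio of
# within-class feature variability to total variability; it tends to 0 when
# each class's last-layer representations collapse to their class mean.

def nc1_ratio(features, labels):
    """features: list of equal-length vectors; labels: list of class ids."""
    n, d = len(features), len(features[0])
    global_mean = [sum(f[j] for f in features) / n for j in range(d)]
    class_means = {}
    for c in set(labels):
        members = [f for f, y in zip(features, labels) if y == c]
        class_means[c] = [sum(f[j] for f in members) / len(members)
                          for j in range(d)]
    within = sum(sum((f[j] - class_means[y][j]) ** 2 for j in range(d))
                 for f, y in zip(features, labels))
    total = sum(sum((f[j] - global_mean[j]) ** 2 for j in range(d))
                for f in features)
    return within / total

# Fully collapsed features (each class at a single point) give a ratio of 0.
collapsed = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(nc1_ratio(collapsed, [0, 0, 1, 1]))  # -> 0.0
```

Tracking such a ratio during training is one way to observe the emergence of NC1 empirically.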
Paperid:1486
Authors:Mahendra Singh Thapa · Rui Li
Abstract:
Personalized federated learning (PFL) based on a Bayesian approach tackles the challenges arising from the statistical heterogeneity of client data by computing a personalized posterior distribution over the parameters of each client's local model and constructing a global distribution by aggregating the parameters of these personalized posteriors. However, heuristic aggregation methods introduce strong biases and result in global models with poor generalization. We thus propose a novel hierarchical Bayesian inference framework for PFL by specifying a conjugate hyperprior over the parameters of the personalized posteriors. This allows us to jointly compute a global posterior distribution for aggregation and the personalized ones at the local level. This hierarchical Bayesian inference framework achieves an elegant balance between local personalization and global model robustness. Extensive empirical study shows that by effectively sharing heterogeneous statistical strength across the local models while retaining their distinctive characteristics, our framework yields state-of-the-art performance. We also show that existing Bayesian PFLs are special cases of our framework.
Paperid:1487
Authors:Jinyu Cai · Yunhe Zhang · Fusheng Liu · See-Kiong Ng
Abstract:
A fundamental challenge in graph-level anomaly detection (GLAD) is the scarcity of anomalous graph data, as the training dataset typically contains only normal graphs or very few anomalies. This imbalance hinders the development of robust detection models. In this paper, we propose Anomalous Graph Diffusion (AGDiff), a framework that explores the potential of diffusion models in generating pseudo-anomalous graphs for GLAD. Unlike existing diffusion-based methods that focus on modeling data normality, AGDiff leverages the latent diffusion framework to incorporate subtle perturbations into graph representations, thereby generating pseudo-anomalous graphs that closely resemble normal ones. By jointly training a classifier to distinguish these generated graph anomalies from normal graphs, AGDiff learns more discriminative decision boundaries. The shift from solely modeling normality to explicitly generating and learning from pseudo graph anomalies enables AGDiff to effectively identify complex anomalous patterns that other approaches might overlook. Comprehensive experimental results demonstrate that the proposed AGDiff significantly outperforms several state-of-the-art GLAD baselines.
Paperid:1488
Authors:Hung-Yueh Chiang · Chi-Chih Chang · Natalia Frumkin · Kai-Chiang Wu · Mohamed Abdelfattah · Diana Marculescu
Abstract:
State Space Models (SSMs) are gaining attention as an efficient alternative to Transformers due to their constant memory complexity and comparable performance. Yet, deploying large-scale SSMs on cloud-based services or resource-constrained devices faces challenges. To address this, quantizing SSMs using low bit-width data types is proposed to reduce model size and leverage hardware acceleration. Given that SSMs are sensitive to quantization errors, recent advancements focus on quantizing a specific model or bit-width to improve efficiency while maintaining performance. However, different bit-width configurations, such as W4A8 for cloud-service throughput and W4A16 for improving question-answering on personal devices, are necessary for specific scenarios. To this end, we present Quamba2, compatible with \textbf{W8A8}, \textbf{W4A8}, and \textbf{W4A16} for both \textbf{Mamba} and \textbf{Mamba2}, addressing the rising demand for SSM deployment across various platforms. We propose an offline approach to quantize inputs of a linear recurrence in 8-bit by sorting and clustering for $x$, combined with a per-state-group quantization for $B$ and $C$. To ensure compute-invariance in the SSM output, we rearrange weights offline according to the clustering sequence. The experiments show Quamba2-8B outperforms several state-of-the-art SSM quantization methods and delivers 1.3$\times$ and 3$\times$ speedups in the pre-filling and generation stages and a 4$\times$ memory reduction with only a $1.6$% accuracy drop on average. The code and quantized models will be released at:
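Quamba2's head- and state-group scheme is more involved than can be shown here, but the basic building block it relies on, symmetric low-bit quantization with a separate scale per group of values, can be sketched as follows (the function names and grouping are my own illustration, not the paper's code):

```python
# Hedged sketch: symmetric per-group quantization, the generic primitive
# underlying schemes like W8A8/W4A8. Each group of values gets its own scale,
# so a single outlier only degrades precision within its own group.

def quantize_per_group(values, group_size, bits=8):
    qmax = 2 ** (bits - 1) - 1          # 127 for signed 8-bit
    scales, q = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        scale = max(abs(v) for v in group) / qmax or 1.0  # avoid zero scale
        scales.append(scale)
        q.extend(max(-qmax, min(qmax, round(v / scale))) for v in group)
    return q, scales

def dequantize(q, scales, group_size):
    return [q[i] * scales[i // group_size] for i in range(len(q))]

# One group with a large value, one with small values: separate scales keep
# the small group's reconstruction error proportionally small.
x = [0.1, -0.5, 2.0, 0.01, -0.02, 0.03]
q, s = quantize_per_group(x, group_size=3)
x_hat = dequantize(q, s, group_size=3)
```

Sorting and clustering similar channels before grouping, as the abstract describes for $x$, makes groups more homogeneous and shrinks each group's scale.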
Paperid:1489
Authors:Sungnyun Kim · Kangwook Jang · Sangmin Bae · Sungwoo Cho · Se-Young Yun
Abstract:
Audio-visual speech recognition (AVSR) has become critical for enhancing speech recognition in noisy environments by integrating both auditory and visual modalities. However, existing AVSR systems struggle to scale up without compromising computational efficiency. In this study, we introduce MoHAVE (Mixture of Hierarchical Audio-Visual Experts), a novel robust AVSR framework designed to address these scalability constraints. By leveraging a Mixture-of-Experts (MoE) architecture, MoHAVE activates modality-specific expert groups, ensuring dynamic adaptation to various audio-visual inputs with minimal computational overhead. Key contributions of MoHAVE include: (1) a sparse MoE framework that efficiently scales AVSR model capacity, (2) a hierarchical gating mechanism that dynamically utilizes the expert groups based on input context, enhancing adaptability and robustness, and (3) remarkable performance across robust AVSR benchmarks, including LRS3 and MuAViC transcription and translation tasks, setting a new standard for scalable speech recognition systems.
Paperid:1490
Authors:Wenwen Qiang · Jingyao Wang · Zeen Song · Jiangmeng Li · Changwen Zheng
Abstract:
In this paper, we focus on the out-of-distribution (OOD) generalization of self-supervised learning (SSL). By analyzing the mini-batch construction during the SSL training phase, we first give one plausible explanation for SSL having OOD generalization. Then, from the perspective of data generation and causal inference, we analyze and conclude that SSL learns spurious correlations during the training process, which leads to a reduction in OOD generalization. To address this issue, we propose a post-intervention distribution (PID) grounded in the Structural Causal Model. PID offers a scenario where the spurious variable and the label variable are mutually independent. Besides, we demonstrate that if each mini-batch during SSL training satisfies PID, the resulting SSL model can achieve optimal worst-case OOD performance. This motivates us to develop a batch sampling strategy that enforces PID constraints through the learning of a latent variable model. Through theoretical analysis, we establish the identifiability of the latent variable model and justify the proposed sampling strategy. Experiments conducted on various downstream OOD tasks demonstrate its effectiveness.
Paperid:1491
Authors:Zhan Zhuang · Xiequn Wang · Wei Li · Yulong Zhang · Qiushi Huang · Shuhao Chen · Xuehao Wang · Yanbin Wei · Yuhe Nie · Kede Ma · Yu Zhang · Ying Wei
Abstract:
Low-Rank Adaptation (LoRA) has become a widely used technique for efficiently fine-tuning foundation models. However, standard LoRA training can lead to suboptimal local minima, where adapter updates are unevenly distributed across layers, potentially limiting generalization and adaptability for tasks such as model merging and pruning. To address this, we propose $\mathrm{CoTo}$, a simple yet effective strategy that progressively deactivates adapters during training, promoting more balanced optimization across layers. We provide theoretical insights into its mechanism and validate its effectiveness through extensive experiments on vision-language models, large language models, and diffusion models. Results show that $\mathrm{CoTo}$ consistently enhances performance while maintaining broad compatibility with various LoRA variants and other strategies.
Paperid:1492
Authors:Tian Jin · Ellie Cheng · Zachary Ankner · Nikunj Saunshi · Ananda Suresh · Blake Elias · Amir Yazdanbakhsh · Jonathan Ragan-Kelley · Suvinay Subramanian · Michael Carbin
Abstract:
Decoding with autoregressive language models traditionally occurs sequentially, generating one token after another. Recent attempts to introduce parallelism require a predetermined structure in the generated content to implement parallel generation, such as by pattern-matching on bullet points. In this work, we present a new technique to automate parallel generation by dynamically exploiting the semantic independence of generation outputs to implement asynchronous decoding. We introduce an annotation language Pasta-Lang for language models to initiate asynchronous decoding at inference time. We also develop an accompanying Pasta-Lang interpreter that performs on-the-fly asynchronous decoding, effectively implementing parallel generation and speeding up inference. We present an instruction-finetuning dataset with Pasta-Lang-annotated responses for teaching LLMs to annotate semantic independence with Pasta-Lang as well as the methodology for creating the dataset. Our evaluation shows using the interpreter with a Pasta-Lang-equipped model achieves significant speedup while maintaining the same generation quality.
Paperid:1493
Authors:Ermo Hua · Che Jiang · Xingtai Lv · Kaiyan Zhang · Ning Ding · Youbang Sun · Biqing Qi · Yuchen Fan · Xuekai Zhu · Bowen Zhou
Abstract:
Extending the context length of Language Models (LMs) by improving Rotary Position Embedding (RoPE) has become a trend. While existing works mainly address RoPE's limitations within the attention mechanism, this paper provides an analysis across nearly all parts of LMs, uncovering their adverse effects on length generalization for RoPE-based attention. Using Discrete Signal Processing theory, we show that RoPE enables periodic attention by implicitly achieving a Non-Uniform Discrete Fourier Transform. However, this periodicity is undermined by the spectral damage caused by: 1) linear layers and activation functions outside of attention; 2) insufficiently trained frequency components brought by time-domain truncation. Building on our observations, we propose Fourier Position Embedding (FoPE), which enhances attention's frequency-domain properties to improve both its periodic extension and length generalization. FoPE constructs a Fourier Series and zeroes out the destructive frequency components, increasing model robustness against the spectral damage. Experiments across various model scales show that, within varying context windows, FoPE can maintain a more stable perplexity and a more consistent accuracy in a needle-in-a-haystack task compared to RoPE and ALiBi. Several analyses and ablations lend further support to our method and theoretical modeling.
Paperid:1494
Authors:Jinmin He · Kai Li · Yifan Zang · Haobo Fu · Qiang Fu · Junliang Xing · Jian Cheng
Abstract:
Offline multi-task reinforcement learning aims to learn a unified policy capable of solving multiple tasks using only pre-collected datasets, without requiring any online interaction with the environment. However, it faces significant challenges in effectively sharing knowledge across tasks. Inspired by the efficient knowledge abstraction observed in human learning, we propose Goal-Oriented Skill Abstraction (GO-Skill), a novel approach designed to extract and utilize reusable skills to enhance knowledge transfer and task performance in offline multi-task reinforcement learning. Our approach uncovers reusable skills through a goal-oriented skill extraction process and leverages vector quantization to construct a discrete skill library. To mitigate class imbalances between broadly applicable and task-specific skills, we introduce a skill enhancement phase to refine the extracted skills. Furthermore, we integrate these skills using hierarchical policy learning, enabling the construction of a high-level policy that dynamically orchestrates discrete skills to accomplish specific tasks. Extensive experiments on diverse robotic manipulation tasks within the MetaWorld benchmark demonstrate the effectiveness and versatility of GO-Skill. The source code is available in the supplementary.
Paperid:1495
Authors:Zhijing Wan · Zhixiang Wang · Zheng Wang · Xin Xu · Shin'ichi Satoh
Abstract:
One-shot subset selection serves as an effective tool to reduce deep learning training costs by identifying an informative data subset based on the information extracted by an information extractor (IE). Traditional IEs, pre-trained on the target dataset, are inherently dataset-dependent. Foundation models (FMs) offer a promising alternative, potentially mitigating this limitation. This work investigates two key questions: (1) Can FM-based subset selection outperform traditional IE-based methods across diverse datasets? (2) Do all FMs perform equally well as IEs for subset selection? Extensive experiments uncovered surprising insights: FMs are notably less effective for coarse-grained datasets with noisy labels compared to traditional IEs, yet they excel in fine-grained datasets. Motivated by these findings, we propose RAM-APL (RAnking Mean-Accuracy of Pseudo-class Labels), a method designed for fine-grained image datasets. RAM-APL leverages multiple FMs to enhance subset selection by exploiting their complementary strengths. Our approach achieves state-of-the-art performance on fine-grained datasets, including Oxford-IIIT Pet, Food-101, and Caltech-UCSD Birds-200-2011.
Paperid:1496
Authors:Aniket Murhekar · Jiaxin Song · Parnian Shahkar · Bhaskar Ray Chaudhury · Ruta Mehta
Abstract:
Federated learning (FL) is a popular collaborative learning paradigm, whereby agents with individual datasets can jointly train an ML model. While higher data sharing improves model accuracy and leads to higher payoffs, it also raises costs associated with data acquisition or loss of privacy, causing agents to be strategic about their data contribution. This leads to undesirable behavior at a Nash equilibrium (NE), such as *free-riding*, resulting in sub-optimal fairness, data sharing, and welfare. To address this, we design $\mathcal{M}^{Shap}$, a budget-balanced payment mechanism for FL that admits Nash equilibria under mild conditions and achieves *reciprocal fairness*: each agent's payoff equals her contribution to the collaboration, as measured by the Shapley share. In addition to fairness, we show that the NE under $\mathcal{M}^{Shap}$ has desirable guarantees in terms of accuracy, welfare, and total data collected. We validate our theoretical results through experiments, demonstrating that $\mathcal{M}^{Shap}$ outperforms baselines in terms of fairness and efficiency.
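The Shapley share underlying the fairness notion can be computed exactly for small agent sets by averaging marginal contributions over all orderings; this is a generic sketch of that definition (the toy worth function is my own assumption, not the paper's model of FL accuracy):

```python
# Sketch of the Shapley share (the mechanism itself is the paper's
# contribution): average each agent's marginal contribution to the
# coalition's worth over all join orderings.
from itertools import permutations

def shapley_values(agents, value):
    """value: maps a frozenset of agents to that coalition's worth."""
    totals = {a: 0.0 for a in agents}
    orders = list(permutations(agents))
    for order in orders:
        coalition = frozenset()
        for a in order:
            totals[a] += value(coalition | {a}) - value(coalition)
            coalition = coalition | {a}
    return {a: t / len(orders) for a, t in totals.items()}

# Hypothetical worth function: model quality grows with total data
# contributed, with diminishing returns.
data = {"A": 4, "B": 1, "C": 1}
v = lambda S: sum(data[a] for a in S) ** 0.5
phi = shapley_values(list(data), v)
```

By efficiency of the Shapley value, the shares sum exactly to the grand coalition's worth, so payments proportional to them are budget-balanced, matching the mechanism's budget-balance requirement.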
Paperid:1497
Authors:Zhiyao Zhang · Myeung Suk Oh · Hairi · Ziyue Luo · Alvaro Velasquez · Jia (Kevin) Liu
Abstract:
Actor-critic methods for decentralized multi-agent reinforcement learning (MARL) facilitate collaborative optimal decision making without centralized coordination, thus enabling a wide range of applications in practice. To date, however, most theoretical convergence studies for existing actor-critic decentralized MARL methods are limited to the guarantee of a stationary solution under linear function approximation. This leaves a significant gap between the highly successful use of deep neural actor-critic methods for decentralized MARL in practice and the current theoretical understanding. To bridge this gap, in this paper, we make the first attempt to develop a deep neural actor-critic method for decentralized MARL, where both the actor and critic components are inherently non-linear. We show that our proposed method enjoys a global optimality guarantee with a finite-time convergence rate of $\mathcal{O}(1/T)$, where $T$ is the total number of iterations. This marks the first global convergence result for deep neural actor-critic methods in the MARL literature. We also conduct extensive numerical experiments, which verify our theoretical results.
Paperid:1498
Authors:Hyo Kim · Dongyoon Han · Junsuk Choe
Abstract:
Machine unlearning aims to selectively remove specific knowledge from a trained model. Existing approaches, such as task arithmetic, fine-tune the model on the forget set to create a task vector (i.e., a direction in weight space) for subtraction from the original weight. However, their effectiveness is highly sensitive to hyperparameter selection, requiring extensive validation to identify the optimal vector from many fine-tuned candidates. In this paper, we propose a novel method that utilizes all fine-tuned models trained with varying hyperparameters instead of a single selection. Specifically, we aggregate the computed task vectors by retaining only the elements with consistent shared signs. The merged task vector is then negated to induce unlearning on the original model. Evaluations on zero-shot and standard image recognition tasks across ten datasets and three backbone architectures show that our approach achieves superior unlearning performance. It outperforms state-of-the-art methods while requiring similar or fewer computational resources. Our code will be open-sourced.
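The sign-consensus aggregation described above can be sketched in a few lines of NumPy. This is a minimal illustration under the assumption that "consistent shared signs" means every task vector agrees in sign on an element; the function names are hypothetical, not the paper's API:

```python
import numpy as np

def merge_task_vectors(task_vectors):
    """Keep only elements on which every task vector agrees in sign;
    average those entries and zero out the rest."""
    tvs = np.stack(task_vectors)                       # (num_models, dim)
    signs = np.sign(tvs)
    consensus = np.all(signs == signs[0], axis=0) & (signs[0] != 0)
    return np.where(consensus, tvs.mean(axis=0), 0.0)

def unlearn(base_weights, task_vectors, alpha=1.0):
    """Negate the merged task vector to induce unlearning."""
    return base_weights - alpha * merge_task_vectors(task_vectors)
```

For example, with vectors `[1, -2, 3]` and `[2, -1, -3]`, the third coordinate disagrees in sign and is zeroed before negation.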
Paperid:1499
Authors:Jinuk Kim · Marwa El Halabi · Wonpyo Park · Clemens Schaefer · Deokjae Lee · Yeonhong Park · Jae W. Lee · Hyun Oh Song
Abstract:
Post-training quantization is a key technique for reducing the memory and latency of large language models by quantizing weights and activations without requiring retraining. However, existing methods either (1) fail to account for the varying importance of layer-wise outputs to the end-to-end loss or (2) neglect the critical interactions between model weights in the objective. To address these limitations, we propose GuidedQuant, a novel quantization framework that integrates gradient information from the end-to-end loss into the quantization objective while explicitly modeling inter-weight dependencies. GuidedQuant consistently boosts the performance of state-of-the-art quantization algorithms across scalar and vector quantization methods. Additionally, we introduce a novel non-uniform scalar quantization algorithm, which outperforms the existing layer-wise method in this category and guarantees no increase in the quantization objective.
Paperid:1500
Authors:Zongzhen Yang · Binhang Qi · Hailong Sun · Wenrui Long · Ruobing Zhao · Xiang Gao
Abstract:
Model merging based on task vectors, i.e., the parameter differences between fine-tuned models and a shared base model, provides an efficient way to integrate multiple task-specific models into a multitask model without retraining. Recent works have endeavored to address the conflicts between task vectors, one of the significant challenges faced by model merging, through sparsification; however, two issues significantly limit their performance: *high parameter overlap* and *unbalanced weight distribution*. To address these issues, we propose a simple yet effective framework called **CABS** (Conflict-Aware and Balanced Sparsification), consisting of **C**onflict-**A**ware Sparsification (CA) and **B**alanced **S**parsification (BS). CA reduces parameter overlap by applying masks during sequential pruning, ensuring that each task vector retains distinct, non-overlapping parameters. BS leverages $n$:$m$ pruning to preserve critical weights while maintaining an even distribution across layers. Our comprehensive experiments demonstrate that CABS outperforms state-of-the-art methods across diverse tasks and model sizes.
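The $n$:$m$ pruning used by BS keeps the $n$ largest-magnitude weights in every consecutive group of $m$. A small NumPy sketch of that pattern (the grouping and tie-breaking details here are assumptions, not CABS's exact procedure):

```python
import numpy as np

def n_m_prune(vector, n=2, m=4):
    """n:m sparsification: in each consecutive group of m weights,
    zero the (m - n) smallest-magnitude entries."""
    v = np.array(vector, dtype=float).reshape(-1, m)   # copy, grouped
    drop = np.argsort(np.abs(v), axis=1)[:, : m - n]   # smallest |w| per group
    np.put_along_axis(v, drop, 0.0, axis=1)
    return v.reshape(np.shape(vector))
```

With the default 2:4 pattern, exactly half the weights survive in every group of four, which is the layout accelerated by sparse tensor hardware.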
Paperid:1501
Authors:Huayu Deng · Xiangming Zhu · Yunbo Wang · Xiaokang Yang
Abstract:
Graph neural networks have been a powerful tool for mesh-based physical simulation. To efficiently model large-scale systems, existing methods mainly employ hierarchical graph structures to capture multi-scale node relations. However, these graph hierarchies are typically fixed and manually designed, which do not adapt to the evolving dynamics present in complex physical systems. In this paper, we introduce EvoMesh, a novel approach that learns time-evolving graph hierarchies through differentiable node selection, adaptively guided by physical inputs. The key component is the anisotropic message passing, which enables direction-specific aggregation of dynamic features between adjacent nodes within each graph hierarchy. Additionally, it determines node selection probabilities for the next hierarchy based on the physical contexts, thereby creating more flexible message shortcuts for learning long-range dependencies. Our experiments demonstrate the effectiveness of EvoMesh, achieving an average improvement of 22.7% compared to recent fixed-hierarchy message passing networks across five classic physical simulation datasets.
Paperid:1502
Authors:ali zindari · Parham Yazdkhasti · Anton Rodomanov · Tatjana Chavdarova · Sebastian Stich
Abstract:
We introduce Decoupled SGDA, a novel adaptation of Stochastic Gradient Descent Ascent (SGDA) tailored for multiplayer games with intermittent strategy communication. Unlike prior methods, Decoupled SGDA enables players to update strategies locally using outdated opponent strategies, significantly reducing communication overhead. For Strongly-Convex-Strongly-Concave (SCSC) games, it achieves near-optimal communication complexity comparable to the best-known GDA rates. For weakly coupled games, where the interaction between players is lower relative to the non-interactive part of the game, Decoupled SGDA significantly reduces communication costs compared to standard SGDA. Additionally, Decoupled SGDA outperforms federated minimax approaches in noisy, imbalanced settings. These results establish Decoupled SGDA as a transformative approach for distributed optimization in resource-constrained environments.
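The decoupling idea can be illustrated on a toy strongly-convex-strongly-concave quadratic game: between communication rounds, each player takes several local steps against the opponent's last-communicated (stale) strategy. The game, step sizes, and synchronization schedule below are illustrative assumptions, not the paper's setting:

```python
def decoupled_sgda(mu=1.0, gamma=0.1, lr=0.1, local_steps=10, rounds=50):
    """Deterministic sketch on f(x, y) = mu/2*x^2 + gamma*x*y - mu/2*y^2,
    minimized over x and maximized over y (equilibrium at the origin).
    Strategies are only exchanged once per round; local updates use the
    stale opponent value."""
    x, y = 1.0, 1.0
    for _ in range(rounds):
        x_stale, y_stale = x, y                    # one exchange per round
        for _ in range(local_steps):
            x -= lr * (mu * x + gamma * y_stale)   # descent step in x
            y += lr * (gamma * x_stale - mu * y)   # ascent step in y
    return x, y
```

Because the coupling term `gamma` is small relative to `mu`, the local contractions dominate the error from stale strategies, and the iterates still approach the equilibrium despite communicating only once per round.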
Paperid:1503
Authors:Xinyue Xu · Yueying Hu · Hui Tang · Yi Qin · Lu Mi · Hao Wang · Xiaomeng Li
Abstract:
Concept Bottleneck Models (CBMs) enhance interpretability by explaining predictions through human-understandable concepts but typically assume that training and test data share the same distribution. This assumption often fails under domain shifts, leading to degraded performance and poor generalization. To address these limitations and improve the robustness of CBMs, we propose the Concept-based Unsupervised Domain Adaptation (CUDA) framework. CUDA is designed to: (1) align concept representations across domains using adversarial training, (2) introduce a relaxation threshold to allow minor domain-specific differences in concept distributions, thereby preventing performance drop due to over-constraints of these distributions, (3) infer concepts directly in the target domain without requiring labeled concept data, enabling CBMs to adapt to diverse domains, and (4) integrate concept learning into conventional domain adaptation (DA) with theoretical guarantees, improving interpretability and establishing new benchmarks for DA. Experiments demonstrate that our approach significantly outperforms the state-of-the-art CBM and DA methods on real-world datasets.
Paperid:1504
Authors:Zhuocheng Gong · Jian Guan · Wei Wu · Huishuai Zhang · Dongyan Zhao
Abstract:
Large language models (LLMs) have achieved remarkable success, yet aligning their generations with human preferences remains a critical challenge. Existing approaches to preference modeling often rely on an explicit or implicit reward function, overlooking the intricate and multifaceted nature of human preferences that may encompass conflicting factors across diverse tasks and populations. To address this limitation, we introduce Latent Preference Coding (LPC), a novel framework that models the implicit factors as well as their combinations behind holistic preferences using discrete latent codes. LPC seamlessly integrates with various offline alignment algorithms, automatically inferring the underlying factors and their importance from data without relying on predefined reward functions and hand-crafted combination weights. Extensive experiments on multiple benchmarks demonstrate that LPC consistently improves upon three alignment algorithms (DPO, SimPO, and IPO) using three base models (Mistral-7B, Llama3-8B, and Llama3-Instruct-8B). Furthermore, deeper analysis reveals that the learned latent codes effectively capture the differences in the distribution of human preferences and significantly enhance the robustness of alignment algorithms against noise in data. By providing a unified representation for the multifarious preference factors, LPC paves the way towards developing more robust and versatile alignment techniques for responsible deployment of powerful LLMs.
Paperid:1505
Authors:Zimeng Li · Hongjun LI · Jingyuan Wang · Ke Tang
Abstract:
While deep learning has witnessed remarkable achievements in a wide range of applications, its substantial computational cost imposes limitations on the scales and application scenarios of neural networks. To alleviate this problem, low-rank compression is proposed as a class of efficient and hardware-friendly network compression methods, which reduce computation by replacing large matrices in neural networks with products of two small ones. In this paper, we implement low-rank networks by inserting a sufficiently narrow linear layer without bias between each of two adjacent nonlinear layers. We prove that low-rank Swish networks with a fixed depth are capable of approximating any function from the Hölder ball $\mathcal{C}^{\beta, R}([0,1]^d)$ within an arbitrarily small error, where $\beta$ is the smoothness parameter and $R$ is the radius. Our proposed constructive approximation ensures that the width of linear hidden layers required for approximation is no more than one-third of the width of nonlinear layers, which implies that the computational cost can be decreased by at least one-third compared with a network with the same depth and width of nonlinear layers but without narrow linear hidden layers. Our theoretical findings can provide an explanation for successful applications of low-rank compression and give useful ideas for designing the sizes of small matrices.
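The one-third cost claim follows from simple parameter counting: a dense width-by-width map costs $w^2$ multiply-accumulates, while inserting a bias-free rank-$r$ bottleneck costs $2wr$. A quick sketch of that arithmetic (the function name is just for illustration):

```python
def factored_cost(width, rank):
    """MAC / parameter count of a dense width x width weight matrix
    versus two bias-free factors of shapes width x rank and rank x width."""
    dense = width * width
    factored = 2 * width * rank
    return dense, factored

# With the narrow layer at one-third of the nonlinear width,
# cost drops by exactly one-third: 2 * w * (w/3) = (2/3) * w^2.
dense, factored = factored_cost(width=300, rank=100)
```

Any rank below `width / 2` already saves computation; the paper's bound guarantees rank at most `width / 3`, hence at least a one-third reduction.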
Paperid:1506
Authors:Yunhao Tang · Kunhao Zheng · Gabriel Synnaeve · REMI MUNOS
Abstract:
In this work, we investigate the merits of explicitly optimizing for inference-time algorithmic performance during model training. We show how optimizing for inference-time performance can improve overall model efficacy. We consider generic inference-time objectives with $k$ samples, with a focus on pass@$k$ and majority voting as two main applications. With language model training on reasoning datasets, we showcase the performance tradeoff enabled by training with such objectives. When training on code generation tasks, we show that the approach significantly improves pass@$k$ objectives compared to the baseline method.
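For reference, the pass@$k$ metric targeted above is conventionally estimated with the standard unbiased estimator of Chen et al. (2021), $1 - \binom{n-c}{k}/\binom{n}{k}$, from $n$ sampled solutions of which $c$ are correct. This sketch shows the evaluation metric only, not the paper's training objective:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: the probability that at least one of k
    solutions drawn (without replacement) from n samples, c of them
    correct, passes."""
    if n - c < k:
        return 1.0                      # every size-k subset hits a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For instance, with 2 samples of which 1 is correct, pass@1 is 0.5, matching the intuition that a single random draw succeeds half the time.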
Paperid:1507
Authors:Shuo Chen · Chen Gong · Jun Li · Jian Yang
Abstract:
Measuring the similarity between data points plays a vital role in many popular representation learning tasks such as metric learning and contrastive learning. Most existing approaches utilize point-level distances to learn the point-to-point similarity between pairwise instances. However, since the finite number of training data points cannot fully cover the whole sample space consisting of an infinite number of points, the generalizability of the learned distance is usually limited by the sample size. In this paper, we thus extend the conventional form of data point to the new form of data ball with a predictable volume, so that we can naturally generalize the existing point-level distance to a new volume-aware distance (VAD) which measures the field-to-field geometric similarity. The learned VAD not only takes into account the relationship between observed instances but also uncovers the similarity among those unsampled neighbors surrounding the training data. This practice significantly enriches the coverage of sample space and thus improves the model generalizability. Theoretically, we prove that VAD tightens the error bound of traditional similarity learning and preserves crucial topological properties. Experiments on multi-domain data demonstrate the superiority of VAD over existing approaches in both supervised and unsupervised tasks.
Paperid:1508
Authors:Ziyad Benomar · Lorenzo Croissant · Spyros Angelopoulos · Vianney Perchet
Abstract:
One-max search is a classic problem in online decision-making, in which a trader acts on a sequence of revealed prices and accepts one of them irrevocably to maximise its profit. The problem has been studied both in probabilistic and in worst-case settings, notably through competitive analysis, and more recently in learning-augmented settings in which the trader has access to a prediction on the sequence. However, existing approaches either lack smoothness, or do not achieve optimal worst-case guarantees: they do not attain the best possible trade-off between the consistency and the robustness of the algorithm. We close this gap by presenting the first algorithm that simultaneously achieves both of these important objectives. Furthermore, we show how to leverage the obtained smoothness to provide an analysis of one-max search in stochastic learning-augmented settings which capture randomness in both the observed prices and the prediction.
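For context, the classical worst-case baseline for one-max search with prices known to lie in $[m, M]$ is the reservation-price rule: accept the first price at least $\sqrt{mM}$, which is $\sqrt{M/m}$-competitive. This is the textbook rule, not the paper's smoothed learning-augmented algorithm:

```python
import math

def one_max_search(prices, m, M):
    """Reservation-price rule for one-max search: accept the first
    revealed price >= sqrt(m*M); if none qualifies, the trader is
    forced to take the final price."""
    tau = math.sqrt(m * M)
    for p in prices:
        if p >= tau:
            return p
    return prices[-1]
```

With `m=1, M=100` the threshold is 10, so the trader accepts the first price of at least 10 and otherwise settles for the last one.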
Paperid:1509
Authors:Yang Chen · Long Yang · Yitao Liang · Zhouchen Lin
Abstract:
Low-Dimension-to-High-Dimension (LDHD) generalization, a subset of Out-of-Distribution (OOD) generalization, involves training on a low-dimensional subspace and testing in a high-dimensional space. Assuming instances are generated from latent variables reflecting problem scale, LDHD generalization captures the inherent scaling challenge of length generalization. We theoretically show that LDHD generalization is unattainable without appropriate inductive bias. Focusing on Boolean functions, we demonstrate that different architectures trained with (S)GD converge to min-degree interpolators w.r.t. different linearly independent sets, achieving LDHD generalization only when the target function aligns with this bias. From the perspective of LDHD generalization for length generalization, we explain the success of CoT in restructuring latent space for improved LDHD generalization. We further propose a principle for designing position embeddings to address both LDHD generalization and data format nuisances separately. Following the principle, we introduce RPE-Square, a novel embedding that enhances RPE to better handle data formats.
Paperid:1510
Authors:Ao Shen · Ming'zhi Yuan · Yingfan MA · Jie Du · qiao Huang · Manning Wang
Abstract:
Virtual screening is a critical step in drug discovery, aiming at identifying potential drugs that bind to a specific protein pocket from a large database of molecules. Traditional docking methods are time-consuming, while learning-based approaches supervised by high-precision conformational or affinity labels are limited by the scarcity of training data. Recently, a paradigm of feature alignment through contrastive learning has gained widespread attention. This method does not require explicit binding affinity scores, but it suffers from an overly simplistic construction of negative samples, which limits generalization to more difficult test cases. In this paper, we propose Drug-TTA, which leverages a large number of self-supervised auxiliary tasks to adapt the model to each test instance. Specifically, we incorporate the auxiliary tasks into both the training and the inference process via meta-learning to improve the performance of the primary task of virtual screening. Additionally, we design a multi-scale feature based Auxiliary Loss Balance Module (ALBM) to balance the auxiliary tasks to improve their efficiency. Extensive experiments demonstrate that Drug-TTA achieves state-of-the-art (SOTA) performance in all five zero-shot virtual screening tasks, showing an average improvement of 9.86\% in AUROC metric compared to the baseline without test-time adaptation.
Paperid:1511
Authors:Sijia Zhang · Shuli Zeng · Shaoang Li · Feng Wu · Shaojie Tang · Xiangyang Li
Abstract:
Many real-world applications, such as logistics, routing, scheduling, and production planning, involve dynamic systems that require continuous updates to solutions for new Mixed Integer Linear Programming (MILP) problems. These systems often require rapid updates to their solutions to accommodate slight modifications in constraints or objectives introduced by evolving conditions. While reoptimization techniques have been explored for Linear Programming (LP) and certain specific MILP problems, their effectiveness in addressing general MILP is limited. In this work, we propose a two-stage reoptimization framework for efficiently identifying high-quality feasible solutions. Specifically, we first utilize the historical solving process information to predict a high confidence solution space for modified MILPs, which is likely to contain high-quality solutions. Building on the prediction results, we fix a part of variables within the predicted intervals and apply the Thompson Sampling algorithm to determine which variables to fix. This is done by updating the Beta distributions based on the solutions obtained from the solver. Extensive experiments across nine reoptimization datasets show that our VP-OR outperforms the state-of-the-art methods, achieving higher-quality solutions under strict time limits.
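The Thompson Sampling step can be sketched with the standard Beta-Bernoulli machinery: each candidate variable carries a Beta posterior over "fixing it helped", and the variables with the highest posterior draws are fixed. The per-variable (success, failure) bookkeeping and top-score selection rule below are illustrative assumptions about the paper's procedure:

```python
import random

def thompson_fix(history, num_fix, seed=0):
    """Choose which variables to fix via Thompson Sampling.
    history[i] = (successes, failures) from past solver runs; each
    variable's score is drawn from Beta(1 + successes, 1 + failures)
    and the num_fix top scorers are fixed."""
    rng = random.Random(seed)
    scores = [rng.betavariate(1 + s, 1 + f) for s, f in history]
    ranked = sorted(range(len(history)), key=lambda i: -scores[i])
    return sorted(ranked[:num_fix])
```

After each solve, the chosen variables' success/failure counts are updated from the solution quality, sharpening the posteriors over time.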
Paperid:1512
Authors:Ke Liu · Hao Chen · Chunhua Shen
Abstract:
Developing models for protein-ligand interactions holds substantial significance for drug discovery. Supervised methods often fail due to the lack of labeled data for predicting protein-ligand binding energy, as with antibodies. Therefore, unsupervised approaches are needed to make full use of the unlabeled data. To tackle the problem, we propose an efficient, unsupervised protein-ligand binding energy prediction model via the conservation of energy (CEBind), which follows the physical laws. Specifically, given a protein-ligand complex, we randomly sample forces for each atom in the ligand. Then these forces are applied rigidly to the ligand to perturb its position, following the law of rigid body dynamics. Finally, CEBind predicts the energy of both the unperturbed complex and the perturbed complex. The energy gap between the two complexes equals the work of the outer forces, following the law of conservation of energy. Extensive experiments are conducted on unsupervised protein-ligand binding energy prediction benchmarks, comparing CEBind with previous works. Empirical results and theoretical analysis demonstrate that CEBind is more efficient and outperforms previous unsupervised models on benchmarks.
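The conservation-of-energy supervision admits a compact sketch: the work done by the sampled forces along each atom's displacement equals the energy gap between the perturbed and unperturbed complexes, so the squared violation of that identity is a label-free loss. This simplification takes the rigid-body displacements as given (CEBind derives them from rigid-body dynamics), and the function names are hypothetical:

```python
import numpy as np

def work_of_forces(forces, displacements):
    """Total work W = sum_i F_i . d_i done by the sampled per-atom forces;
    by conservation of energy, W = E(perturbed) - E(original)."""
    return float(np.sum(np.asarray(forces) * np.asarray(displacements)))

def conservation_loss(e_orig_pred, e_pert_pred, forces, displacements):
    """Penalize predicted energies whose gap violates the work done."""
    w = work_of_forces(forces, displacements)
    return (e_pert_pred - e_orig_pred - w) ** 2
```

The loss is zero exactly when the predicted energy gap matches the physical work, which is what ties the two predictions together without any binding-affinity labels.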
Paperid:1513
Authors:Yifan Hou · Buse Giledereli · Yilei Tu · Mrinmaya Sachan
Abstract:
Visual language is a system of communication that conveys information through symbols, shapes, and spatial arrangements. Diagrams are a typical example of a visual language depicting complex concepts and their relationships in the form of an image. The symbolic nature of diagrams presents significant challenges for building models capable of understanding them. Yet, recent studies seem to suggest that Large Vision-Language Models (LVLMs) can even tackle complex reasoning tasks involving diagrams. In this paper, we investigate this phenomenon by developing a comprehensive test suite to evaluate the diagram comprehension capability of LVLMs. Our test suite uses a variety of questions focused on concept entities and their relationships over a set of synthetic as well as real diagrams across several domains to evaluate the recognition and reasoning abilities of models. Our evaluation of six LVLMs shows that while these models can accurately identify and reason about entities, their ability to understand relationships is notably limited. Further testing reveals that the decent performance on diagram understanding largely stems from leveraging their background knowledge as shortcuts to identify and reason about the relational information. Thus, we conclude that LVLMs have a limited capability for genuine diagram understanding, and their impressive performance in diagram reasoning is an illusion emanating from other confounding factors, such as the background knowledge in the models.
Paperid:1514
Authors:Ruixiang Zhang · Shuangfei Zhai · Yizhe Zhang · James Thornton · Zijing Ou · Joshua M Susskind · Navdeep Jaitly
Abstract:
Discrete diffusion is a promising framework for modeling and generating discrete data. In this work, we present Target Concrete Score Matching (TCSM), a novel and versatile objective for training and fine-tuning discrete diffusion models. TCSM provides a general framework with broad applicability. It supports pre-training discrete diffusion models directly from data samples, and many existing discrete diffusion approaches naturally emerge as special cases of our more general TCSM framework. Furthermore, the same TCSM objective extends to post-training of discrete diffusion models, including fine-tuning using reward functions or preference data, and distillation of knowledge from pre-trained autoregressive models. These new capabilities stem from the core idea of TCSM, estimating the concrete score of the target distribution, which resides in the original (clean) data space. This allows seamless integration with reward functions and pre-trained models, which inherently only operate in the clean data space rather than the noisy intermediate spaces of diffusion processes. Our experiments on language modeling tasks demonstrate that TCSM matches or surpasses current methods. Additionally, TCSM is versatile, applicable to both pre-training and post-training scenarios, offering greater flexibility and sample efficiency.
Paperid:1515
Authors:Jingyue Gao · Runji Lin · Keming Lu · Bowen Yu · Junyang Lin · Jianyu Chen
Abstract:
Large Language Models (LLMs) exhibit strong potential in mathematical reasoning, yet their effectiveness is often limited by a shortage of high-quality queries. This limitation necessitates scaling up computational responses through self-generated data, yet current methods struggle due to spuriously correlated data caused by ineffective exploration across all reasoning stages. To address this challenge, we introduce MARGE (Improving Math Reasoning with Guided Exploration), a novel method that enhances mathematical reasoning through hit-guided exploration. MARGE systematically explores intermediate reasoning states derived from self-generated solutions, enabling adequate exploration and improved credit assignment throughout the reasoning process. Notably, MARGE improves both single-shot accuracy and exploration diversity, mitigating a common trade-off in alignment methods. These results demonstrate MARGE's effectiveness in enhancing mathematical reasoning capabilities and unlocking the potential of scaling self-generated training data.
Paperid:1516
Authors:Heman Shakeri · Justin Lee · Behnaz Moradi-Jamei
Abstract:
Modeling the evolution of high-dimensional systems from limited snapshot observations at irregular time points poses a significant challenge in quantitative biology and related fields. Traditional approaches often rely on dimensionality reduction techniques, which can oversimplify the dynamics and fail to capture critical transient behaviors in non-equilibrium systems. We present Multi-Marginal Stochastic Flow Matching (MMSFM), a novel extension of simulation-free score and flow matching methods to the multi-marginal setting, enabling the alignment of high-dimensional data measured at non-equidistant time points without reducing dimensionality. The use of measure-valued splines enhances robustness to irregular snapshot timing, and score matching prevents overfitting in high-dimensional spaces. We validate our framework on several synthetic and benchmark datasets and apply it to single-cell perturbation data from melanoma cell lines and gene expression data collected at uneven time points.
Paperid:1517
Authors:Tim Kim · Thomas Luo · Tankut Can · Kamesh Krishnamurthy · Jonathan Pillow · Carlos Brody
Abstract:
Computations involved in processes such as decision-making, working memory, and motor control are thought to emerge from the dynamics governing the collective activity of neurons in large populations. But the estimation of these dynamics remains a significant challenge. Here we introduce Flow-field Inference from Neural Data using deep Recurrent networks (FINDR), an unsupervised deep learning method that can infer low-dimensional nonlinear stochastic dynamics underlying neural population activity. Using population spike train data from frontal brain regions of rats performing an auditory decision-making task, we demonstrate that FINDR outperforms existing methods in capturing the heterogeneous responses of individual neurons. We further show that FINDR can discover interpretable low-dimensional dynamics when it is trained to disentangle task-relevant and irrelevant components of the neural population activity. Importantly, the low-dimensional nature of the learned dynamics allows for explicit visualization of flow fields and attractor structures. We suggest FINDR as a powerful method for revealing the low-dimensional task-relevant dynamics of neural populations and their associated computations.
Paperid:1518
Authors:Xianliang Xu · Ye Li · Zhongyi Huang
Abstract:
In this paper, we derive refined generalization bounds for the Deep Ritz Method (DRM) and Physics-Informed Neural Networks (PINNs). For the DRM, we focus on two prototype elliptic partial differential equations (PDEs): Poisson equation and static Schrödinger equation on the $d$-dimensional unit hypercube with the Neumann boundary condition. Furthermore, sharper generalization bounds are derived based on the localization techniques under the assumptions that the exact solutions of the PDEs lie in the Barron spaces or the general Sobolev spaces. For the PINNs, we investigate the general linear second order elliptic PDEs with Dirichlet boundary condition using the local Rademacher complexity in the multi-task learning setting. Finally, we discuss the generalization error in the setting of over-parameterization when solutions of PDEs belong to Barron space.
Paperid:1519
Authors:Guy Kornowski · Daogao Liu · Kunal Talwar
Abstract:
We study differentially private (DP) optimization algorithms for stochastic and empirical objectives which are neither smooth nor convex, and propose methods that return a Goldstein-stationary point with sample complexity bounds that improve on existing works. We start by providing a single-pass $(\epsilon,\delta)$-DP algorithm that returns an $(\alpha,\beta)$-stationary point as long as the dataset is of size $\widetilde{\Omega}\left(1/\alpha\beta^{3}+d/\epsilon\alpha\beta^{2}+d^{3/4}/\epsilon^{1/2}\alpha\beta^{5/2}\right)$, which is $\Omega(\sqrt{d})$ times smaller than required by the algorithm of Zhang et al. (2024) for this task, where $d$ is the dimension. We then provide a multi-pass polynomial time algorithm which further improves the sample complexity to $\widetilde{\Omega}\left(d/\beta^2+d^{3/4}/\epsilon\alpha^{1/2}\beta^{3/2}\right)$, by designing a sample efficient ERM algorithm, and proving that Goldstein-stationary points generalize from the empirical loss to the population loss.
Paperid:1520
Authors:Yuankai Luo · Lei Shi · Xiao-Ming Wu
Abstract:
Message-passing Graph Neural Networks (GNNs) are often criticized for their limited expressiveness, issues like over-smoothing and over-squashing, and challenges in capturing long-range dependencies, while Graph Transformers (GTs) are considered superior due to their global attention mechanisms. Literature frequently suggests that GTs outperform GNNs, particularly in graph-level tasks such as graph classification and regression. In this study, we explore the untapped potential of GNNs through an enhanced framework, GNN+, which integrates six widely used techniques: edge feature integration, normalization, dropout, residual connections, feed-forward networks, and positional encoding, to effectively tackle graph-level tasks. We conduct a systematic evaluation of three classic GNNs—GCN, GIN, and GatedGCN—enhanced by the GNN+ framework across 14 well-known graph-level datasets. Our results show that, contrary to the prevailing belief, classic GNNs excel in graph-level tasks, securing top three rankings across all datasets and achieving first place in eight, while also demonstrating greater efficiency than GTs. This highlights the potential of simple GNN architectures, challenging the belief that complex mechanisms in GTs are essential for superior graph-level performance. We provide our code in the supplementary materials for reproducibility.
Paperid:1521
Authors:Nay Myat Min · Long H. Pham · Yige Li · Jun Sun
Abstract:
Large Language Models (LLMs) are vulnerable to backdoor attacks that manipulate outputs via hidden triggers. Existing defense methods—designed for vision/text classification tasks—fail for text generation. We propose Internal Consistency Regularization (CROW), a defense leveraging the observation that backdoored models exhibit unstable layer-wise hidden representations when triggered, while clean models show smooth transitions. CROW enforces consistency across layers via adversarial perturbations and regularization during fine-tuning, neutralizing backdoors without requiring clean reference models or trigger knowledge—only a small clean dataset. Experiments across Llama-2 (7B, 13B), CodeLlama (7B, 13B), and Mistral-7B demonstrate CROW’s effectiveness: it achieves significant reductions in attack success rates across diverse backdoor strategies (sentiment steering, targeted refusal, code injection) while preserving generative performance. CROW’s architecture-agnostic design enables practical deployment.
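The layer-wise consistency idea can be sketched as a penalty on abrupt transitions between consecutive layers' hidden states. A minimal NumPy version using cosine similarity (the exact similarity measure and weighting in CROW are assumptions here):

```python
import numpy as np

def layer_consistency_loss(hidden_states, eps=1e-8):
    """Mean (1 - cosine similarity) over consecutive layers' hidden
    representations. Smooth layer-to-layer transitions (clean behavior)
    give a loss near zero; the abrupt jumps characteristic of triggered
    backdoored models are penalized."""
    loss = 0.0
    for h1, h2 in zip(hidden_states, hidden_states[1:]):
        cos = np.dot(h1, h2) / (np.linalg.norm(h1) * np.linalg.norm(h2) + eps)
        loss += 1.0 - cos
    return loss / (len(hidden_states) - 1)
```

During fine-tuning, this term would be added to the task loss (and, per the abstract, evaluated under adversarial perturbations) so that the model cannot maintain the unstable internal trajectory a trigger relies on.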
Paperid:1522
Authors:Hee Suk Yoon · Eunseop Yoon · Mark Hasegawa-Johnson · Sungwoong Kim · Chang Yoo
Abstract:
Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO), have emerged as alternatives to Reinforcement Learning from Human Feedback (RLHF) for preference learning in Large Language Models (LLMs). DAAs eliminate the need for a separate reward model by using the joint probability of sentences as an implicit reward, optimizing the margin between preferred and unpreferred responses. However, these methods uniformly adjust token probabilities, regardless of their relevance to preference, leading to suboptimal performances. Recent token-level approaches attempt to address this issue but rely on trained credit-assignment models or AI annotators, raising concerns about quality, reliability, and computational overhead. In this paper, we propose ConfPO, which identifies preference-critical tokens based on the training policy’s confidence, thus requiring no additional models or compute for token selection. By focusing on a smaller set of high-impact tokens, ConfPO not only enhances alignment outcomes but also mitigates overoptimization (i.e., reward hacking) by more efficiently using the KL divergence budget. Experimental results on prominent alignment benchmarks—such as AlpacaEval 2 and Arena-Hard—across various LLMs show that ConfPO consistently outperforms DAAs that uniformly optimize over all tokens for preference learning with no computational overhead.
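One way to picture confidence-based token selection is to flag the tokens where the policy assigns the lowest log-probability. The specific rule below (lowest log-prob, fixed fraction) is an illustrative assumption about what "based on the training policy's confidence" means, not ConfPO's published criterion:

```python
def confident_token_mask(token_logprobs, frac=0.5):
    """Mark the frac least-confident tokens (lowest log-probability under
    the training policy) as preference-critical; the DAA loss would then
    be applied only to masked positions."""
    k = max(1, int(len(token_logprobs) * frac))
    ranked = sorted(range(len(token_logprobs)), key=lambda i: token_logprobs[i])
    critical = set(ranked[:k])
    return [i in critical for i in range(len(token_logprobs))]
```

Because the log-probabilities are already computed in the DAA forward pass, such a selection adds no extra models or compute, matching the abstract's claim.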
Paperid:1523
Authors:Yunjuan Wang · Raman Arora
Abstract:
Despite the remarkable achievements of large foundation models across various tasks, they remain vulnerable to security threats, including backdoor attacks. By injecting poisoned data containing backdoor triggers during training, adversaries can manipulate model predictions by exploiting these triggers. While existing research focuses on designing effective backdoor attacks and evaluating them empirically, it often lacks a rigorous theoretical understanding of when and why such attacks succeed. In this work, we investigate backdoor attacks targeting the token selection process within attention mechanisms—a core component of transformer-based architectures. We prove that single-head self-attention transformers can interpolate poisoned training data through gradient descent. Furthermore, we demonstrate that when the injected poisoned training data contains sufficiently strong backdoor triggers, but is not overly dominant, attackers can successfully manipulate model predictions. Our analysis reveals the dynamics of how adversaries manipulate token selection to compromise model outputs and identifies the theoretical conditions enabling these attacks. We support our theoretical findings with an empirical study on synthetic data.
Paperid:1524
Authors:Xinpeng Dong · Min Zhang · Didi Zhu · Ye Jian · zhang keli · Aimin Zhou · Fei Wu · Kun Kuang
Abstract:
Pretrained vision-language models (VLMs) have revolutionized the field of machine learning, demonstrating exceptional performance across a wide range of tasks. However, their robustness remains vulnerable to the spurious-correlation problem. Existing works often involve fine-tuning the model with labeled data or relying on large language models (LLMs) to generate more complex prompts. Although effective to some extent, these methods introduce new challenges, including additional computational costs and dependence on the quality of prompts without fully utilizing the vision modality. To address these limitations, we propose a novel method named ERICT to Enhance model Robustness by Identifying Concept Tokens. ERICT mitigates spurious correlation directly in the inference stage and comprises two key steps: (1) Identify concept tokens capturing invariant features through auxiliary prompts to generate a token-level mask. (2) Apply the mask to the attention weights of the CLS token in the vision encoder to help the model focus on the relevant image region. Extensive experiments show that ERICT significantly improves the overall performance including that of the worst group, and achieves new state-of-the-art results.
Paperid:1525
Authors:Cansu Sancaktar · Christian Gumbsch · Andrii Zadaianchuk · Pavel Kolev · Georg Martius
Abstract:
Exploration is a cornerstone of reinforcement learning (RL). Intrinsic motivation attempts to decouple exploration from external, task-based rewards. However, established approaches to intrinsic motivation that follow general principles such as information gain often only uncover low-level interactions. In contrast, children's play suggests that they engage in meaningful high-level behavior by imitating or interacting with their caregivers. Recent work has focused on using foundation models to inject these semantic biases into exploration. However, these methods often rely on unrealistic assumptions, such as language-embedded environments or access to high-level actions. We propose SEmaNtically Sensible ExploratIon (SENSEI), a framework to equip model-based RL agents with an intrinsic motivation for semantically meaningful behavior. SENSEI distills a reward signal of interestingness from Vision Language Model (VLM) annotations, enabling an agent to predict these rewards through a world model. Using model-based RL, SENSEI trains an exploration policy that jointly maximizes semantic rewards and uncertainty. We show that in both robotic and video game-like simulations, SENSEI discovers a variety of meaningful behaviors from image observations and low-level actions. SENSEI provides a general tool for learning from foundation model feedback, a crucial research direction as VLMs become more powerful.
Paperid:1526
Authors:Amr Alkhatib · Roman Bresson · Henrik Boström · Michalis Vazirgiannis
Abstract:
Shapley values have several desirable, theoretically well-supported properties for explaining black-box model predictions. Traditionally, Shapley values are computed post-hoc, leading to additional computational cost at inference time. To overcome this, a novel method, called ViaSHAP, is proposed that learns a function to compute Shapley values, from which the predictions can be derived directly by summation. Two approaches to implementing the proposed method are explored: one based on the universal approximation theorem and the other on the Kolmogorov-Arnold representation theorem. Results from a large-scale empirical investigation are presented, showing that ViaSHAP using Kolmogorov-Arnold Networks performs on par with state-of-the-art algorithms for tabular data. It is also shown that the explanations of ViaSHAP are significantly more accurate than those of the popular approximator FastSHAP on both tabular data and images.
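The core contract of the approach, that the prediction is derived directly by summing the learned Shapley-style values, can be shown in a few lines. The linear per-feature map below is a hypothetical stand-in for the paper's KAN/MLP implementations; only the predict-by-summation structure reflects the abstract.

```python
import random

class ViaSHAPSketch:
    """Toy version of the ViaSHAP idea: a learned function outputs one
    Shapley-style value per input feature, and the model's prediction is
    simply their sum plus a bias. The linear per-feature map below is a
    stand-in for the paper's Kolmogorov-Arnold or MLP implementations."""

    def __init__(self, n_features, seed=0):
        rng = random.Random(seed)
        self.w = [rng.uniform(-1.0, 1.0) for _ in range(n_features)]
        self.bias = 0.0

    def shapley_values(self, x):
        # One contribution per feature; in the paper this is a deep network.
        return [wi * xi for wi, xi in zip(self.w, x)]

    def predict(self, x):
        # Prediction is derived directly by summation of the values, so
        # the explanation is exact for the model by construction.
        return self.bias + sum(self.shapley_values(x))
```

The appeal of this structure is that explanation and prediction are computed in one forward pass, removing the post-hoc Shapley estimation step.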
Paperid:1527
Authors:Bokun Wang · Tianbao Yang
Abstract:
This paper studies a class of convex Finite-sum Coupled Compositional Optimization (cFCCO) problems with applications including group distributionally robust optimization (GDRO) and learning with imbalanced data. To better address these problems, we introduce an efficient single-loop primal-dual block-coordinate stochastic algorithm called ALEXR. The algorithm employs block-coordinate stochastic mirror ascent with extrapolation for the dual variable and stochastic proximal gradient descent updates for the primal variable. We establish the convergence rates of ALEXR in both convex and strongly convex cases under smoothness and non-smoothness conditions of involved functions, which not only improve the best rates in previous works on smooth cFCCO problems but also expand the realm of cFCCO for solving more challenging non-smooth problems such as the dual form of GDRO. Finally, we derive lower complexity bounds, demonstrating the (near-)optimality of ALEXR within a broad class of stochastic algorithms for cFCCO. Experimental results on GDRO and partial Area Under the ROC Curve (pAUC) maximization demonstrate the promising performance of our algorithm.
Paperid:1528
Authors:Seng Pei Liew · Takuya Kato · Sho Takase
Abstract:
Pretraining large language models (LLMs) is resource-intensive, often requiring months of training time even with high-end GPU clusters. There are two approaches to mitigating such computational demands: reusing smaller models to train larger ones (upcycling), and training computationally efficient models like mixture-of-experts (MoE). In this paper, we study the upcycling of LLMs to MoE models, whose scaling behavior remains underexplored. Through extensive experiments, we identify empirical scaling laws that describe how performance depends on dataset size and model configuration. Particularly, we show that, while scaling these factors improves performance, there is a novel interaction term between the dense and upcycled training datasets that limits the efficiency of upcycling at large computational budgets. Based on these findings, we provide guidance on scaling upcycling, and establish conditions under which upcycling outperforms from-scratch training within budget constraints.
Paperid:1529
Authors:Yejiang Wang · Yuhai Zhao · Zhengkui Wang · Ling Li · Jiapu Wang · Fangting Li · Miaomiao Huang · Shirui Pan · Xingwei Wang
Abstract:
Node equivalence is common in graphs, encompassing automorphic equivalence (preserving adjacency under node permutations) and attribute equivalence (nodes with identical attributes). Despite their importance for learning node representations, these equivalences are largely ignored by existing graph models. To bridge this gap, we propose a GrAph self-supervised Learning framework with Equivalence (GALE) and analyze its connections to existing techniques. Specifically, we: 1) unify automorphic and attribute equivalence into a single equivalence class; 2) enforce the equivalence principle to make representations within the same class more similar while separating those across classes; 3) introduce approximate equivalence classes with linear time complexity to address the NP-hardness of exact automorphism detection and handle node-feature variation; 4) analyze existing graph encoders, noting limitations in message passing neural networks and graph transformers regarding equivalence constraints; 5) show that graph contrastive learning paradigms are a degenerate form of equivalence constraint; and 6) demonstrate that GALE achieves superior performance over SOTA baselines.
Paperid:1530
Authors:Dongmin Bang · Inyoung Sung · Yinhua Piao · Sangseon Lee · Sun Kim
Abstract:
The advent of generative AI now enables large-scale $\textit{de novo}$ design of molecules, but identifying viable drug candidates among them remains an open problem. Existing drug-likeness prediction methods often rely on ambiguous negative sets or purely structural features, limiting their ability to accurately classify drugs from non-drugs. In this work, we introduce BounDrE: a novel modeling of drug-likeness as a compact space surrounding approved drugs through a dynamic one-class boundary approach. Specifically, we enrich the chemical space through biomedical knowledge alignment, and then iteratively tighten the drug-like boundary by pushing non-drug-like compounds outside via an Expectation-Maximization (EM)-like process. Empirically, BounDrE achieves 10\% F1-score improvement over the previous state-of-the-art and demonstrates robust cross-dataset performance, including zero-shot toxic compound filtering. Additionally, we showcase its effectiveness through comprehensive case studies in large-scale $\textit{in silico}$ screening. Our codes and constructed benchmark data under various schemes are provided at: https://anonymous.4open.science/r/boundr_e.
Paperid:1531
Authors:Sang-gil Lee · Zhifeng Kong · ARUSHI GOEL · Sungwon Kim · Rafael Valle · Bryan Catanzaro
Abstract:
Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts. Despite this progress, the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks are not well understood. With the purpose of providing a holistic understanding of the design space of TTA models, we set up a large-scale empirical experiment focused on diffusion and flow matching models. Our contributions include: 1) AF-Synthetic, a large dataset of high-quality synthetic captions obtained from an audio understanding model; 2) a systematic comparison of different architectural, training, and inference design choices for TTA models; 3) an analysis of sampling methods and their Pareto curves with respect to generation quality and inference speed. We leverage the knowledge obtained from this extensive analysis to propose our best model dubbed Elucidated Text-To-Audio (ETTA). When evaluated on AudioCaps and MusicCaps, ETTA provides improvements over the baselines trained on publicly available data, while being competitive with models trained on proprietary data. Finally, we show ETTA's improved ability to generate creative audio following complex and imaginative captions, a task that is more challenging than current benchmarks.
Paperid:1532
Authors:Fanxu Meng · Pingzhi Tang · Fan Jiang · Muhan Zhang
Abstract:
To adapt a well-trained large model to downstream tasks, learning within its original latent space by leveraging linear combinations of its basis vectors ensures stable training without compromising the model's capabilities. However, constructing orthonormal bases from a matrix typically requires a transfer matrix, which significantly increases storage and computational overhead. In this paper, we propose a Cross-Layer approach to absorb $W_Q$ and $W_K$ into $W_{QK}$, and $W_V$ and $W_O$ into $W_{VO}$, where the ranks of $W_{QK}$ and $W_{VO}$ are bounded by the head dimension $d$. By performing singular value decomposition on $W_{QK}$ and $W_{VO}$ and removing singular vectors corresponding to zero singular values, we effectively orthonormalize $W_Q$, $W_K$, $W_V$, and $W_O$ without requiring transfer matrices. This operation reduces the encoder attention parameters of Whisper-large-v3 by 46.42\% without requiring additional training. To enable efficient fine-tuning, we perform head-in or cross-head QR decomposition on $W_Q$, $W_K$, $W_V$, and $W_O$, fine-tuning only the matrices $R_Q$, $R_K$, $R_V$, and $R_O$. This approach allows a full-rank update through linear combinations of all orthonormal vectors. We evaluate our method by fine-tuning LLaMA-2-7B on eight commonsense reasoning datasets. The results demonstrate that our approach outperforms LoRA by 5.4\% and DoRA by 3.7\%, highlighting the effectiveness of the proposed method, while forgetting less previous knowledge when learning new knowledge.
Paperid:1533
Authors:Tennessee Hickling · Dennis Prangle
Abstract:
Normalizing flows are a flexible class of probability distributions, expressed as transformations of a simple base distribution. A limitation of standard normalizing flows is representing distributions with heavy tails, which arise in applications to both density estimation and variational inference. A popular current solution to this problem is to use a heavy tailed base distribution. We argue this can lead to poor performance due to the difficulty of optimising neural networks, such as normalizing flows, under heavy tailed input. We propose an alternative, "tail transform flow" (TTF), which uses a Gaussian base distribution and a final transformation layer which can produce heavy tails. Experimental results show this approach outperforms current methods, especially when the target distribution has large dimension or tail weight.
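The TTF idea of keeping a Gaussian base and pushing heavy tails into the final layer can be sketched with a fixed scalar map. The Cauchy-quantile transform below is an assumption for illustration; the paper's transform is learned and parameterised by tail weight.

```python
import math

def gaussian_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tail_transform(z):
    """Illustrative final layer in the spirit of TTF: push a Gaussian
    base sample through a map whose output has heavy (here Cauchy)
    tails. The paper's transform is parameterised and learned; this
    fixed Cauchy quantile is only a stand-in."""
    u = gaussian_cdf(z)                   # z ~ N(0,1)  ->  u in (0,1)
    return math.tan(math.pi * (u - 0.5))  # inverse Cauchy CDF: heavy tails
```

The map is strictly increasing and odd, so, like any flow layer, it is invertible; a moderate Gaussian input such as z = 3 already lands far out in the Cauchy tail.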
Paperid:1534
Authors:Yihang Wang · Yuying Qiu · Peng Chen · Yang Shu · Zhongwen Rao · Lujia Pan · Bin Yang · Chenjuan Guo
Abstract:
Existing works on general time series forecasting build foundation models with heavy model parameters through large-scale multi-source pre-training. These models achieve superior generalization ability across various datasets at the cost of significant computational burdens and limitations in resource-constrained scenarios. This paper introduces LightGTS, a lightweight general time series forecasting model designed from the perspective of consistent periodical modeling. To handle diverse scales and intrinsic periods in multi-source pre-training, we introduce Periodical Tokenization, which extracts consistent periodic patterns across different datasets with varying scales. To better utilize the periodicity in the decoding process, we further introduce Periodical Parallel Decoding, which leverages historical tokens to improve forecasting. Based on the two techniques above, which fully leverage the inductive bias of periods inherent in time series, LightGTS uses a lightweight model to achieve outstanding performance on general time series forecasting. It achieves state-of-the-art forecasting performance on 9 real-world benchmarks in both zero-shot and full-shot settings, with much better efficiency compared with existing time series foundation models.
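The two-step idea behind Periodical Tokenization, estimate a series' period and then patch the series into period-aligned tokens, can be sketched as follows. The autocorrelation-based detector and the non-overlapping patching are illustrative assumptions, not the paper's exact procedure.

```python
def detect_period(series, max_lag=None):
    """Estimate the dominant period as the autocorrelation-maximising lag.
    A simple stand-in: the abstract does not specify how LightGTS's
    Periodical Tokenization actually estimates periodicity."""
    n = len(series)
    max_lag = max_lag or n // 2
    mean = sum(series) / n
    c = [x - mean for x in series]  # centre the series

    def acf(lag):
        return sum(c[i] * c[i + lag] for i in range(n - lag)) / (n - lag)

    return max(range(2, max_lag + 1), key=acf)

def periodical_tokenize(series, period):
    """Cut the series into period-aligned, non-overlapping patches
    ("tokens"); any incomplete trailing patch is dropped."""
    return [series[i:i + period]
            for i in range(0, len(series) - period + 1, period)]
```

Aligning patch length to the detected period is what makes token sequences from differently sampled datasets comparable, which is the property the abstract attributes to this tokenization.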
Paperid:1535
Authors:Bo Li · Fangxiao WANG · Xing Shiji
Abstract:
We study the fair scheduling of jobs among groups of (unrelated) machines and focus on the maximin share (MMS) fairness at the group level. The problem was first introduced by Li et al. [NeurIPS 2023], where each group consists of a number of identical machines (or identical up to different speeds), and the cost of a group is determined by the minimum makespan on completing all jobs assigned to it. The case where the machines within each group are unrelated was left as an open problem. In this paper, we first resolve this problem and design a polynomial-time algorithm that computes a 2-approximate MMS allocation via linear programming techniques. We complement this result with a hard instance, showing that no algorithm can be better than $(2-\frac{1}{n})$-approximate MMS, where $n$ is the number of machines. Thus the approximation ratio 2 is asymptotically tight. When the groups consist of identical machines, we improve the approximation ratio to $\frac{4}{3}$.
Paperid:1536
Authors:Vignesh Gopakumar · Ander Gray · Lorenzo Zanisi · Timothy Nunn · Stanislas Pamela · Daniel Giles · Matt Kusner · Marc Deisenroth
Abstract:
Neural PDEs offer efficient alternatives to computationally expensive numerical PDE solvers for simulating complex physical systems. However, their lack of robust uncertainty quantification (UQ) limits deployment in critical applications. We introduce a model-agnostic, physics-informed conformal prediction (CP) framework that provides guaranteed uncertainty estimates without requiring labelled data. By utilising a physics-based approach, we are able to quantify and calibrate the model's inconsistencies with the PDE rather than the uncertainty arising from the data. Our approach uses convolutional layers as finite-difference stencils and leverages physics residual errors as nonconformity scores, enabling data-free UQ with marginal and joint coverage guarantees across prediction domains for a range of complex PDEs. We further validate the efficacy of our method on neural PDE models for plasma modelling and shot design in fusion reactors.
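The data-free calibration recipe, physics residuals as nonconformity scores combined with a standard conformal quantile, can be sketched in a few lines. The 1-D Poisson residual and grid details below are illustrative assumptions, not the paper's setup; the finite differences play the role of its convolutional stencils.

```python
import math

def conformal_quantile(scores, alpha=0.1):
    """Finite-sample-valid (1 - alpha) quantile of nonconformity scores,
    using the standard ceil((n + 1)(1 - alpha)) conformal rank."""
    s = sorted(scores)
    n = len(s)
    rank = math.ceil((n + 1) * (1 - alpha))
    return s[min(rank, n) - 1]

def physics_residual_scores(u_pred, dx, forcing):
    """Toy 1-D Poisson residual |u'' - f| as a data-free nonconformity
    score: it measures the surrogate's inconsistency with the PDE, not
    its distance to labels, so no ground-truth solutions are needed."""
    return [abs((u_pred[i - 1] - 2 * u_pred[i] + u_pred[i + 1]) / dx**2
                - forcing[i])
            for i in range(1, len(u_pred) - 1)]
```

A prediction band of half-width `conformal_quantile(scores)` around the residual then covers unseen physics errors at the nominal rate, under the usual exchangeability assumption of conformal prediction.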
Paperid:1537
Authors:Juan Correa · Elias Bareinboim
Abstract:
Graphical models have been widely used as parsimonious encoders of constraints of the underlying probability models. When organized in a structured way, these models can facilitate the derivation of nontrivial constraints, the inference of quantities of interest, and the optimization of their estimands. In particular, causal diagrams allow for the efficient representation of structural constraints of the underlying causal system. In this paper, we introduce an efficient graphical construction called Ancestral Multi-world Networks that is sound and complete for reading counterfactual independences from a causal diagram using d-separation. Moreover, we introduce the counterfactual (ctf-) calculus, which can be used to transform counterfactual quantities using three rules licensed by the constraints encoded in the diagram. This result generalizes Pearl’s celebrated do-calculus from interventional to counterfactual reasoning.
Paperid:1538
Authors:Hao Jiang · Qi Liu · Rui Li · Shengyu Ye · Shijin Wang
Abstract:
Large language models have been successfully applied to programming assistance tasks, such as code completion, code insertion, and instructional code editing. However, these applications remain insufficiently automated and struggle to effectively integrate various types of information during the programming process, including coding history, code context, and user instructions. In this work, we propose a new framework that comprehensively integrates these information sources, collect data to train our models, and evaluate their performance. First, to thoroughly evaluate how well models align with different types of information and the quality of their outputs, we introduce a new benchmark, APEval (Assist Programming Eval), to comprehensively assess the performance of models in programming assistance tasks. Then, for data collection, we develop a data generation pipeline, Programming-Instruct, which synthesizes training data from diverse sources, such as GitHub and online judge platforms. This pipeline can automatically generate various types of messages throughout the programming process. Finally, using this pipeline, we generate 219K samples, fine-tune multiple models, and develop the CursorCore series. We show that CursorCore outperforms other models of comparable size. This framework unifies applications such as inline chat and automated editing, and contributes to the advancement of coding assistants.
Paperid:1539
Authors:Victoria Lin · Louis-Philippe Morency · Eli Ben-Michael
Abstract:
As language technologies become widespread, it is important to understand how changes in language affect reader perceptions and behaviors. These relationships may be formalized as the isolated causal effect of some focal language-encoded intervention (e.g., misinformation) on an external outcome (e.g., readers' beliefs). In this paper, we introduce a formal estimation framework for isolated causal effects of language. We show that a core challenge of estimating isolated effects is the need to approximate all non-focal language outside of the intervention. Drawing on the principle of omitted variable bias, we provide measures for evaluating the quality of both non-focal language approximations and isolated effect estimates themselves. We find that poor approximation of non-focal language can lead to bias in the corresponding isolated effect estimates due to omission of relevant variables, and we show how to assess the sensitivity of effect estimates to such bias along the two key axes of fidelity and overlap. In experiments on semi-synthetic and real-world data, we validate the ability of our framework to correctly recover isolated effects and demonstrate the utility of our proposed measures.
Paperid:1540
Authors:Kushagra Pandey · Farrin Marouf Sofian · Felix Draxler · Theofanis Karaletsos · Stephan Mandt
Abstract:
Diffusion models exhibit excellent sample quality, but existing guidance methods often require additional model training or are limited to specific tasks. We revisit guidance in diffusion models from the perspective of variational inference and control, introducing Diffusion Trajectory Matching (DTM) that enables guiding pretrained diffusion trajectories to satisfy a terminal cost. DTM unifies a broad class of guidance methods and enables novel instantiations. We introduce a new method within this framework that achieves state-of-the-art results on various image inverse problems without requiring additional model training or modifications. For instance, in ImageNet non-linear deblurring, our model achieves an FID score of 34.31, significantly improving over the best pretrained-method baseline (FID 78.07).
Paperid:1541
Authors:Hai-Long Sun · Da-Wei Zhou · Yang Li · Shiyin Lu · Chao Yi · Qing-Guo Chen · Zhao Xu · Weihua Luo · Kaifu Zhang · De-Chuan Zhan · Han-Jia Ye
Abstract:
The rapid development of Multimodal Large Language Models (MLLMs), such as GPT-4, marks a significant step toward artificial general intelligence. Existing methods typically align vision encoders with LLMs via supervised fine-tuning (SFT), but this often deteriorates their ability to handle multiple languages as training progresses. We empirically observe that imbalanced SFT datasets, largely English-centric, degrade performance on non-English languages due to the failure in multilingual token alignment. To address this, we propose PARROT, a novel approach that leverages textual guidance for visual token alignment at the language level. PARROT conditions visual tokens on diverse language inputs and uses Mixture-of-Experts (MoE) to align multilingual tokens. By computing cross-attention between initial visual features and textual embeddings, we select the most relevant experts, converting visual tokens into language-specific representations. Additionally, we introduce the Massive Multilingual Multimodal Benchmark (MMMB), a new benchmark comprising 6 languages, 15 categories, and 12,000 questions, to assess multilingual capabilities. Our method achieves state-of-the-art performance on both the multilingual MMBench and MMMB, as well as across a wide range of multimodal tasks.
Paperid:1542
Authors:Yuhui Wang · Qingyuan Wu · Dylan Ashley · Francesco Faccio · Weida Li · Chao Huang · Jürgen Schmidhuber
Abstract:
The Value Iteration Network (VIN) is an end-to-end differentiable neural network architecture for planning. It exhibits strong generalization to unseen domains by incorporating a differentiable planning module that operates on a latent Markov Decision Process (MDP). However, VINs struggle to scale to long-term and large-scale planning tasks, such as navigating a $100\times 100$ maze, a task that typically requires thousands of planning steps to solve. We observe that this deficiency is due to two issues: the representation capacity of the latent MDP and the planning module's depth. We address these by augmenting the latent MDP with a dynamic transition kernel, dramatically improving its representational capacity, and, to mitigate the vanishing gradient problem, introduce an "adaptive highway loss" that constructs skip connections to improve gradient flow. We evaluate our method on 2D/3D maze navigation environments, continuous control, and the real-world Lunar rover navigation task. We find that our new method, named Dynamic Transition VIN (DT-VIN), scales to 5000 layers and solves challenging versions of the above tasks. Altogether, we believe that DT-VIN represents a concrete step forward in performing long-term large-scale planning in complex environments.
Paperid:1543
Authors:Yizhuo Chen · Chun-Fu (Richard) Chen · Hsiang Hsu · Shaohan Hu · Tarek Abdelzaher
Abstract:
The growing Machine Learning (ML) services require extensive collections of user data, which may inadvertently include people's private information irrelevant to the services. Various methods have been proposed to protect private attributes by removing them from the data while maintaining the utility of the data for downstream tasks. Nevertheless, as we show theoretically and empirically in this paper, these methods exhibit severe vulnerability because of a common weakness rooted in their adversarial-training-based strategies. To overcome this limitation, we propose a novel approach, PASS, designed to stochastically substitute the original sample with another one according to certain probabilities, which is trained with a novel loss function soundly derived from an information-theoretic objective defined for utility-preserving private attribute protection. A comprehensive evaluation of PASS on various datasets of different modalities, including facial images, human activity sensory signals, and voice recordings, substantiates PASS's effectiveness and generalizability.
Paperid:1544
Authors:Xinyi Wu · Yifei Wang · Stefanie Jegelka · Ali Jadbabaie
Abstract:
Recent studies have revealed various manifestations of position bias in transformer architectures, from the "lost-in-the-middle" phenomenon to attention sinks, yet a comprehensive theoretical understanding of how attention masks and positional encodings shape these biases remains elusive. This paper introduces a novel graph-theoretic framework to analyze position bias in multi-layer attention. Modeling attention masks as directed graphs, we quantify how tokens interact with contextual information based on their sequential positions. We uncover two key insights: First, causal masking inherently biases attention toward earlier positions, as tokens in deeper layers attend to increasingly more contextualized representations of earlier tokens. Second, we characterize the competing effects of the causal mask and relative positional encodings, such as the decay mask and rotary positional encoding (RoPE): while both mechanisms introduce distance-based decay within individual attention maps, their aggregate effect across multiple attention layers, coupled with the causal mask, leads to a trade-off between the long-term decay effects and the cumulative importance of early sequence positions. Through controlled numerical experiments, we not only validate our theoretical findings but also reproduce position biases observed in real-world LLMs. Our framework offers a principled foundation for understanding positional biases in transformers, shedding light on the complex interplay of attention mechanism components and guiding more informed architectural design.
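The graph view of causal masking can be made concrete with a toy computation: counting, per source position, the directed attention paths that reach the last token across layers. This is an illustrative quantity in the spirit of the abstract's framework, not the paper's actual bias measure; it shows one sense in which earlier positions accumulate strictly more influence.

```python
def attention_path_counts(seq_len, n_layers):
    """Number of directed paths from each input position j to the final
    token after n_layers of causally masked attention, where information
    flows from position j to every position i >= j at each layer.
    Equivalent to one row of A^L for the lower-triangular all-ones
    adjacency matrix A, computed here with running suffix sums."""
    row = [0] * seq_len
    row[-1] = 1                       # start from the last token, top layer
    for _ in range(n_layers):
        suffix, new = 0, [0] * seq_len
        for j in range(seq_len - 1, -1, -1):
            suffix += row[j]          # sum over receivers i >= j
            new[j] = suffix
        row = new
    return row                        # row[j] = paths from input position j
```

For example, `attention_path_counts(4, 2)` gives `[4, 3, 2, 1]`: after two causally masked layers, the first token reaches the last one along four times as many paths as the last token reaches itself.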
Paperid:1545
Authors:Lukas Braun · Erin Grant · Andrew Saxe
Abstract:
A foundational principle of connectionism is that perception, action, and cognition emerge from parallel computations among simple, interconnected units that generate and rely on neural representations. Accordingly, researchers employ multivariate pattern analysis to decode and compare the neural codes of artificial and biological networks, aiming to uncover these networks' underlying functions. However, there is limited analytical understanding of the relationship between representation and function in connectionist systems. Here, we analytically study the solution manifold of two-layer linear networks, demonstrating that function and representation are dissociated. Numerical simulations further show that this dissociation persists in nonlinear networks. We prove that among networks implementing the same function, those that are robust to noise and support generalisation and transfer employ task-specific neural representations. Based on these findings, we hypothesize that representational alignment emerges from computational advantages rather than functional alignment alone, with significant implications for interpreting and comparing the representations of connectionist systems.
Paperid:1546
Authors:Lexiang Hu · Yisen Wang · Zhouchen Lin
Abstract:
Kolmogorov-Arnold Networks (KANs) have seen great success in scientific domains thanks to spline activation functions, becoming an alternative to Multi-Layer Perceptrons (MLPs). However, spline functions may not respect symmetry in tasks, which is crucial prior knowledge in machine learning. In this paper, we propose Equivariant Kolmogorov-Arnold Networks (EKAN), a method for incorporating arbitrary matrix group equivariance into KANs, aiming to broaden their applicability to more fields. We first construct gated spline basis functions, which form the EKAN layer together with equivariant linear weights, and then define a lift layer to align the input space of EKAN with the feature space of the dataset, thereby building the entire EKAN architecture. Compared with baseline models, EKAN achieves higher accuracy with smaller datasets or fewer parameters on symmetry-related tasks, such as particle scattering and the three-body problem, often reducing test MSE by several orders of magnitude. Even in non-symbolic formula scenarios, such as top quark tagging with three jet constituents, EKAN achieves comparable results with state-of-the-art equivariant architectures using fewer than $40\%$ of the parameters, while KANs do not outperform MLPs as expected.
Paperid:1547
Authors:Lian Liu · Xiandong Zhao · Guanchen Li · Dong Li · Wang · Yinhe Han · Xiaowei Li · ying wang
Abstract:
One-shot post-training pruning enhances the deployment of billion-scale large language models (LLMs), with the pruning metric playing a pivotal role in determining which weights to remove. However, existing metrics underperform due to their reliance on a simple symbolic combination of weights and activations, overlooking imbalanced weight magnitudes and the disproportionate influence of activation outliers. To overcome these limitations, we introduce \textbf{BaWA}, a novel pruning metric that systematically \underline{Ba}lances \underline{W}eight and \underline{A}ctivation distributions for more effective pruning. BaWA introduces two key innovations: \textbf{magnitude normalization}, which mitigates weight imbalance across channels for fairer pruning decisions, and \textbf{outlier regularization}, which reduces the impact of activation outliers, ensuring more appropriate channel prioritization. To further enhance its effectiveness, BaWA incorporates an efficient and automatic framework for optimizing normalization and regularization hyperparameters. Extensive experiments validate BaWA as a state-of-the-art (SOTA) pruning metric. For instance, applying BaWA to induce 2:4 sparsity in Mistral-7B reduces perplexity in language comprehension by 2.49 and improves average downstream task accuracy by 3.08\%, outperforming the previous SOTA method Wanda.
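The two ingredients named above can be sketched on top of a Wanda-style score (|weight| times input-activation norm). Both the per-channel mean normalisation and the median-based clipping below are guesses at the general shape of the idea, not BaWA's actual formulas or its hyperparameter-search framework.

```python
def bawa_style_scores(W, act_norms, outlier_cap=2.0):
    """Sketch of a BaWA-flavoured pruning metric. Starting from the
    Wanda-style score |W_ij| * ||x_j||, it (1) normalises weight
    magnitudes per output channel and (2) clips activation norms at a
    multiple of their median to damp outliers. Both steps here are
    illustrative assumptions; weights with the lowest scores would be
    pruned."""
    med = sorted(act_norms)[len(act_norms) // 2]
    # Outlier regularisation: cap activation norms relative to the median.
    capped = [min(a, outlier_cap * med) for a in act_norms]
    scores = []
    for row in W:
        mean_mag = sum(abs(w) for w in row) / len(row) or 1.0
        # Magnitude normalisation: compare weights within a channel fairly.
        scores.append([abs(w) / mean_mag * a for w, a in zip(row, capped)])
    return scores
```

Without the cap, a single outlier input channel would dominate every score in its column; without the normalisation, channels with systematically large weights would crowd out the rest, which are the two failure modes the abstract attributes to simpler metrics.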
Paperid:1548
Authors:Haozhe Ma · Fangling Li · Jing Lim · Zhengding Luo · Thanh Vinh Vo · Tze-Yun Leong
Abstract:
Existing reward shaping techniques for sparse-reward reinforcement learning generally fall into two categories: novelty-based exploration bonuses and significance-based hidden state values. The former promotes exploration but can lead to distraction from task objectives, while the latter facilitates stable convergence but often lacks sufficient early exploration. To address these limitations, we propose Dual Random Networks Distillation (DuRND), a novel reward shaping framework that efficiently balances exploration and exploitation in a unified mechanism. DuRND leverages two lightweight random network modules to simultaneously compute two complementary rewards: a novelty reward to encourage directed exploration and a contribution reward to assess progress toward task completion. With low computational overhead, DuRND excels in high-dimensional environments with challenging sparse rewards, such as Atari, VizDoom, and MiniWorld, outperforming several benchmarks.
Paperid:1549
Authors:Yana Wei · Liang Zhao · Kangheng Lin · En Yu · Yuang Peng · Runpei Dong · Jianjian Sun · Haoran Wei · Zheng Ge · Xiangyu Zhang · Vishal Patel
Abstract:
We present a perception in reflection paradigm designed to transcend the limitations of current large vision-language models (LVLMs), which are expected yet often fail to achieve perfect perception initially. Specifically, we propose Reflective Perception (RePer), a dual-model reflection mechanism that systematically alternates between policy and critic models, enabling iterative refinement of visual perception. This framework is powered by Reflective Perceptual Learning (RPL), which reinforces intrinsic reflective capabilities through a methodically constructed visual reflection dataset and reflective unlikelihood training. Comprehensive experimental evaluation demonstrates RePer's quantifiable improvements in image understanding, captioning precision, and hallucination reduction. Notably, RePer achieves strong alignment between model attention patterns and human visual focus, while RPL optimizes fine-grained and free-form preference alignment. These advancements establish perception in reflection as a robust paradigm for future multimodal agents, particularly in tasks requiring complex reasoning and multi-step manipulation.
Paperid:1550
Authors:Anqi Mao · Mehryar Mohri · Yutao Zhong
Abstract:
In applications with significant class imbalance or asymmetric costs, metrics such as the $F_\beta$-measure, AM measure, Jaccard similarity coefficient, and weighted accuracy offer more suitable evaluation criteria than standard binary classification loss. However, optimizing these metrics presents significant computational and statistical challenges. Existing approaches often rely on the characterization of the Bayes-optimal classifier, and use threshold-based methods that first estimate class probabilities and then seek an optimal threshold. This leads to algorithms that are not tailored to restricted hypothesis sets and lack finite-sample performance guarantees. In this work, we introduce principled algorithms for optimizing generalized metrics, supported by $H$-consistency and finite-sample generalization bounds. Our approach reformulates metric optimization as a generalized cost-sensitive learning problem, enabling the design of novel surrogate loss functions with provable $H$-consistency guarantees. Leveraging this framework, we develop new algorithms, METRO (*Metric Optimization*), with strong theoretical performance guarantees. We report the results of experiments demonstrating the effectiveness of our methods compared to prior baselines.
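As a concrete reference point, the $F_\beta$ measure named above is the standard weighted harmonic mean of precision and recall (textbook formula, not code from the paper):

```python
def f_beta(tp: int, fp: int, fn: int, beta: float) -> float:
    """F_beta from confusion-matrix counts; recall is weighted beta times
    as heavily as precision. beta = 1 recovers the usual F1 score."""
    b2 = beta * beta
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)
```

Because such ratio-of-counts metrics are not optimized by a fixed 0.5 threshold on class probabilities, plug-in approaches need the Bayes-optimal threshold characterization the abstract refers to.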
Paperid:1551
Authors:Xiangzhe Xu · Zian Su · Jinyao Guo · Kaiyuan Zhang · Zhenting Wang · Xiangyu Zhang
Abstract:
Recent advances in code-specific large language models (LLMs) have greatly enhanced code generation and refinement capabilities. However, the safety of code LLMs remains under-explored, posing potential risks as insecure code generated by these models may introduce vulnerabilities into real-world systems. Previous work proposes to collect a security-focused instruction-tuning dataset from real-world vulnerabilities. However, this approach is constrained by the data sparsity of vulnerable code and has limited applicability in the iterative post-training workflows of modern LLMs. In this paper, we propose ProSec, a novel proactive security alignment approach designed to align code LLMs with secure coding practices. ProSec systematically exposes the vulnerabilities in a code LLM by synthesizing error-inducing coding scenarios from Common Weakness Enumerations (CWEs), and generates fixes to vulnerable code snippets, allowing the model to learn secure practices through advanced preference learning objectives. The scenarios synthesized by ProSec trigger 25 times more vulnerable code than a normal instruction-tuning dataset, resulting in a security-focused alignment dataset 7 times larger than the previous work. Experiments show that models trained with ProSec are 25.2\% to 91.4\% more secure compared to previous work without degrading models' utility.
Paperid:1552
Authors:Ju-Seung Byun · Andrew Perrault
Abstract:
Reinforcement learning (RL) training is inherently unstable due to factors such as moving targets and high gradient variance. Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) introduce additional challenges. For instance, diverse preferences complicate the alignment process, and prediction errors in a trained reward model can become more severe as the LLM generates unseen outputs. These RL challenges create confusion about whether the probability of an action for a given state should be increased or decreased, similar to the noise in labels for classification tasks. In this work, we focus on RL algorithms that share learning difficulties with cross-entropy loss, especially for low-probability predictions. To enhance stability, we adapt reverse cross-entropy (RCE) from supervised learning for noisy data, defining a symmetric RL loss. We demonstrate performance improvements across various tasks and scales. We conduct experiments in discrete action tasks (Atari games) and continuous action space tasks (MuJoCo benchmark and Box2D) using Symmetric A2C (SA2C) and Symmetric PPO (SPPO). Notably, SPPO shows strong performance across different hyperparameters. Furthermore, we validate the symmetric RL loss in the RLHF framework using PPO for natural language processing tasks such as IMDB positive sentiment and TL;DR summarization.
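A minimal sketch of the supervised ingredient, adapted from symmetric cross-entropy for noisy labels (hedged: the `alpha`, `beta`, and log-zero clip `A` values, and the advantage-weighted combination, are illustrative assumptions rather than the paper's exact SA2C/SPPO objectives):

```python
import numpy as np

def symmetric_policy_loss(logits, actions, advantages,
                          alpha=1.0, beta=0.1, A=-4.0):
    """Symmetric CE-style loss: forward CE plus reverse CE (RCE), where the
    one-hot 'label' is the taken action and log(0) is clipped to A < 0.
    RCE = -sum_k pi(k|s) * log(one_hot(a))_k = -A * (1 - pi(a|s))."""
    z = logits - logits.max(axis=1, keepdims=True)        # stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    p_a = probs[np.arange(len(actions)), actions]
    ce = -np.log(p_a + 1e-12)     # standard policy-gradient CE term
    rce = -A * (1.0 - p_a)        # reverse term: bounded, hence noise-robust
    return float((advantages * (alpha * ce + beta * rce)).mean())
```

The reverse term is bounded even when `p_a` is tiny, which is exactly the low-probability regime where plain cross-entropy gradients explode under label noise.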
Paperid:1553
Authors:Junhyuck Kim · Jong Ho Park · Jaewoong Cho · Dimitris Papailiopoulos
Abstract:
We introduce Lexico, a novel KV cache compression method that leverages sparse coding with a universal dictionary. Our key finding is that the key-value cache in modern LLMs can be accurately approximated using sparse linear combinations from a small, input-agnostic dictionary of ~4k atoms, enabling efficient compression across different input prompts, tasks, and models. Using orthogonal matching pursuit for sparse approximation, Lexico achieves flexible compression ratios through direct sparsity control. On GSM8K, across multiple model families (Mistral, Llama 3, Qwen2.5), Lexico maintains 90-95% of the original performance while using only 15-25% of the full KV-cache memory, outperforming both quantization and token eviction methods. Notably, Lexico remains effective in low memory regimes where 2-bit quantization fails, achieving up to 1.7x better compression on LongBench and GSM8K while maintaining high accuracy.
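The core primitive, orthogonal matching pursuit against a fixed dictionary, can be sketched as follows (a generic OMP over a random dictionary for illustration; Lexico's learned universal dictionary and per-layer details are not reproduced here):

```python
import numpy as np

def omp(D, x, s):
    """Greedy s-sparse approximation of x over the columns (atoms) of D."""
    residual = x.copy()
    support = []
    for _ in range(s):
        # pick the atom most correlated with what is still unexplained
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k not in support:
            support.append(k)
        # least-squares refit of all selected atoms (the "orthogonal" step)
        coefs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coefs
    z = np.zeros(D.shape[1])
    z[support] = coefs
    return z
```

Storing only `s` coefficient/index pairs per cached vector instead of the full vector is what yields the compression ratio, with `s` giving direct sparsity control.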
Paperid:1554
Authors:Jiarui Jin · Yuwei Wu · Xiaoting He · Haoxuan Li · Weinan Zhang · Yiming Yang · Yong Yu · Jun Wang · Mengyue Yang
Abstract:
In-context learning with large language models (LLMs) delivers strong few-shot performance by choosing few-shot demonstrations from the entire training dataset. However, previous few-shot in-context learning methods, which calculate similarity scores for choosing demonstrations, incur high computational costs by repeatedly retrieving large-scale datasets for each query. This is due to their failure to recognize that not all demonstrations are equally informative, and many less informative demonstrations can be inferred from a core set of highly informative ones. To this end, we propose FEEDER (FEw yet Essential Demonstration prE-selectoR), a novel \emph{pre-selection} framework that identifies a core subset of demonstrations containing the most informative examples. This subset, referred to as the FEEDER set, consists of demonstrations that capture both the ''sufficiency'' and ''necessity'' information needed to infer the entire dataset. Since FEEDER is selected before few-shot in-context learning, it enables more efficient demonstration selection from a smaller set. To identify FEEDER, we propose a novel and effective tree-based algorithm. Once selected, it can replace the original dataset, leading to improved efficiency and prediction accuracy in few-shot in-context learning. Additionally, FEEDER also benefits fine-tuning LLMs: we propose a bi-level optimization method that enables more efficient training without sacrificing performance as datasets become smaller. Our experiments, on 6 text classification datasets, 1 reasoning dataset, and 1 semantic-parsing dataset, across 6 LLMs (ranging from 335M to 7B parameters), demonstrate that: (i) In few-shot inference, FEEDER achieves superior (or comparable) performance while utilizing only half the input training data. (ii) In fine-tuning, FEEDER significantly boosts the performance of LLMs.
Paperid:1555
Authors:Xingcheng Zhou · Konstantinos Larintzakis · Hao Guo · Walter Zimmer · Mingyu Liu · Hu Cao · Jiajie Zhang · Venkatnarayanan Lakshminarasimhan · Leah Strand · Alois Knoll
Abstract:
We present TraffiX-VideoQA, a novel dataset and benchmark designed for spatio-temporal video understanding in complex roadside traffic scenarios. The dataset comprises 1,000 videos, featuring 85,000 multiple-choice QA pairs, 2,300 object captioning annotations, and 5,700 object grounding annotations, encompassing diverse real-world conditions such as adverse weather and traffic anomalies. By incorporating tuple-based spatio-temporal object expressions, TraffiX-VideoQA unifies three essential tasks—multiple-choice video question answering, referred object captioning, and spatio-temporal object grounding—within a cohesive evaluation framework. We further introduce the TraffiX-Qwen baseline model, enhanced with visual token sampling strategies, providing valuable insights into the challenges of fine-grained spatio-temporal reasoning. Extensive experiments demonstrate the dataset’s complexity, highlight the limitations of existing models, and position TraffiX-VideoQA as a robust foundation for advancing research in intelligent transportation systems. The dataset and benchmark are publicly available to facilitate further exploration.
Paperid:1556
Authors:Haoyang Li · Xin Wang · Xueling Zhu · Weigao Wen · Wenwu Zhu
Abstract:
Graph neural networks have been widely studied in both academia and industry. Most of the popular graph neural networks are developed with the in-distribution hypothesis and fail to generalize to out-of-distribution environments, where their performance can degrade significantly under distribution shifts. To tackle this problem, some graph invariant learning methods are proposed to learn invariant subgraphs against distribution shifts, which heavily rely on predefined or automatically generated environment labels. However, directly annotating or estimating such environment labels from biased graph data is usually impractical or even incorrect for real-world graph datasets. For example, when learning invariant subgraphs on an imbalanced graph dataset, GNNs could be biased towards variant patterns, resulting in poor out-of-distribution (OOD) generalization. In this paper, we propose to learn disentangled invariant subgraphs via self-supervised contrastive variant subgraph estimation for achieving satisfying OOD generalization. Specifically, we first propose a GNN-based invariant subgraph generator to disentangle the invariant and variant subgraphs for each input graph. Then, we propose to estimate the degree of the spurious correlations by conducting self-supervised contrastive learning on variant subgraphs. Thanks to the accurate identification and estimation of the variant subgraphs, we can capture invariant subgraphs accurately and further get rid of spurious correlations by adopting inverse propensity scores to reweight the invariant subgraph predictions. Extensive experiments show the superiority of our proposed method over several state-of-the-art baselines.
Paperid:1557
Authors:Kaifeng Gao · Jiaxin Shi · Hanwang Zhang · Chunping Wang · Jun Xiao · Long Chen
Abstract:
With the advance of diffusion models, today's video generation has achieved impressive quality. To extend the generation length and facilitate real-world applications, a majority of video diffusion models (VDMs) generate videos in an autoregressive manner, i.e., generating subsequent clips conditioned on the last frame(s) of the previous clip. However, existing autoregressive VDMs are highly inefficient and redundant: The model must re-compute all the conditional frames that are overlapped between adjacent clips. This issue is exacerbated when the conditional frames are extended autoregressively to provide the model with long-term context. In such cases, the computational demands increase significantly (i.e., with a quadratic complexity w.r.t. the autoregression step). In this paper, we propose Ca2-VDM, an efficient autoregressive VDM with Causal generation and Cache sharing. For causal generation, it introduces unidirectional feature computation, which ensures that the cache of conditional frames can be precomputed in previous autoregression steps and reused in every subsequent step, eliminating redundant computations. For cache sharing, it shares the cache across all denoising steps to avoid the huge cache storage cost. Extensive experiments demonstrate that our Ca2-VDM achieves state-of-the-art quantitative and qualitative video generation results and significantly improves the generation speed.
Paperid:1558
Authors:Sungwon Kim · Namkyeong Lee · Yunyoung Doh · Seungmin Shin · Guimok Cho · Seung-Won Jeon · Sangkook Kim · Chanyoung Park
Abstract:
Mesh graph-based 3D static analysis methods have recently emerged as efficient alternatives to traditional computational numerical solvers, significantly reducing computational costs and runtime for various physics-based analyses. However, these methods primarily focus on surface topology and geometry, often overlooking the inherent thickness of real-world 3D objects, which exhibits high correlations and similar behavior between opposing surfaces. This limitation arises from the disconnected nature of these surfaces and the absence of internal edge connections within the mesh. In this work, we propose a novel framework, the Thickness-aware E(3)-Equivariant Mesh Neural Network (T-EMNN), that effectively integrates the thickness of 3D objects while maintaining the computational efficiency of surface meshes. Additionally, we introduce data-driven coordinates that encode spatial information while preserving E(3)-equivariance and invariance properties, ensuring consistent and robust analysis. Evaluations on a real-world industrial dataset demonstrate the superior performance of T-EMNN in accurately predicting node-level 3D deformations, effectively capturing thickness effects while maintaining computational efficiency.
Paperid:1559
Authors:Hongsheng Li · Yu Qi · Renrui Zhang · Liuhui Wang · Dongzhi Jiang · Xinyan Chen · Claire Guo · Jianhan Jin · Yanwei Li · Bo Zhang · Chaoyou Fu · Peng Gao · Ziyu Guo · Shen Yan
Abstract:
Answering questions with Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), yet its impact on Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth investigation. In this paper, we introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs, spanning six domains: math, science, OCR, logic, space-time, and general scenes. As the first comprehensive study in this area, we propose a thorough evaluation suite incorporating three novel metrics that assess the reasoning quality, robustness, and efficiency at a fine-grained level. Leveraging curated high-quality data and a unique evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs, uncovering several key insights: 1) The models with reflection mechanism, such as QVQ, demonstrate a superior CoT quality, approaching GPT-4o; 2) CoT prompting often degrades LMM performance on perception-heavy tasks, suggesting a potentially harmful overthinking behavior; and 3) Although the CoT quality is high, LMMs with reflection exhibit significant inefficiency in both normal response and self-correction phases. We hope MME-CoT serves as a foundation for advancing multimodal reasoning in LMMs.
Paperid:1560
Authors:Le-Trung Nguyen · Aël Quélennec · Van-Tam Nguyen · Enzo Tartaglione
Abstract:
On-device learning has emerged as a promising direction for AI development, particularly because of its potential to reduce latency issues and mitigate privacy risks associated with device-server communication, while improving energy efficiency. Despite these advantages, significant memory and computational constraints still represent major challenges for its deployment. Drawing on previous studies on low-rank decomposition methods that address activation memory bottlenecks in backpropagation, we propose a novel shortcut approach as an alternative. Our analysis and experiments demonstrate that our method can reduce activation memory usage, even up to $120.09\times$ compared to vanilla training, while also reducing overall training FLOPs up to $1.86\times$ when evaluated on traditional benchmarks.
Paperid:1561
Authors:Federico Errica · Henrik Christiansen · Viktor Zaverkin · Takashi Maruyama · Mathias Niepert · Francesco Alesiani
Abstract:
Long-range interactions are essential for the correct description of complex systems in many scientific fields. The price to pay for including them in the calculations, however, is a dramatic increase in the overall computational costs. Recently, deep graph networks have been employed as efficient, data-driven models for predicting properties of complex systems represented as graphs. These models rely on a message passing strategy that should, in principle, capture long-range information without explicitly modeling the corresponding interactions. In practice, most deep graph networks cannot really model long-range dependencies due to the intrinsic limitations of (synchronous) message passing, namely oversmoothing, oversquashing, and underreaching. This work proposes a general framework that learns to mitigate these limitations: within a variational inference framework, we endow message passing architectures with the ability to adapt their depth and filter messages along the way. With theoretical and empirical arguments, we show that this strategy better captures long-range interactions, by surpassing the state of the art on five node and graph prediction datasets suited for this problem.
Paperid:1562
Authors:Jacob Mitchell Springer · Sachin Goyal · Kaiyue Wen · Tanishq Kumar · Xiang Yue · Sadhika Malladi · Graham Neubig · Aditi Raghunathan
Abstract:
Scaling up pretraining compute has been a key driver of recent advancements in language model capabilities. While the training-compute-optimal strategy suggests proportionally scaling both model size and pre-training tokens, a recent trend has emerged where models are overtrained on significantly more tokens to reduce inference costs by serving smaller models. We demonstrate in this work an important but previously undocumented consequence of this common design choice. While pre-training loss indeed decreases monotonically with overtraining on more tokens, we provide both empirical and theoretical evidence that extreme overtraining of language models can make them harder to fine-tune. Concretely, this manifests as language models overtrained to extreme token budgets performing worse after fine-tuning---both in-distribution and out-of-distribution---than models trained on moderate token budgets. We provide a theoretical characterization of this finding for linear models. Altogether, our results provide evidence for the surprising fact that larger amounts of pre-training compute may not always result in stronger models after post-training, and that there may be a tradeoff between longer pre-training and downstream adaptability of the resulting model.
Paperid:1563
Authors:HE LI · Haoang Chi · Mingyu Liu · Wanrong Huang · Liyang Xu · Wenjing Yang
Abstract:
The real world naturally has dimensions of time and space. Therefore, estimating the counterfactual outcomes with spatial-temporal attributes is a crucial problem. However, previous methods are based on classical statistical models, which still have limitations in performance and generalization. This paper proposes a novel framework for estimating counterfactual outcomes with spatial-temporal attributes using the Transformer, exhibiting stronger estimation ability. Under mild assumptions, the proposed estimator within this framework is consistent and asymptotically normal. To validate the effectiveness of our approach, we conduct simulation experiments and real data experiments. Simulation experiments show that our estimator has a stronger estimation capability than baseline methods. Real data experiments provide a valuable conclusion to the causal effect of conflicts on forest loss in Colombia.
Paperid:1564
Authors:William Huey · Huaxiaoyue Wang · Anne Wu · Yoav Artzi · Sanjiban Choudhury
Abstract:
We examine the problem of learning sequential tasks from a single visual demonstration. A key challenge arises when demonstrations are temporally misaligned due to variations in timing, differences in embodiment, or inconsistencies in execution. Existing approaches treat imitation as a distribution-matching problem, aligning individual frames between the agent and the demonstration. However, we show that such frame-level matching fails to enforce temporal ordering or ensure consistent progress. Our key insight is that matching should instead be defined at the level of sequences. We propose that perfect matching occurs when one sequence successfully covers all the subgoals in the same order as the other sequence. We present ORCA (ORdered Coverage Alignment), a dense per-timestep reward function that measures the probability of the agent covering demonstration frames in the correct order. On temporally misaligned demonstrations, we show that agents trained with the ORCA reward achieve $4.5$x improvement ($0.11 \rightarrow 0.50$ average normalized returns) for Meta-world tasks and $6.6$x improvement ($6.55 \rightarrow 43.3$ average returns) for Humanoid-v4 tasks compared to the best frame-level matching algorithms. We also provide empirical analysis showing that ORCA is robust to varying levels of temporal misalignment.
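A toy, deterministic version of ordered coverage (hedged: ORCA's actual reward is probabilistic over per-frame matching scores; here matches are binary and hypothetical, to show only the ordering constraint):

```python
def ordered_coverage(agent_matches, n_demo):
    """Per-timestep fraction of demo frames covered *in order*.

    agent_matches[t] is the index of the demo frame the agent's frame t
    resembles, or None if it resembles nothing. A demo frame only counts
    once all earlier demo frames have already been covered, so visiting
    subgoals out of order earns nothing.
    """
    covered = 0
    rewards = []
    for m in agent_matches:
        if m == covered:      # only the next uncovered subgoal advances progress
            covered += 1
        rewards.append(covered / n_demo)
    return rewards
```

Frame-level matching would reward the out-of-order visits too, which is exactly the failure mode the abstract describes.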
Paperid:1565
Authors:Muhammed Göktepe · Amir Hossein Shamseddin · Erencan Uysal · Javier Monteagudo · Lukas Drees · Aysim Toker · Senthold Asseng · Malte von Bloh
Abstract:
Satellite imagery is essential for Earth observation, enabling applications like crop yield prediction, environmental monitoring, and climate-change assessment. However, integrating satellite imagery with climate data remains a challenge, limiting its utility for forecasting and scenario analysis. We introduce a novel dataset of 2.7 million Sentinel-2 images spanning 15 land cover types with corresponding climate records, forming the foundation for two satellite image generation approaches using fine-tuned Stable Diffusion 3 models. The first is a text-to-image generation model that uses textual prompts with climate and land cover details to produce realistic synthetic imagery for specific regions. The second leverages ControlNet for multi-conditional image generation, preserving spatial structures while mapping climate data or generating time-series to simulate landscape evolution. By combining synthetic image generation with climate and land cover data, our work advances generative modeling in remote sensing, offering realistic inputs for environmental forecasting and new possibilities for climate adaptation and geospatial analysis.
Paperid:1566
Authors:Floor Eijkelboom · Heiko Zimmermann · Sharvaree Vadgama · Erik Bekkers · Max Welling · Christian Andersson Naesseth · Jan-Willem van de Meent
Abstract:
We derive a controlled generation objective within the framework of Variational Flow Matching (VFM), which casts flow matching as a variational inference problem. We demonstrate that controlled generation can be implemented in two ways: (1) by way of end-to-end training of conditional generative models, or (2) as a Bayesian inference problem, enabling post hoc control of unconditional models without retraining. Furthermore, we establish the conditions required for equivariant generation and provide an equivariant formulation of VFM tailored for molecular generation, ensuring invariance to rotations, translations, and permutations. We evaluate our approach on both uncontrolled and controlled molecular generation, achieving state-of-the-art performance on uncontrolled generation and outperforming state-of-the-art models in controlled generation, both with end-to-end training and in the Bayesian inference setting. This work strengthens the connection between flow-based generative modeling and Bayesian inference, offering a scalable and principled framework for constraint-driven and symmetry-aware generation.
Paperid:1567
Authors:Christopher Subich · Syed Husain · Leo Separovic · Jing Yang
Abstract:
Recent advancements in data-driven weather forecasting models have delivered deterministic models that outperform the leading operational forecast systems based on traditional, physics-based models. However, these data-driven models are typically trained with a mean squared error loss function, which causes smoothing of fine scales through a ``double penalty'' effect. We develop a simple, parameter-free modification to this loss function that avoids this problem by separating the loss attributable to decorrelation from the loss attributable to spectral amplitude errors. Fine-tuning the GraphCast model with this new loss function results in sharp deterministic weather forecasts, an increase of the model's effective resolution from 1,250km to 160km, improvements to ensemble spread, and improvements to predictions of tropical cyclone strength and surface wind extremes.
Paperid:1568
Authors:Yun Hua · Haosheng Chen · Wenhao Li · Bo Jin · Baoxiang Wang · Hongyuan Zha · Xiangfeng Wang
Abstract:
Addressing reward design complexities in deep reinforcement learning is facilitated by knowledge transfer across different domains. To this end, we define \textit{reward translation} to describe the cross-domain reward transfer problem. However, current methods struggle with non-pairable and non-time-alignable incompatible MDPs. This paper presents an adaptable reward translation framework, \textit{neural reward translation}, featuring \textit{semi-alignable MDPs}, which allows efficient reward translation under relaxed constraints while handling the intricacies of incompatible MDPs. Given the inherent difficulty of directly mapping semi-alignable MDPs and transferring rewards, we introduce an indirect mapping method through reward machines, created using limited human input or LLM-based automated learning. Graph-matching techniques establish links between reward machines from distinct environments, thus enabling cross-domain reward translation within semi-alignable MDP settings. This broadens the applicability of DRL across multiple domains. Experiments substantiate our approach's effectiveness in tasks under environments with semi-alignable MDPs.
Paperid:1569
Authors:Chase Goddard · Lindsay Smith · Wave Ngampruetikorn · David Schwab
Abstract:
In-context learning (ICL) is a remarkable capability of pretrained transformers that allows models to generalize to unseen tasks after seeing only a few examples. We investigate empirically the conditions necessary on the pretraining distribution for ICL to emerge and generalize out-of-distribution. Previous work has focused on the number of distinct tasks necessary in the pretraining dataset -- here, we use a different notion of task diversity to study the emergence of ICL in transformers trained on linear functions. We find that as task diversity increases, transformers undergo a transition from a specialized solution, which exhibits ICL only within the pretraining distribution, to a solution which generalizes out of distribution to the entire task space. We also investigate the nature of the solutions learned by the transformer on both sides of the transition, and observe similar transitions in nonlinear regression problems. We construct a phase diagram to characterize how our concept of task diversity interacts with the number of pretraining tasks. In addition, we explore how factors such as the depth of the model and the dimensionality of the regression problem influence the transition.
Paperid:1570
Authors:Rom N. Parnichkun · Neehal Tumma · Armin Thomas · Alessandro Moro · Qi An · Taiji Suzuki · Atsushi Yamashita · Michael Poli · Stefano Massaroli
Abstract:
As the space of causal sequence modeling architectures continues to grow, the need to develop a general framework for their analysis becomes increasingly important. With this aim, we draw insights from classical signal processing and control theory, to develop a quantitative measure of memory utilization: the internal mechanisms through which a model stores past information to produce future outputs. This metric, which we call effective state-size (ESS), is tailored to the fundamental class of systems with input-invariant and input-varying linear operators, encompassing a variety of computational units such as variants of attention, convolutions, and recurrences. Unlike prior work on memory utilization, which either relies on raw operator visualizations (e.g. attention maps), or simply the total memory capacity (i.e. cache size) of a model, our metric provides highly interpretable and actionable measurements. In particular, we show how ESS can be leveraged to improve initialization strategies, inform novel regularizers and advance the performance-efficiency frontier through model distillation. Furthermore, we demonstrate that the effect of context delimiters (such as end-of-speech tokens) on ESS highlights cross-architectural differences in how large language models utilize their available memory to recall information. Overall, we find that ESS provides valuable insights into the dynamics that dictate memory utilization, enabling the design of more efficient and effective sequence models.
Paperid:1571
Authors:Andrey Kofnov · Daniel Kapla · Ezio Bartocci · Efstathia Bura
Abstract:
We derive exact upper and lower bounds for the cumulative distribution function (cdf) of the output of a neural network over its entire support subject to noisy (stochastic) inputs. The upper and lower bounds converge to the true cdf over its domain as the resolution increases. Our method applies to any feedforward NN using continuous monotonic piecewise differentiable activation functions (e.g., ReLU, tanh and softmax) and convolutional NNs, which were beyond the scope of competing approaches. The novelty and an instrumental tool of our approach is to bound general NNs with ReLU NNs. The ReLU NN based bounds are then used to derive upper and lower bounds of the cdf of the NN output. Experiments demonstrate that our method delivers guaranteed bounds of the predictive output distribution over its support, thus providing exact error guarantees, in contrast to competing approaches.
Paperid:1572
Authors:Stathi Fotiadis · Noah Brenowitz · Tomas Geffner · Yair Cohen · Michael Pritchard · Arash Vahdat · Morteza Mardani
Abstract:
Conditional diffusion and flow models are effective for super-resolving small-scale details in natural images. However, in physical sciences such as weather, three major challenges arise: (i) spatially misaligned input-output distributions (PDEs at different resolutions lead to divergent trajectories), (ii) misaligned and distinct input-output channels (channel synthesis), and (iii) several channels with diverse stochasticity scales (multiscale). To address these, we propose to first encode inputs into a latent base distribution that is closer to the target, then apply Flow Matching to generate small-scale physics. The encoder captures deterministic components, while Flow Matching adds stochastic details. To handle uncertainty in the deterministic part, we inject noise via an adaptive noise scaling mechanism, dynamically adjusted by maximum-likelihood estimates of the encoder’s predictions. Experiments on real-world weather data (including super-resolution from 25 km to 2 km scales in Taiwan) and in synthetic Kolmogorov flow datasets show that our proposed Adaptive Flow Matching (AFM) framework outperforms existing methods and produces better-calibrated ensembles.
Paperid:1573
Authors:Lukas Helff · Felix Friedrich · Manuel Brack · Kristian Kersting · Patrick Schramowski
Abstract:
This paper introduces Llavaguard, a suite of VLM-based vision safeguards that addresses the critical need for reliable tools in the era of large-scale data and models. To this end, we establish a novel open framework, describing a customizable safety taxonomy, data preprocessing, augmentation, and training setup. To teach a VLM safeguard about safety, we further create a multimodal safety dataset with high-quality human expert annotations, where each image is labeled with a safety rating, category and rationale. We also employ advanced augmentations to support context-specific assessments. The resulting Llavaguard models, ranging from 0.5B to 7B, serve as a versatile tool for evaluating the safety compliance of visual content against flexible policies. In comprehensive experiments, Llavaguard outperforms both state-of-the-art safeguards and VLMs in accuracy \textit{and} in flexibly handling different policies. Additionally, we demonstrate Llavaguard's performance in two real-world applications: large-scale dataset annotation and moderation of text-to-image models. We make our entire framework publicly available\footnote{\url{https://anonymous.4open.science/r/LlavaGuard-FE7F}}, including the dataset and model weights.
Paperid:1574
Authors:Mingyuan Li · Jiahao Wang · Wei Ni · Guangsheng Yu · Xu Wang · Haipeng Peng · Lixiang Li · Qianrun Chen
Abstract:
Reinforcement Learning (RL) helps reduce congestion and emissions in Traffic Signal Control (TSC). However, real-world TSC systems often encounter issues like adversarial attacks and missing data, which can lead to incorrect signal decisions and increased congestion. Current methods can only address one of these problems, often by using offline data to predict real-time data, and thus fail to meet the dynamic, real-time online requirements critical for TSC. In this paper, we propose a novel framework called "RobustLight", which includes an innovative, plug-and-play, improved diffusion model that enhances the robustness of existing TSC platforms against data noise, missing values, and complex patterns by restoring data corrupted by attacks. While keeping the TSC platforms intact, RobustLight provides two algorithms, denoise and repaint, which help the TSC platforms recover the original state after a noise attack and restore the missing data in the original state. RobustLight uses a new dynamic state-infilling algorithm to train the improved diffusion model online; the denoise and repaint algorithms then use the trained diffusion model to address adversarial attacks and missing data. Extensive experiments on various real-world datasets validate the recovery performance of RobustLight, with an improvement of up to 50.43\% compared to the case in which RobustLight is not used. RobustLight enables TSC systems to effectively defend against a wide range of adversarial attacks and missing data. The relevant datasets and code are available at anonymous GitHub.
Paperid:1575
Authors:William Laplante · Matias Altamirano · Andrew Duncan · Jeremias Knoblauch · Francois-Xavier Briol
Abstract:
State-space formulations allow for Gaussian process (GP) regression with linear-in-time computational cost in spatio-temporal settings, but performance typically suffers in the presence of outliers. In this paper, we adapt and specialise the robust and conjugate GP (RCGP) framework of Altamirano et al. (2024) to the spatio-temporal setting. In doing so, we obtain an outlier-robust spatio-temporal GP with a computational cost comparable to classical spatio-temporal GPs. We also overcome the three main drawbacks of RCGPs: their unreliable performance when the prior mean is chosen poorly, their lack of reliable uncertainty quantification, and the need to carefully select a hyperparameter by hand. We study our method extensively in finance and weather forecasting applications, demonstrating that it provides a reliable approach to spatio-temporal modelling in the presence of outliers.
Paperid:1576
Authors:Adam Block · Alexander Rakhlin · Ayush Sekhari
Abstract:
Watermarking, the process by which Large Language Model (LLM) servers embed an imperceptible signal at inference time in order to detect text generated by their own models, has grown in importance due to the significant improvements in natural language processing tasks by modern LLMs. Current approaches are often impractical due to generation latency, detection time, degradation in text quality, or lack of robustness; such problems often arise from the focus on token-level watermarking, which ignores the inherent structure of text. In this work, we introduce a new scheme, GaussMark, that is simple and efficient to implement, has formal statistical guarantees, comes at no cost in generation latency, and embeds the watermark into the weights of the model itself, providing a structural watermark. Our approach is based on Gaussian independence testing and is motivated by recent empirical observations that minor additive corruptions to LLM weights can result in models of identical (or even improved) quality. We provide formal statistical bounds on the validity and power of our procedure and, through an extensive suite of experiments, demonstrate that GaussMark is reliable, efficient, relatively robust to corruption, and can be instantiated with essentially no loss in model quality.
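The statistical core of the abstract (a secret Gaussian perturbation of the weights, detected via a Gaussian independence test) can be illustrated with a toy sketch. This is not the paper's exact procedure: `score` is a stand-in for whatever text-derived statistic the detector correlates with the secret key, and the scale of the leak is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000  # size of the watermarked weight tensor (toy scale)

# Secret key: the small Gaussian perturbation added to the model weights.
key = rng.normal(size=d)

def z_statistic(score, key):
    # Under H0 (text unrelated to the watermarked model), score is
    # independent of key, so the normalized inner product is ~ N(0, 1).
    return key @ score / (np.linalg.norm(key) * score.std())

# Unwatermarked text: its score vector is independent of the key.
z_null = z_statistic(rng.normal(size=d), key)

# Text from the watermarked model: its score leaks a trace of the key
# (the 0.2 leak strength is an assumption for the toy).
z_marked = z_statistic(0.2 * key + rng.normal(size=d), key)
```

With these numbers `z_null` stays near zero while `z_marked` lands far in the tail, which is the separation a hypothesis test needs.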
Paperid:1577
Authors:Jayadev Naram · Fredrik Hellström · Ziming Wang · Rebecka Jörnsten · Giuseppe Durisi
Abstract:
In many scenarios of practical interest, labeled data from a target distribution are scarce while labeled data from a related source distribution are abundant. One particular setting of interest arises when the target label space is a subset of the source label space, leading to the framework of partial domain adaptation (PDA). Typical approaches to PDA involve minimizing a domain alignment term and a weighted empirical loss on the source data, with the aim of transferring knowledge between domains. However, a theoretical basis for this procedure is lacking, and in particular, most existing weighting schemes are heuristic. In this work, we derive generalization bounds for the PDA problem based on partial optimal transport. These bounds corroborate the use of the partial Wasserstein distance as a domain alignment term, and lead to theoretically motivated explicit expressions for the empirical source loss weights. Inspired by these bounds, we devise a practical algorithm for PDA, termed WARMPOT. Through extensive numerical experiments, we show that WARMPOT is competitive with recent approaches, and that our proposed weights improve on existing schemes.
Paperid:1578
Authors:Jinzhao Li · Nan Jiang · Yexiang Xue
Abstract:
Satisfiability Modulo Counting (SMC) is a recently proposed general language to reason about problems integrating statistical and symbolic artificial intelligence. An SMC formula is an extended SAT formula in which the truth values of a few Boolean variables are determined by probabilistic inference. Existing approximate solvers optimize surrogate objectives which lack formal guarantees. Current exact solvers directly integrate SAT solvers and probabilistic inference solvers, resulting in slow performance because of many back-and-forth invocations of both solvers. We propose KOCO-SMC, an integrated exact SMC solver that efficiently tracks lower and upper bounds in the probabilistic inference process. It enhances computational efficiency by enabling early estimation of probabilistic inference using only partial variable assignments, whereas existing methods require full variable assignments. In experiments, we compare KOCO-SMC with currently available approximate and exact SMC solvers on large-scale datasets and real-world applications. Our approach delivers high-quality solutions with high efficiency.
Paperid:1579
Authors:Yusheng Zhao · Chi Zhang · Yuxuan Du
Abstract:
Characterizing the ground state properties of quantum systems is fundamental to capturing their behavior but computationally challenging. Recent advances in AI have introduced novel approaches, with diverse machine learning (ML) and deep learning (DL) models proposed for this purpose. However, the necessity and specific role of DL models in these tasks remain unclear, as prior studies often employ varied or impractical quantum resources to construct datasets, resulting in unfair comparisons. To address this, we systematically benchmark DL models against traditional ML approaches across three families of Hamiltonians, scaling up to $127$ qubits in three crucial ground-state learning tasks while enforcing equivalent quantum resource usage. Our results reveal that ML models often achieve performance comparable to or even exceeding that of DL approaches across all tasks. Furthermore, a randomization test demonstrates that measurement input features have minimal impact on DL models' prediction performance. These findings challenge the necessity of current DL models in many quantum system learning scenarios and provide valuable insights into their effective utilization.
Paperid:1580
Authors:Qianqian Wang · Mengping Jiang · Zhengming Ding · Quanxue Gao
Abstract:
K-means clustering is widely recognized as an effective unsupervised learning method due to its simplicity and computational efficiency. However, it faces notable challenges, including sensitivity to initial centroid selection, a limited ability to discern the intrinsic manifold structures within nonlinear datasets, and difficulty in achieving balanced clustering in practical scenarios. To surmount these obstacles, we introduce a novel framework for K-means that leverages manifold learning principles. This approach eliminates the need for centroid initialization and utilizes a label matrix to generate a similarity matrix, thereby aligning manifold structures with class labels and enhancing clustering accuracy and robustness. Beyond the traditional Euclidean distance, our model incorporates Gaussian kernel distance, K-nearest neighbor distance, and low-pass filtering distance to effectively manage data that is not linearly separable. Furthermore, we introduce the $\ell_{2,1}$-norm regularizer to promote balanced clustering outcomes. Comprehensive experimental results substantiate the efficacy of our proposed methodology.
Paperid:1581
Authors:Chenning Yu · Sicun Gao
Abstract:
We introduce a novel resampling criterion using lift scores for improving compositional generation in diffusion models. By leveraging lift scores, we evaluate whether generated samples align with each single condition and then compose the results to determine whether the composed prompt is satisfied. Our key insight is that lift scores can be efficiently approximated using only the original diffusion model, requiring no additional training or external modules. We develop an optimized variant that achieves minimal computational overhead during inference while maintaining effectiveness. Through extensive experiments, we demonstrate that lift scores significantly improve compositional generation across 2D synthetic data, CLEVR position tasks, and text-to-image synthesis.
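The resampling criterion described here, keep a candidate only when its lift score lift(c, x) = p(x|c)/p(x) exceeds 1 for every condition in the composed prompt, can be sketched with explicit toy densities. The paper approximates lifts with the diffusion model itself; the Gaussians, condition locations, and proposal sampler below are assumptions chosen only to make the accept/reject mechanic concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# Toy densities: base p(x) is standard normal per coordinate; condition
# c1 pins x[0] near 2 and condition c2 pins x[1] near -2.
def log_lift(x):
    base = log_normal_pdf(x, 0.0, 1.0)
    cond = log_normal_pdf(x, np.array([2.0, -2.0]), 0.5)
    return cond - base  # per-condition log p(x|c_i) - log p(x)

# Resampling: keep the first candidate whose lift exceeds 1
# (log-lift > 0) for *every* condition in the composed prompt.
def sample():
    while True:
        x = rng.normal(loc=[2.0, -2.0], scale=1.0)  # stand-in composed sampler
        if np.all(log_lift(x) > 0):
            return x

x = sample()
```

Accepted samples necessarily sit in the region where both conditions raise the density above the base model, which is the compositional guarantee the criterion is after.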
Paperid:1582
Authors:Zheng Lian · Haiyang Sun · Licai Sun · Haoyu Chen · Lan Chen · Hao Gu · Zhuofan Wen · Shun Chen · Zhang Siyuan · Hailiang Yao · Bin Liu · Rui Liu · Shan Liang · Ya Li · Jiangyan Yi · Jianhua Tao
Abstract:
Multimodal Emotion Recognition (MER) is a critical research area that seeks to decode human emotions from diverse data modalities. However, existing machine learning methods predominantly rely on predefined emotion taxonomies, which fail to capture the inherent complexity, subtlety, and multi-appraisal nature of human emotional experiences, as demonstrated by studies in psychology and cognitive science. To overcome this limitation, we advocate for introducing the concept of open vocabulary into MER. This paradigm shift aims to enable models to predict emotions beyond a fixed label space, accommodating a flexible set of categories to better reflect the nuanced spectrum of human emotions. To achieve this, we propose a novel paradigm: Open-Vocabulary MER (OV-MER), which enables emotion prediction without being confined to predefined spaces. However, constructing a dataset that encompasses the full range of emotions for OV-MER is practically infeasible; hence, we present a comprehensive solution including a newly curated database, novel evaluation metrics, and a preliminary benchmark. By advancing MER from basic emotions to more nuanced and diverse emotional states, we hope this work can inspire the next generation of MER, enhancing its generalizability and applicability in real-world scenarios.
Paperid:1583
Authors:Niklas Pfister · Václav Volhejn · Manuel Knott · Santiago Arias · Julia Bazinska · Mykhailo Bichurin · Alan Commike · Janet Darling · Peter Dienes · Matthew Fiedler · David Haber · Matthias Kraft · Marco Lancini · Max Mathys · Damian Pascual-Ortiz · Jakub Podolak · Adrià Romero-López · Kyriacos Shiarlis · Andreas Signer · Zsolt Terek · Athanasios Theocharis · Daniel Timbrell · Samuel Trautwein · Samuel Watts · Yun-Han Wu · Mateo Rojas-Carulla
Abstract:
Current evaluations of defenses against prompt attacks in large language model (LLM) applications often overlook two critical factors: the dynamic nature of adversarial behavior and the usability penalties imposed on legitimate users by restrictive defenses. We propose DSEC (Dynamic Security Utility Threat Model), which explicitly separates attackers from legitimate users, models multi-step interactions, and expresses the security-utility trade-off in an optimizable form. We further address the shortcomings in existing evaluations by introducing RedCrowd, a crowd-sourced, gamified red-teaming platform designed to generate realistic, adaptive attacks. Using RedCrowd, we collect and release a dataset of 279k prompt attacks. Complemented by benign user data, our analysis reveals the interplay between security and utility, showing that defenses integrated in the LLM (e.g., system prompts) can degrade usability even without blocking requests. We demonstrate that restricted application domains, defense-in-depth, and adaptive defenses are effective strategies for building secure and useful LLM applications.
Paperid:1584
Authors:Chris Dong · Jannik Peters
Abstract:
Multiwinner voting is the study of electing a fixed-size committee given individual agents' preferences over candidates. Most research in this field has been limited to a static setting, with only one election over a fixed set of candidates. However, this approach overlooks the dynamic nature of applications, where candidate sets are subject to change. We extend the study of proportionality in multiwinner voting to dynamic settings, allowing candidates to join or leave the election and demanding that each chosen committee satisfies proportionality without differing too much from the previously selected committee. We consider approval preferences, ranked preferences, and the proportional clustering setting. In these settings, we either give algorithms making few changes or show that such algorithms cannot exist for various proportionality axioms. In particular, we show that such algorithms cannot exist for ranked preferences and provide amortized and exact algorithms for several proportionality notions in the other two settings.
Paperid:1585
Authors:Karn Tiwari · Niladri Dutta · Prathosh AP · N M Anoop Krishnan
Abstract:
Neural operators have emerged as powerful data-driven frameworks for solving Partial Differential Equations (PDEs), offering significant speedups over numerical methods. However, existing neural operators struggle with scalability in high-dimensional spaces, incur high computational costs, and face challenges in capturing continuous and long-range dependencies in PDE dynamics. To address these limitations, we introduce the Latent Mamba Operator (LaMO), which integrates the efficiency of state-space models (SSMs) in latent space with the expressive power of kernel integral formulations in neural operators. We also establish a theoretical connection between SSMs and the kernel integral of neural operators. In extensive experiments across diverse PDE benchmarks on regular grids, structured meshes, and point clouds covering solid and fluid physics datasets, LaMO achieves consistent state-of-the-art (SOTA) performance, with a 32.3\% improvement over existing baselines in solution operator approximation, highlighting its efficacy in modeling complex PDE solutions.
Paperid:1586
Authors:Aleksandr Gushchin · Khaled Abud · Georgii Bychkov · Ekaterina Shumitskaya · Anna Chistyakova · Sergey Lavrushkin · Bader Rasheed · Kirill Malyshev · Dmitriy Vatolin · Anastasia Antsiferova
Abstract:
Modern neural-network-based Image Quality Assessment (IQA) metrics are vulnerable to adversarial attacks, which can be exploited to manipulate search engine rankings, benchmark results, and content quality assessments, raising concerns about the reliability of IQA metrics in critical applications. This paper presents the first comprehensive study of IQA defense mechanisms in response to adversarial attacks on these metrics, to pave the way for safer use of IQA metrics. We systematically evaluated 30 defense strategies, including purification, training-based, and certified methods, and applied 14 adversarial attacks in adaptive and non-adaptive settings to compare these defenses on 9 no-reference IQA metrics. Our proposed benchmark aims to guide the development of IQA defense methods and is open to submissions; the latest results and code are at (link hidden for blind review).
Paperid:1587
Authors:Christian Fabian · Kai Cui · Heinz Koeppl
Abstract:
Large agent networks are abundant in applications and nature and pose difficult challenges in the field of multi-agent reinforcement learning (MARL) due to their computational and theoretical complexity. While graphon mean field games and their extensions provide efficient learning algorithms for dense and moderately sparse agent networks, the case of realistic sparser graphs remains largely unsolved. Thus, we propose a novel mean field control model inspired by local weak convergence to include sparse graphs such as power law networks with coefficients above two. Besides a theoretical analysis, we design scalable learning algorithms which apply to the challenging class of graph sequences with finite first moment. We compare our model and algorithms for various examples on synthetic and real-world networks with mean field algorithms based on Lp graphons and graphexes. As it turns out, our approach outperforms existing methods in many examples and on various networks, owing to its special design aimed at an important but so far hard-to-solve class of MARL problems.
Paperid:1588
Authors:Leon Götz · Marcel Kollovieh · Stephan Günnemann · Leo Schwinn
Abstract:
Despite recent advances in subquadratic attention mechanisms or state-space models, processing long token sequences still imposes significant computational requirements. Token merging has emerged as a solution to increase computational efficiency in computer vision architectures. In this work, we perform the first investigations of token merging in time series analysis on both transformers and state-space models. We further introduce local merging, a domain-specific token merging algorithm that selectively combines tokens within a local neighborhood, achieving two major benefits: a) Local merging can adjust its computational complexity from quadratic to linear based on the neighborhood size to effectively scale to long sequences; b) Local merging is the first causal merging scheme enabling token merging in transformer decoders. Further, we identify spectral properties of the input data that reliably predict the potential benefits of local merging without requiring evaluation on downstream tasks. Our comprehensive empirical evaluation demonstrates that local merging offers substantial efficiency gains with minimal impact on accuracy, achieving up to 5400% acceleration on the recently proposed Chronos foundation model.
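The local-merging idea as described, combining tokens only within a local neighborhood so that merging stays causal, can be sketched in a few lines. The pair-selection rule (one merge of the most cosine-similar adjacent pair per window, by averaging) is an assumption for illustration; the paper's actual algorithm may select and weight pairs differently.

```python
import numpy as np

def local_merge(tokens, window=4):
    """Merge the most cosine-similar adjacent token pair inside each
    local window (one merge per window), so the sequence shrinks by
    len(tokens) // window tokens. Merging never looks across window
    borders, which is what keeps the scheme causal."""
    out = []
    for start in range(0, len(tokens), window):
        block = tokens[start:start + window]
        if len(block) < 2:
            out.extend(block)
            continue
        unit = block / np.linalg.norm(block, axis=1, keepdims=True)
        sims = (unit[:-1] * unit[1:]).sum(axis=1)   # adjacent cosine sims
        i = int(np.argmax(sims))
        pair_mean = (block[i] + block[i + 1]) / 2   # average the closest pair
        out.extend([*block[:i], pair_mean, *block[i + 2:]])
    return np.stack(out)

tokens = np.random.default_rng(0).normal(size=(8, 16))
merged = local_merge(tokens, window=4)
```

Cost is linear in sequence length for fixed window size, since each window is processed independently, which matches the quadratic-to-linear trade-off the abstract describes.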
Paperid:1589
Authors:Zi-Hao Zhou · Jun-Jie Wang · Tong Wei · Min-Ling Zhang
Abstract:
Contrastive learning has achieved remarkable success in learning effective representations, with supervised contrastive learning often outperforming self-supervised approaches. However, in real-world scenarios, data annotations are often ambiguous or inaccurate, meaning that class labels may not reliably indicate whether two examples belong to the same class. This limitation restricts the applicability of supervised contrastive learning. To address this challenge, we introduce the concept of ``continuous semantic similarity'' to define positive and negative pairs. Instead of directly relying on imprecise class labels, we measure the semantic similarity between example pairs, which quantifies how closely they belong to the same category by iteratively refining weak supervisory signals. Based on this concept, we propose a graph-theoretic framework for weakly-supervised contrastive learning, where semantic similarity serves as the graph weights. Our framework is highly versatile and can be applied to many weakly-supervised learning scenarios. We demonstrate its effectiveness through experiments in two common settings, i.e., noisy label and partial label learning, where existing methods can be easily integrated to significantly improve performance. Theoretically, we establish an error bound for our approach, showing that it can approximate supervised contrastive learning under mild conditions. The supplementary material includes the source code for reproducibility.
Paperid:1590
Authors:Chen-Chia Chang · Wan-Hsuan Lin · Yikang Shen · Yiran Chen · Xin Zhang
Abstract:
Automation of analog topology design is crucial, as the customized requirements of modern applications currently demand heavy manual engineering effort. The state-of-the-art work applies a sequence-to-sequence approach and supervised fine-tuning on language models to generate topologies given user specifications. However, its circuit formulation is inefficient due to $O(|V|^2)$ token length and suffers from low precision sensitivity to numeric inputs. In this work, we introduce LaMAGIC2, a succinct float-input canonical formulation with identifier (SFCI) for language-model-based analog topology generation. SFCI addresses these challenges by improving component-type recognition through identifier-based representations, reducing token-length complexity to $O(|V| + |E|)$, and enhancing numeric precision sensitivity for better performance under tight tolerances. Our experiments demonstrate that LaMAGIC2 achieves 34\% higher success rates under a tight tolerance of 0.01 and 10X lower MSEs compared to a prior method. LaMAGIC2 also exhibits better transferability to circuits with more vertices, with up to 58.5\% improvement. These advancements establish LaMAGIC2 as a robust framework for analog topology generation.
Paperid:1591
Authors:Wonsuk Jang · Thierry Tambe
Abstract:
The rapidly increasing size of large language models (LLMs) presents significant challenges in memory usage and computational costs. Quantizing both weights and activations can address these issues, with hardware-supported fine-grained scaling emerging as a promising solution to mitigate outliers. However, existing methods struggle to capture nuanced block data distributions. We propose BlockDialect, a block-wise fine-grained mixed-format technique that assigns a per-block optimal number format from a formatbook for better data representation. Additionally, we introduce DialectFP4, a formatbook of FP4 variants (akin to dialects) that adapt to diverse data distributions. To leverage this efficiently, we propose a two-stage approach for online DialectFP4 activation quantization. Importantly, DialectFP4 ensures energy efficiency by selecting representable values as scaled integers compatible with low-precision integer arithmetic. BlockDialect achieves 10.78% (7.48%) accuracy gain on the LLaMA3-8B (LLaMA2-7B) model compared to the MXFP4 format with lower bit usage per data, while being only 5.45% (2.69%) below full precision even when quantizing full-path matrix multiplication. Focusing on how to represent rather than how to scale, our work presents a promising path for energy-efficient LLM inference.
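The per-block mechanic described here, scale each block and pick whichever number format in a formatbook reconstructs it best, can be sketched directly. The two value grids below are hypothetical stand-ins (the paper's DialectFP4 variants are not reproduced here); only the format-selection logic is the point.

```python
import numpy as np

# Hypothetical two-entry formatbook: positive magnitude grids of two
# 4-bit "dialects" (sign handled separately). These grids are
# illustrative, not the paper's DialectFP4 definitions.
FORMATBOOK = {
    "fp4_like": np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]),
    "int4":     np.arange(8.0),
}

def quantize_with_grid(x, grid):
    """Scale the block to the grid's range, snap each magnitude to the
    nearest grid value, and rescale."""
    scale = np.abs(x).max() / grid.max() + 1e-12
    idx = np.abs(np.abs(x)[:, None] / scale - grid).argmin(axis=1)
    return np.sign(x) * grid[idx] * scale

def block_dialect(x, block=32):
    """Per block, keep the format whose reconstruction error is lowest."""
    out = np.empty_like(x)
    for s in range(0, x.size, block):
        blk = x[s:s + block]
        cands = [quantize_with_grid(blk, g) for g in FORMATBOOK.values()]
        out[s:s + block] = min(cands, key=lambda q: np.mean((q - blk) ** 2))
    return out

x = np.random.default_rng(0).normal(size=128)
xq = block_dialect(x)
```

Blocks with outliers tend to prefer the non-uniform grid (more levels near zero), while well-spread blocks prefer the uniform one, which is the "dialect" adaptation the abstract describes.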
Paperid:1592
Authors:Sangyeon Park · Isaac Han · Seungwon Oh · KyungJoong Kim
Abstract:
Plasticity loss, a critical challenge in neural network training, limits a model's ability to adapt to new tasks or shifts in data distribution. While widely used techniques like L2 regularization and Layer Normalization have proven effective in mitigating this issue, Dropout remains notably ineffective. This paper introduces AID (Activation by Interval-wise Dropout), a novel method inspired by Dropout, designed to address plasticity loss. Unlike Dropout, AID generates subnetworks by applying Dropout with different probabilities on each pre-activation interval. Theoretical analysis reveals that AID regularizes the network, promoting behavior analogous to that of deep linear networks, which do not suffer from plasticity loss. We validate the effectiveness of AID in maintaining plasticity across various benchmarks, including continual learning tasks on standard image classification datasets such as CIFAR10, CIFAR100, and TinyImageNet. Furthermore, we show that AID enhances reinforcement learning performance in the Arcade Learning Environment benchmark.
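"Dropout with different probabilities on each pre-activation interval" admits a compact sketch. The two-interval split (h > 0 vs. h <= 0) and the specific rates below are assumptions for illustration, not the paper's configuration; note that rates (0, 1) would recover ReLU exactly, and equal rates recover plain Dropout on the identity.

```python
import numpy as np

def aid(h, p=(0.1, 0.8), rng=None):
    """Interval-wise dropout (illustrative two-interval variant):
    drop positive pre-activations with rate p[0] and non-positive
    ones with rate p[1]. Inverted-dropout scaling keeps E[out] = h."""
    rng = rng if rng is not None else np.random.default_rng()
    drop = np.where(h > 0, p[0], p[1])       # per-element drop rate
    keep = rng.random(h.shape) >= drop       # Bernoulli keep mask
    return h * keep / (1.0 - drop)           # rescale per interval

rng = np.random.default_rng(0)
h = rng.normal(size=(100_000,))
out = aid(h, rng=rng)
```

Because each interval is rescaled by its own keep probability, the layer is unbiased in expectation while the random subnetworks it induces behave piecewise-linearly, loosely mirroring the deep-linear-network analogy in the abstract.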
Paperid:1593
Authors:Mara Finkelstein · Daniel Deutsch · Parker Riley · Juraj Juraska · Geza Kovacs · Markus Freitag
Abstract:
As LLMs continue to become more powerful and versatile, human evaluation has become intractable at scale and reliance on automatic metrics has become the norm. Recently, it has been shown that LLMs are themselves state-of-the-art evaluators for many tasks. These Autoraters are typically designed so that they generalize to new systems and test sets. In practice, however, evaluation is performed on a small set of fixed, canonical test sets, which are carefully curated to measure the capabilities of interest and are not changed frequently. In this work, we design a method which specializes a prompted Autorater to a given test set, by leveraging historical ratings on the test set to construct in-context learning (ICL) examples. We evaluate our Specialist method on the task of fine-grained machine translation evaluation, and show that it dramatically outperforms the state-of-the-art XCOMET metric by 54% and 119% on the WMT'23 and WMT'24 test sets, respectively. We perform extensive analyses to understand the representations learned by our Specialist metrics, and how variability in rater behavior affects their performance. We also verify the generalizability and robustness of our Specialist method across different numbers of ICL examples, LLM backbones, systems to evaluate, and evaluation tasks.
Paperid:1594
Authors:Xiaoqian Shen · Yunyang Xiong · Changsheng Zhao · Lemeng Wu · Jun Chen · Chenchen Zhu · Zechun Liu · Fanyi Xiao · Balakrishnan Varadarajan · Florian Bordes · Zhuang Liu · Hu Xu · Hyunwoo Kim · Bilge Soran · Raghuraman Krishnamoorthi · Mohamed Elhoseiny · Vikas Chandra
Abstract:
Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by the limited context length. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism to reduce the number of video tokens while preserving visual details of long videos. Our idea is based on leveraging cross-modal query and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. Then we utilize text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within limited context length. Our LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a light-weight LLM, our LongVU also scales effectively into a smaller size with state-of-the-art video understanding performance. Our code will be made publicly available.
Paperid:1595
Authors:Gouki Minegishi · Hiroki Furuta · Shohei Taniguchi · Yusuke Iwasawa · Yutaka Matsuo
Abstract:
Transformer-based language models exhibit In-Context Learning (ICL), where predictions are made adaptively based on context. While prior work links induction heads to ICL through phase transitions, this can only account for ICL when the answer is included within the context. However, an important property of practical ICL in large language models is the ability to meta-learn how to solve tasks from context, rather than just copying answers from context; how such an ability is obtained during training is largely unexplored. In this paper, we experimentally clarify how such meta-learning ability is acquired by analyzing the dynamics of the model’s circuits during training, extending the copy task from previous research to an In-Context Meta Learning setting, where models must infer tasks from examples to answer queries. Interestingly, in this setting, we find that there are multiple phases in the process of acquiring such abilities, and that a unique circuit emerges in each phase, contrasting with the single-phase transition in induction heads. The emergence of such circuits can be related to several phenomena known in large language models, and our analysis leads to a deeper understanding of the source of the Transformer’s ICL ability.
Paperid:1596
Authors:Shoukai Xu · ZihaoLian · Mingkui Tan · Liu Liu · Zhong Zhang · Peilin Zhao
Abstract:
Offline reinforcement learning is widely applied in multiple fields due to its advantages in efficiency and risk control. However, a major problem it faces is the distribution shift between offline datasets and online environments. This mismatch leads to out-of-distribution (OOD) state-action pairs that fall outside the scope of the training data. Therefore, existing conservative training policies may not provide reliable decisions when the test environment deviates greatly from the offline dataset. In this paper, we propose Test-time Adapted Reinforcement Learning (TARL) to address this problem. TARL constructs unsupervised test-time optimization objectives for discrete and continuous control tasks, using test data without depending on environmental rewards. In discrete control tasks, it minimizes the entropy of predicted action probabilities to decrease uncertainty and avoid OOD state-action pairs. For continuous control tasks, it represents and minimizes action uncertainty based on the normal distribution of policy network outputs. Moreover, to prevent model bias caused by overfitting and error accumulation during the test-time update process, TARL enforces a KL divergence constraint between the fine-tuned policy and the original policy. For efficiency, TARL only updates the layer normalization layer parameters during testing. Extensive experiments on popular Atari game benchmarks and the D4RL dataset demonstrate the superiority of our method, which achieved a significant improvement over CQL, with a relative increase of 7.1 in episode return on the walker2d-medium-v2 task.
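The discrete-control objective described here, minimize the entropy of the predicted action distribution at test time, with no reward signal, can be shown in miniature. A single scale parameter stands in for the layer-norm gain TARL adapts; the logits, step size, and finite-difference gradient are assumptions for the toy, and the paper's KL constraint to the original policy is omitted.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum()

# Toy "policy head": one scale parameter standing in for the layer-norm
# gain that is the only thing updated at test time.
logits = np.array([1.0, 0.5, 0.2, -0.3])
scale = 1.0
loss = lambda s: entropy(softmax(s * logits))

# One unsupervised test-time step: gradient descent on the entropy of
# the action distribution (finite-difference gradient, no reward used).
grad = (loss(scale + 1e-4) - loss(scale - 1e-4)) / 2e-4
scale_new = scale - 0.5 * grad
```

The step sharpens the action distribution (entropy drops), which is exactly the uncertainty reduction used to steer the policy away from OOD state-action pairs.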
Paperid:1597
Authors:Yiming Chen · yuan zhang · Yin Liu · Kun Yuan · Zaiwen Wen
Abstract:
The memory challenges associated with training Large Language Models (LLMs) have become a critical concern, particularly when using the Adam optimizer. To address this issue, numerous memory-efficient techniques have been proposed, with GaLore standing out as a notable example designed to reduce the memory footprint of optimizer states. However, these approaches do not alleviate the memory burden imposed by activations, rendering them unsuitable for scenarios involving long context sequences or large mini-batches. Moreover, their convergence properties are still not well-understood in the literature. In this work, we introduce a Randomized Subspace Optimization framework for pre-training and fine-tuning LLMs. Our approach decomposes the high-dimensional training problem into a series of lower-dimensional subproblems. At each iteration, a random subspace is selected, and the parameters within that subspace are optimized. This structured reduction in dimensionality allows our method to simultaneously reduce memory usage for both activations and optimizer states. We establish comprehensive convergence guarantees and derive rates for various scenarios, accommodating different optimization strategies to solve the subproblems. Extensive experiments validate the superior memory and communication efficiency of our method, achieving performance comparable to GaLore and Adam.
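The iteration described, pick a random subspace, optimize only the parameters within it, repeat, can be sketched on a least-squares stand-in for the training loss. The problem, subspace dimension, and exact subproblem solve below are assumptions for illustration; the paper applies the idea to LLM training with various subproblem optimizers.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 5                      # full dimension vs. subspace dimension
A = rng.normal(size=(n, n))
b = rng.normal(size=n)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)   # stand-in training loss

x = np.zeros(n)
for _ in range(30):
    P = rng.normal(size=(n, k))   # random k-dimensional subspace basis
    # Solve the k-dimensional subproblem min_z f(x + P z) exactly;
    # only k-dimensional quantities need optimizer state.
    z, *_ = np.linalg.lstsq(A @ P, b - A @ x, rcond=None)
    x = x + P @ z
```

Each subproblem solve can only decrease the loss, and cycling through fresh random subspaces gradually covers the full parameter space, which is the intuition behind the framework's convergence guarantees.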
Paperid:1598
Authors:Dongwoo Lee · Dong Bok Lee · Steven Adriaensen · Juho Lee · Sung Ju Hwang · Frank Hutter · Seon Joo Kim · Hae Beom Lee
Abstract:
Scaling has been a major driver of recent advancements in deep learning. Numerous empirical studies have found that scaling laws often follow a power law and have proposed several variants of power-law functions to predict the scaling behavior at larger scales. However, existing methods mostly rely on point estimation and do not quantify uncertainty, which is crucial for real-world applications involving decision-making problems such as determining the expected performance improvements achievable by investing additional computational resources. In this work, we explore a Bayesian framework based on Prior-data Fitted Networks (PFNs) for neural scaling law extrapolation. Specifically, we design a prior distribution that enables the sampling of infinitely many synthetic functions resembling real-world neural scaling laws, allowing our PFN to meta-learn the extrapolation. We validate the effectiveness of our approach on real-world neural scaling laws, comparing it against both existing point estimation methods and Bayesian approaches. Our method demonstrates superior performance, particularly in data-limited scenarios such as Bayesian active learning, underscoring its potential for reliable, uncertainty-aware extrapolation in practical applications.
Paperid:1599
Authors:Jaesik Yoon · Hyeonseo Cho · Doojin Baek · Yoshua Bengio · Sungjin Ahn
Abstract:
Diffusion models have recently emerged as a powerful tool for planning. However, unlike Monte Carlo Tree Search (MCTS)—whose performance naturally improves with additional test-time computation (TTC)—standard diffusion-based planners offer only limited avenues for TTC scalability. In this paper, we introduce Monte Carlo Tree Diffusion (MCTD), a novel framework that integrates the generative strength of diffusion models with the adaptive search capabilities of MCTS. Our method reconceptualizes denoising as a tree-structured process, allowing partially denoised plans to be iteratively evaluated, pruned, and refined. By selectively expanding promising trajectories while retaining the flexibility to revisit and improve suboptimal branches, MCTD achieves the benefits of MCTS, such as controlling exploration-exploitation trade-offs, within the diffusion framework. Empirical results on challenging long-horizon tasks show that MCTD outperforms diffusion baselines, yielding higher-quality solutions as TTC increases.
Paperid:1600
Authors:Da Long · Zhitong Xu · Guang Yang · Akil Narayan · Shandian Zhe
Abstract:
Modern physics simulation often involves multiple functions of interest, and traditional numerical approaches are known to be complex and computationally costly. While machine learning-based surrogate models can offer significant cost reductions, most focus on a single task, such as forward prediction, and typically lack uncertainty quantification --- an essential component in many applications. To overcome these limitations, we propose Arbitrarily-Conditioned Multi-Functional Diffusion (ACM-FD), a versatile probabilistic surrogate model for multi-physics emulation. ACM-FD can perform a wide range of tasks within a single framework, including forward prediction, various inverse problems, and simulating data for entire systems or subsets of quantities conditioned on others. Specifically, we extend the standard Denoising Diffusion Probabilistic Model (DDPM) for multi-functional generation by modeling noise as Gaussian processes (GPs). We propose a random-mask-based, zero-regularized denoising loss to achieve flexible and robust conditional generation. We induce a Kronecker product structure in the GP covariance matrix, substantially reducing the computational cost and enabling efficient training and sampling. We demonstrate the effectiveness of ACM-FD across several fundamental multi-physics systems.
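The computational saving from a Kronecker-structured covariance rests on the standard identity (A ⊗ B) vec(X) = vec(B X Aᵀ), which avoids ever materializing the full matrix. A small numerical check of the identity (illustrative only; the paper's actual GP covariances are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((5, 5))
X = rng.standard_normal((5, 4))

# Kronecker identity: (A kron B) vec(X) = vec(B X A^T),
# where vec() stacks columns (Fortran order).
full = np.kron(A, B) @ X.ravel(order="F")   # forms the 20x20 matrix
fast = (B @ X @ A.T).ravel(order="F")       # never forms it
assert np.allclose(full, fast)
```

The left-hand side costs O((mn)^2) to apply; the right-hand side only two small matrix products, which is what makes Kronecker-structured training and sampling tractable.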
Paperid:1601
Authors:Xinxing Shi · Xiaoyu Jiang · Mauricio Álvarez
Abstract:
Gaussian Process (GP) Variational Autoencoders (VAEs) extend standard VAEs by replacing the fully factorised Gaussian prior with a GP prior, thereby capturing richer correlations among latent variables. However, performing exact GP inference in large-scale GPVAEs is computationally prohibitive, often forcing existing approaches to rely on restrictive kernel assumptions or large sets of inducing points. In this work, we propose a neighbour-driven approximation strategy that exploits local adjacencies in the latent space to achieve scalable GPVAE inference. By confining computations to the nearest neighbours of each data point, our method preserves essential latent dependencies, allowing more flexible kernel choices and mitigating the need for numerous inducing points. Through extensive experiments on tasks including representation learning, data imputation, and conditional generation, we demonstrate that our approach outperforms other GPVAE variants in both predictive performance and computational efficiency.
Paperid:1602
Authors:Wei Liu · Zhongyu Niu · Lang Gao · Zhiying Deng · Jun Wang · Haozhao Wang · Ruixuan Li
Abstract:
This study investigates the self-rationalization framework constructed with a cooperative game, where a generator initially extracts the most informative segment from raw input, and a subsequent predictor utilizes the selected subset as its input. The generator and predictor are trained collaboratively to maximize prediction accuracy. In this paper, we first uncover a potential caveat: such a cooperative game could unintentionally introduce a sampling bias during rationale extraction. Specifically, the generator might inadvertently create an incorrect correlation between the selected rationale candidate and the label, even when they are semantically unrelated in the original dataset. Subsequently, we elucidate the origins of this bias using both detailed theoretical analysis and empirical evidence. Our findings suggest a direction for inspecting these correlations through attacks, based on which we further introduce an instruction to prevent the predictor from learning the correlations. Through experiments on six text classification datasets and two graph classification datasets using three network architectures (GRUs, BERT, and GCN), we show that our method significantly outperforms recent rationalization methods.
Paperid:1603
Authors:Virginia Aglietti · Ira Ktena · Jessica Schrouff · Eleni Sgouritsa · Francisco Ruiz · Alan Malek · Alexis Bellot · Silvia Chiappa
Abstract:
The sample efficiency of Bayesian optimization algorithms depends on carefully crafted acquisition functions (AFs) guiding the sequential collection of function evaluations. The best-performing AF can vary significantly across optimization problems, often requiring ad-hoc and problem-specific choices. This work tackles the challenge of designing novel AFs that perform well across a variety of experimental settings. Based on FunSearch, a recent work using Large Language Models (LLMs) for discovery in the mathematical sciences, we propose FunBO, an LLM-based method that can be used to learn new AFs written in computer code by leveraging access to a limited number of evaluations for a set of objective functions. We provide the analytic expression of all discovered AFs and evaluate them on various global optimization benchmarks and hyperparameter optimization tasks. We show how FunBO identifies AFs that generalize well in and out of the training distribution of functions, thus outperforming established general-purpose AFs and achieving competitive performance against AFs that are customized to specific function types and are learned via transfer-learning algorithms.
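For context, the kind of hand-designed AF that FunBO competes with can itself be written in a few lines of code. Below is closed-form Expected Improvement for maximization, a standard general-purpose AF; this is background illustration, not one of FunBO's discovered functions.

```python
import math

def expected_improvement(mu, sigma, best):
    """Closed-form EI for maximization: (mu - best) * Phi(z) + sigma * phi(z),
    with z = (mu - best) / sigma, given a Gaussian posterior N(mu, sigma^2)."""
    if sigma <= 0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))
    return (mu - best) * cdf + sigma * pdf

# Higher posterior mean (same uncertainty) gives higher EI.
assert expected_improvement(1.0, 0.5, 0.0) > expected_improvement(0.2, 0.5, 0.0)
# EI stays nonnegative even when the mean is below the incumbent.
assert expected_improvement(-1.0, 0.5, 0.0) >= 0.0
```

FunBO searches over programs of exactly this shape: short functions mapping posterior statistics to an acquisition score.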
Paperid:1604
Authors:Yitian Zhang · Liheng Ma · Antonios Valkanas · Boris Oreshkin · Mark Coates
Abstract:
Koopman operator theory provides a framework for nonlinear dynamical system analysis and time series forecasting by mapping dynamics to a space of real-valued measurement functions, enabling a linear operator representation. Despite the advantage of linearity, the operator is generally infinite-dimensional. Therefore, the objective is to learn measurement functions that yield a tractable finite-dimensional Koopman operator approximation. In this work, we establish a connection between Koopman operator approximation and linear Recurrent Neural Networks (RNNs), which have recently demonstrated remarkable success in sequence modeling. We show that by considering an extended state consisting of lagged observations, we can establish an equivalence between a structured Koopman operator and linear RNN updates. Building on this connection, we present SKOLR, which integrates a learnable spectral decomposition of the input signal with a multilayer perceptron (MLP) as the measurement functions and implements a structured Koopman operator via a highly parallel linear RNN stack. Numerical experiments on various forecasting benchmarks and dynamical systems show that this streamlined, Koopman-theory-based design delivers exceptional performance.
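The Koopman/linear-RNN equivalence can be seen on a classic toy system: with hand-picked measurement functions spanning a Koopman-invariant subspace, advancing the features is exactly a linear recurrence z_{t+1} = K z_t. This sketch uses our own toy dynamics and hand-chosen features, not SKOLR's learned measurement functions.

```python
import numpy as np

mu, lam, c = 0.9, 0.5, 0.3

def step(x1, x2):
    # Nonlinear dynamics: the x1**2 term makes the system nonlinear in (x1, x2).
    return mu * x1, lam * x2 + c * x1 ** 2

def phi(x1, x2):
    # Measurement functions spanning a Koopman-invariant subspace.
    return np.array([x1, x2, x1 ** 2])

# Finite, structured Koopman matrix: advancing phi is a linear RNN update.
K = np.array([[mu, 0.0, 0.0],
              [0.0, lam, c],
              [0.0, 0.0, mu ** 2]])

x1, x2 = 1.2, -0.7
z = phi(x1, x2)
for _ in range(4):
    x1, x2 = step(x1, x2)   # nonlinear evolution in state space
    z = K @ z               # linear evolution in measurement space
assert np.allclose(z, phi(x1, x2))
```

The linear recurrence over `z` reproduces the nonlinear trajectory exactly because the chosen features close under the dynamics; SKOLR learns such features rather than hand-picking them.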
Paperid:1605
Authors:Jiahua Rao · Dahao Xu · Wentao Wei · Yicong Chen · Mingjun Yang · Yuedong Yang
Abstract:
While Graph Neural Networks (GNNs) and Transformers have shown promise in predicting molecular properties, they struggle with directly modeling complex many-body interactions. Current methods often implicitly approximate these interactions in message passing, such as three- and four-body terms, whereas attention-based models, though promising for enabling direct communication between atoms, typically handle at most triplets of nodes, making higher-order interactions computationally challenging. To address these limitations, we introduce MABNet, a geometric attention framework designed to model four-body interactions by facilitating direct communication among atomic quartets. This approach bypasses the computational bottlenecks associated with traditional triplet-based attention mechanisms, allowing for the efficient handling of higher-order interactions. MABNet achieves state-of-the-art performance on benchmarks like MD22 and SPICE. These improvements underscore its capability to accurately capture intricate many-body interactions in large molecules. By unifying rigorous many-body physics with computational efficiency, MABNet advances molecular simulations for applications in drug design and materials discovery, while its extensible framework paves the way for modeling higher-order quantum effects.
Paperid:1606
Authors:Geonhui Yoo · Minhak Song · Chulhee Yun
Abstract:
When training deep neural networks with gradient descent, sharpness often increases---a phenomenon known as progressive sharpening---before saturating at the edge of stability. Although commonly observed in practice, the underlying mechanisms behind progressive sharpening remain poorly understood. In this work, we study this phenomenon using a minimalist model: a deep linear network with a single neuron per layer. We show that this simple model effectively captures the sharpness dynamics observed in recent empirical studies, offering a simple testbed to better understand neural network training. Moreover, we theoretically analyze how dataset properties, network depth, stochasticity of optimizers, and step size affect the degree of progressive sharpening in the minimalist model. We then empirically demonstrate how these theoretical insights extend to practical scenarios. This study offers a deeper understanding of sharpness dynamics in neural network training, highlighting the interplay between depth, training data, and optimizers.
Paperid:1607
Authors:Vivek Farias · Joren Gijsbrechts · Aryan Khojandi · Tianyi Peng · Andrew Zheng
Abstract:
Simulating a single trajectory of a dynamical system under some state-dependent policy is a core bottleneck in policy optimization (PO) algorithms. The many inherently serial policy evaluations that must be performed in a single simulation constitute the bulk of this bottleneck. In applying PO to supply chain optimization (SCO) problems, simulating a single sample path corresponding to one month of a supply chain can take several hours. We present an iterative algorithm to accelerate policy simulation, dubbed Picard Iteration. This scheme carefully assigns policy evaluation tasks to independent processes. Within an iteration, any given process evaluates the policy only on its assigned tasks while assuming a certain ‘cached’ evaluation for other tasks; the cache is updated at the end of the iteration. Implemented on GPUs, this scheme admits batched evaluation of the policy across a single trajectory. We prove that the structure afforded by many SCO problems allows convergence in a small number of iterations independent of the horizon. We demonstrate practical speedups of 400x on large-scale SCO problems even with a single GPU, and also demonstrate practical efficacy in other RL environments.
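The caching scheme can be illustrated with a toy fixed-point rollout: each iteration makes one batched policy call on the cached states, then cheaply re-rolls the trajectory with the cached actions; the cache converges to the serial rollout. This is a minimal sketch with invented toy dynamics, not the paper's supply-chain simulator.

```python
import numpy as np

def step(state, action):
    # Toy deterministic dynamics; stands in for an expensive simulator step.
    return 0.5 * state + action

def policy(states):
    # Batched policy evaluation (the expensive call the scheme parallelizes).
    return np.tanh(states)

def picard_rollout(s0, horizon, iters):
    """Fixed-point iteration: evaluate the policy on cached states in one
    batch, re-roll the trajectory, and repeat until the cache stabilizes."""
    states = np.zeros(horizon + 1)
    states[0] = s0
    for _ in range(iters):
        actions = policy(states[:-1])      # one batched policy call
        new = states.copy()
        for t in range(horizon):           # cheap re-roll with cached actions
            new[t + 1] = step(new[t], actions[t])
        if np.allclose(new, states):
            break
        states = new
    return states

def serial_rollout(s0, horizon):
    states = [s0]
    for _ in range(horizon):
        a = float(policy(np.array([states[-1]]))[0])
        states.append(step(states[-1], a))
    return np.array(states)

picard = picard_rollout(1.0, horizon=10, iters=20)
serial = serial_rollout(1.0, horizon=10)
assert np.allclose(picard, serial, atol=1e-6)
```

Each iteration propagates correct state information at least one step further along the horizon, so the fixed point is reached in at most `horizon` iterations here; the paper's point is that SCO structure makes the required number of iterations far smaller and horizon-independent.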
Paperid:1608
Authors:Hongbin Pei · Jingxin Hai · Yu Li · Huiqi Deng · Denghao Ma · Jie Ma · Pinghui Wang · Jing Tao · Xiaohong Guan
Abstract:
Pseudo-labeling is a widely used strategy in semi-supervised learning. Existing methods typically select predicted labels with high confidence scores and high training stationarity as pseudo-labels to augment training sets. In contrast, this paper explores the pseudo-labeling potential of predicted labels that do not exhibit these characteristics. We discover a new type of predicted labels suitable for pseudo-labeling, termed two-phase labels, which exhibit a two-phase pattern during training: they are initially predicted as one category in early training stages and switch to another category in subsequent epochs. Case studies show the two-phase labels are informative for decision boundaries. To effectively identify the two-phase labels, we design a 2-phasic metric that mathematically characterizes their spatial and temporal patterns. Furthermore, we propose a loss function tailored for two-phase pseudo-labeling learning, allowing models not only to learn correct correlations but also to eliminate false ones. Extensive experiments on eight datasets show that our proposed 2-phasic metric acts as a powerful booster for existing pseudo-labeling methods by additionally incorporating the two-phase labels, achieving an average classification accuracy gain of 1.73% on image datasets and 1.92% on graph datasets.
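A simplified way to see what a two-phase label is: a per-epoch prediction history with exactly one category switch, with both phases lasting a while. The boolean check below is our illustrative stand-in for the paper's continuous 2-phasic metric, which scores spatial as well as temporal patterns.

```python
def is_two_phase(pred_labels, min_phase=2):
    """Detect a two-phase pattern: one category in early epochs, a single
    switch, then another category for the rest of training."""
    switches = [t for t in range(1, len(pred_labels))
                if pred_labels[t] != pred_labels[t - 1]]
    if len(switches) != 1:
        return False
    t = switches[0]
    # Require both phases to persist for at least min_phase epochs.
    return t >= min_phase and len(pred_labels) - t >= min_phase

assert is_two_phase([0, 0, 0, 1, 1, 1])
assert not is_two_phase([0, 1, 0, 1, 0, 1])   # unstable, not two-phase
assert not is_two_phase([2, 2, 2, 2, 2, 2])   # stationary (classic pseudo-label)
```

The contrast in the last two cases is the point: two-phase labels are neither stationary (the usual pseudo-label criterion) nor erratic.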
Paperid:1609
Authors:Corinna Coupette · Jeremy Wayland · Emily Simons · Bastian Rieck
Abstract:
Benchmark datasets have proved pivotal to the success of graph learning, and good benchmark datasets are crucial to guide the development of the field. Recent research has highlighted problems with graph-learning datasets and benchmarking practices—revealing, for example, that methods which ignore the graph structure can outperform graph-based approaches on popular benchmark datasets. Such findings raise two questions: (1) What makes a good graph-learning dataset, and (2) how can we evaluate dataset quality in graph learning? Our work addresses these questions. As the classic evaluation setup uses datasets to evaluate models, it does not apply to dataset evaluation. Hence, we start from first principles. Observing that graph-learning datasets uniquely combine two modes—the graph structure and the node features—we introduce RINGS, a flexible and extensible mode-perturbation framework to assess the quality of graph-learning datasets based on dataset ablations—i.e., by quantifying differences between the original dataset and its perturbed representations. Within this framework, we propose two measures—performance separability and mode complementarity—as evaluation tools, each assessing, from a distinct angle, the capacity of a graph dataset to benchmark the power and efficacy of graph-learning methods. We demonstrate the utility of our framework for graph-learning dataset evaluation in an extensive set of experiments and derive actionable recommendations for improving the evaluation of graph-learning methods. Our work opens new research directions in data-centric graph learning, and it constitutes a first step toward the systematic evaluation of evaluations.
Paperid:1610
Authors:Tianlang Chen · Charilaos Kanatsoulis · Jure Leskovec
Abstract:
Predictive tasks on relational databases are critical in real-world applications spanning e-commerce, healthcare, and social media. To address these tasks effectively, Relational Deep Learning (RDL) encodes relational data as graphs, enabling Graph Neural Networks (GNNs) to exploit relational structures for improved predictions. However, existing heterogeneous GNNs often overlook the intrinsic structural properties of relational databases, leading to modeling inefficiencies. Here we introduce RelGNN, a novel GNN framework specifically designed to capture the unique characteristics of relational databases. At the core of our approach is the introduction of atomic routes, which are sequences of nodes forming high-order tripartite structures. Building upon these atomic routes, RelGNN designs new composite message passing mechanisms between heterogeneous nodes, allowing direct single-hop interactions between them. This approach avoids redundant aggregations and mitigates information entanglement, ultimately leading to more efficient and accurate predictive modeling. RelGNN is evaluated on 30 diverse real-world tasks from RelBench (Fey et al., 2024), and consistently achieves state-of-the-art accuracy with up to 25% improvement.
Paperid:1611
Authors:Connor Schenck · Isaac Reid · Mithun Jacob · Alex Bewley · Joshua Ainslie · David Rendleman · Deepali Jain · Mohit Sharma · Kumar Avinava Dubey · Ayzaan Wahid · Sumeet Singh · René Wagner · Tianli Ding · Chuyuan Fu · Arunkumar Byravan · Jacob J Varley · Alexey Gritsenko · Matthias Minderer · Dmitry Kalashnikov · Jonathan Tompson · Vikas Sindhwani · Krzysztof Choromanski
Abstract:
We introduce $\textbf{STRING}$: Separable Translationally Invariant Position Encodings. STRING extends Rotary Position Encodings, a recently proposed and widely used algorithm in large language models, via a unifying theoretical framework. Importantly, STRING still provides $\textbf{exact}$ translation invariance, including for token coordinates of arbitrary dimensionality, whilst maintaining a low computational footprint. These properties are especially important in robotics, where efficient 3D token representation is key. We integrate STRING into Vision Transformers with RGB(D) inputs (color plus optional depth), showing substantial gains, e.g. in open-vocabulary object detection and for robotics controllers. We complement our experiments with a rigorous mathematical analysis, proving the universality of our methods. Videos of STRING-based robotics controllers can be found here: https://sites.google.com/view/string-robotics.
Paperid:1612
Authors:Qingtian Zhu · Yumin Zheng · Yuling Sang · Yifan Zhan · Ziyan Zhu · Jun Ding · Yinqiang Zheng
Abstract:
Spatial Transcriptomics (ST) is a method that captures gene expression profiles aligned with spatial coordinates. The discrete spatial distribution and the super-high dimensional sequencing results make ST data challenging to model effectively. In this paper, we model ST in a continuous and compact manner with the proposed tool, SUICA, empowered by the great approximation capability of Implicit Neural Representations (INRs), which can enhance both the spatial density and the gene expression. Concretely, within the proposed SUICA, we incorporate a graph-augmented Autoencoder to effectively model the context information of the unstructured spots and provide informative, structure-aware embeddings for spatial mapping. We also tackle the extremely skewed distribution in a regression-by-classification fashion and enforce classification-based loss functions for the optimization of SUICA. In extensive experiments across a wide range of common ST platforms under varying degradations, SUICA outperforms both conventional INR variants and SOTA methods regarding numerical fidelity, statistical correlation, and bio-conservation. The predictions of SUICA also showcase amplified gene signatures that enrich the bio-conservation of the raw data and benefit subsequent analysis. The code will be made available.
Paperid:1613
Authors:Mujin Cheon · Dong-Yeun Koh · Jay Lee · Calvin Tsay
Abstract:
Conventional methods for Bayesian optimization (BO) primarily involve one-step optimal decisions (e.g., maximizing expected improvement of the next step). To avoid myopic behavior, multi-step lookahead BO algorithms such as rollout strategies consider the sequential decision-making nature of BO, i.e., as a stochastic dynamic programming (SDP) problem, demonstrating promising results in recent years. However, owing to the curse of dimensionality, most of these methods make significant approximations or suffer scalability issues, e.g., being limited to two-step lookahead. This paper presents a novel reinforcement learning (RL)-based framework for multi-step lookahead BO in high-dimensional black-box optimization problems. The proposed method enhances the scalability and decision-making quality of multi-step lookahead BO by efficiently solving the SDP of the BO process in a near-optimal manner using RL. We first introduce an Attention-DeepSets encoder to represent the state of knowledge to the RL agent and employ off-policy learning to accelerate its initial training. We then propose a multi-task, fine-tuning procedure based on end-to-end (encoder-RL) on-policy learning. We evaluate the proposed method, EARL-BO (Encoder Augmented RL for Bayesian Optimization), on both synthetic benchmark functions and real-world hyperparameter optimization problems, demonstrating significantly improved performance compared to existing multi-step lookahead and high-dimensional BO methods.
Paperid:1614
Authors:Suorong Yang · Peng Ye · Furao Shen · Dongzhan Zhou
Abstract:
Dynamic data selection aims to accelerate training without performance loss. However, reducing training data inherently limits data diversity, potentially hindering generalization. While data augmentation is widely used to enhance diversity, it is typically not optimized in conjunction with selection. As a result, directly combining these techniques fails to fully exploit their synergies. To tackle the challenge, we propose a novel online data training framework that, for the first time, unifies dynamic data selection and augmentation, achieving both training efficiency and enhanced performance. Our method estimates each sample's joint distribution of local density and multimodal semantic consistency, allowing for the targeted selection of augmentation-suitable samples while suppressing the inclusion of noisy or ambiguous data. This enables a more significant reduction in dataset size without sacrificing model generalization. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches on various benchmark datasets and architectures, e.g., reducing training costs by 50% on ImageNet-1k with lossless performance. Furthermore, our approach enhances noise resistance and improves model robustness, reinforcing its practical utility in real-world scenarios.
Paperid:1615
Authors:Jingyang Qiao · zhizhong zhang · Xin Tan · Yanyun Qu · Shouhong Ding · Yuan Xie
Abstract:
Continual Instruction Tuning (CIT) is adopted to continually instruct Large Models to follow human intent, dataset by dataset. Existing gradient updates are observed to heavily degrade performance on previous datasets during the CIT process. In contrast, the Exponential Moving Average (EMA) can track previous parameters, which helps decrease forgetting. Nonetheless, its fixed balance weight cannot cope with ever-changing datasets, leading to an imbalance between plasticity and stability. In this paper, we propose a general continual instruction tuning framework to address the challenge. Starting from the trade-off prerequisite and the EMA update, we propose an ideal condition for plasticity and stability. Based on a Taylor expansion of the loss function, we find that the optimal balance weight can be automatically determined by the gradients and learned parameters. Therefore, we propose a stable-plasticity balanced coefficient to avoid knowledge confusion. Based on the semantic similarity of the instructions, we can determine whether to retrain or expand the training parameters and allocate the most suitable parameters for the testing instances. Extensive experiments across multiple continual instruction tuning benchmarks demonstrate that our approach not only enhances anti-forgetting capabilities but also significantly improves overall continual tuning performance. For example, based on LLaVA-7B, the forgetting is reduced from 5.42 to 1.93. Our code will be made publicly available soon.
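For reference, the EMA update this framework builds on is a one-line recurrence. The sketch below uses a fixed balance weight, whereas the paper derives an adaptive stable-plasticity coefficient from gradients and learned parameters (not reproduced here).

```python
import numpy as np

def ema_update(theta_ema, theta, beta):
    """Exponential moving average of parameters: the EMA traces past
    solutions, which is what reduces forgetting during continual tuning."""
    return beta * theta_ema + (1.0 - beta) * theta

# With a fixed beta, the EMA lags the live parameters (plasticity limited)
# but never discards them entirely (stability preserved).
ema = np.zeros(3)
live = np.ones(3)
for _ in range(50):
    ema = ema_update(ema, live, beta=0.9)
assert np.all(ema > 0.99) and np.all(ema < 1.0)
```

The lag is exactly `beta**steps` of the gap to the live parameters; the paper's contribution is choosing `beta` adaptively per update rather than fixing it.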
Paperid:1616
Authors:Alvaro Rodriguez Abella · João Pedro Silvestre · Paulo Tabuada
Abstract:
A key component of transformers is the attention mechanism orchestrating how each token influences the propagation of every other token along the layers of a transformer. In this paper we provide a rigorous, mathematical analysis of the asymptotic properties of attention in transformers. Although we present several results based on different assumptions, all of them point to the same conclusion: all tokens asymptotically converge to each other, a phenomenon that has been empirically reported in the literature. Our findings are carefully compared with existing theoretical results and illustrated by simulations and experimental studies using the GPT-2 model.
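The collapse phenomenon can be observed even in a stripped-down simulation: iterating plain softmax self-attention (identity query/key/value maps, a toy stand-in for a trained transformer) shrinks the spread of the token cloud, consistent with the asymptotic convergence result.

```python
import numpy as np

def attention_step(X):
    # Softmax self-attention with identity query/key/value maps: each token
    # is replaced by a positively weighted average of all tokens.
    S = X @ X.T
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)
    return A @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))           # 6 tokens in 4 dimensions
spread0 = np.ptp(X, axis=0).max()
for _ in range(100):
    X = attention_step(X)
spread = np.ptp(X, axis=0).max()

# Averaging with strictly positive weights contracts the token cloud at
# every step, so the spread strictly decreases.
assert spread < spread0
```

Since each row of `A` is a probability vector with strictly positive entries, every layer maps tokens strictly inside the convex hull of the previous tokens, which is the mechanism behind the convergence.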
Paperid:1617
Authors:Malte Mosbach · Jan Niklas Ewertz · Angel Villar-Corrales · Sven Behnke
Abstract:
Learning a latent dynamics model provides a task-agnostic representation of an agent's understanding of its environment. Leveraging this knowledge for model-based reinforcement learning (RL) holds the potential to improve sample efficiency over model-free methods by learning from imagined rollouts. Furthermore, because the latent space serves as input to behavior models, the informative representations learned by the world model facilitate efficient learning of desired skills. Most existing methods rely on holistic representations of the environment’s state. In contrast, humans reason about objects and their interactions, predicting how actions will affect specific parts of their surroundings. Inspired by this, we propose Slot-Attention for Object-centric Latent Dynamics (SOLD), a novel model-based RL algorithm that learns object-centric dynamics models in an unsupervised manner from pixel inputs. We demonstrate that the structured latent space not only improves model interpretability but also provides a valuable input space for behavior models to reason over. Our results show that SOLD outperforms DreamerV3 and TD-MPC2, state-of-the-art model-based RL algorithms, across a range of benchmark robotic environments that require both relational reasoning and low-level manipulation capabilities. Videos and code are available at https://sold-anonymized.github.io.
Paperid:1618
Authors:Hideaki Kim · Tomoharu Iwata · Akinori Fujino
Abstract:
Kernel method-based intensity estimators, formulated within reproducing kernel Hilbert spaces (RKHSs), and classical kernel intensity estimators (KIEs) have been among the most easy-to-implement and feasible methods for estimating the intensity functions of inhomogeneous Poisson processes. While both approaches share the term *"kernel"*, they are founded on distinct theoretical principles, each with its own strengths and limitations. In this paper, we propose a novel regularized kernel method for Poisson processes based on the least squares loss and show that the resulting intensity estimator involves a specialized variant of the representer theorem: it has the dual coefficient of unity and coincides with classical KIEs. This result provides new theoretical insights into the connection between classical KIEs and kernel method-based intensity estimators, while enabling us to develop an efficient KIE by leveraging advanced techniques from RKHS theory. We refer to the proposed model as the *kernel method-based kernel intensity estimator* (K$^2$IE). Through experiments on synthetic datasets, we show that K$^2$IE achieves comparable predictive performance while significantly surpassing state-of-the-art kernel method-based estimators in computational efficiency.
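For concreteness, a classical KIE is just a kernel sum over observed event locations with all coefficients equal to one, which is the form the representer-theorem result recovers. A minimal sketch with a Gaussian kernel and invented event locations:

```python
import math

def gaussian_kernel(x, xi, h):
    # Gaussian smoothing kernel with bandwidth h.
    return math.exp(-0.5 * ((x - xi) / h) ** 2) / (h * math.sqrt(2 * math.pi))

def kie(x, events, h):
    """Classical kernel intensity estimator: a kernel sum over observed
    event locations, i.e., dual coefficients all equal to one."""
    return sum(gaussian_kernel(x, xi, h) for xi in events)

events = [0.0, 0.1, 0.2, 5.0]
# The estimated intensity is higher near the event cluster than far from it.
assert kie(0.1, events, h=0.5) > kie(2.5, events, h=0.5)
```

A general RKHS estimator would instead fit one coefficient per event; the paper's result is that the least-squares formulation forces those coefficients to unity, collapsing to this classical form.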
Paperid:1619
Authors:Mehrdad Ghadiri · Matthew Fahrbach · Yunbum Kook · Ali Jadbabaie
Abstract:
We study tensor completion (TC) through the lens of low-rank tensor decomposition (TD). Many TD algorithms use fast alternating minimization methods, which solve highly structured linear regression problems at each step (e.g., for CP, Tucker, and tensor-train decompositions). However, such algebraic structure is lost in TC regression problems, making direct extensions unclear. To address this, we propose a lifting approach that approximately solves TC regression problems using structured TD regression algorithms as black-box subroutines, enabling sublinear-time methods. We theoretically analyze the convergence rate of our approximate-Richardson-iteration-based algorithm, and we demonstrate on real-world tensors that its running time can be 100x faster than direct methods for CP completion.
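As background for the convergence analysis, plain Richardson iteration solves Ax = b via the update x ← x + ω(b − Ax); the paper's algorithm replaces the exact residual step with approximate structured TD regression solves. A minimal sketch of the exact version:

```python
import numpy as np

def richardson(A, b, omega, iters):
    """Richardson iteration x <- x + omega * (b - A x); converges when the
    spectral radius of (I - omega * A) is below one."""
    x = np.zeros_like(b)
    for _ in range(iters):
        x = x + omega * (b - A @ x)
    return x

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = richardson(A, b, omega=0.2, iters=200)
assert np.allclose(A @ x, b, atol=1e-6)
```

Here A is symmetric positive definite with eigenvalues in roughly [2.4, 4.6], so ω = 0.2 makes the iteration a contraction.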
Paperid:1620
Authors:Shih-Min Yang · Martin Magnusson · Johannes Stork · Todor Stoyanov
Abstract:
Soft Actor-Critic (SAC) has achieved notable success in continuous control tasks but struggles in sparse reward settings, where infrequent rewards make efficient exploration challenging. While novelty-based exploration methods address this issue by encouraging the agent to explore novel states, they are not trivial to apply to SAC. In particular, managing the interaction between novelty-based exploration and SAC’s inherently stochastic policy can lead to inefficient exploration and redundant sample collection. In this paper, we propose KEA (Keeping Exploration Alive), which tackles the inefficiencies in balancing exploration strategies when combining SAC with novelty-based exploration. KEA introduces an additional co-behavior agent that works alongside SAC and a switching mechanism to facilitate proactive coordination between exploration strategies from novelty-based exploration and the stochastic policy. This coordination allows the agent to maintain stochasticity in high-novelty regions, enhancing exploration efficiency and reducing repeated sample collection. We first analyze this potential issue in a 2D navigation task and then evaluate KEA on sparse reward control tasks from the DeepMind Control Suite. Compared to state-of-the-art novelty-based exploration baselines, our experiments show that KEA significantly improves learning efficiency and robustness in sparse reward setups.
Paperid:1621
Authors:Haofei Lu · Yifei Shen · Dongsheng Li · Junliang Xing · Dongqi Han
Abstract:
Diffusion models have shown great promise in decision-making, also known as diffusion planning. However, the slow inference speeds limit their potential for broader real-world applications. Here, we introduce Habi, a general framework that transforms powerful but slow diffusion planning models into fast decision-making models, mimicking the cognitive process in the brain whereby costly goal-directed behavior gradually transitions to efficient habitual behavior with repetitive practice. Even using a laptop CPU, the habitized model can achieve an average 800+ Hz decision-making frequency (faster than previous diffusion planners by orders of magnitude) on the standard offline reinforcement learning benchmark D4RL, while maintaining comparable or even higher performance compared to its corresponding diffusion planner. Our work proposes a fresh perspective on leveraging powerful diffusion models for real-world decision-making tasks. We also provide robust evaluations and analysis, offering insights from both biological and engineering perspectives for efficient and effective decision-making.
Paperid:1622
Authors:Zihang Liu · Tianyu Pang · Oleg Balabanov · Chaoqun Yang · Tianjin Huang · Lu Yin · Yaoqing Yang · Shiwei Liu
Abstract:
Existing Large Language Model (LLM) fine-tuning methods are not optimal. Low-rank-based parameter-efficient fine-tuning (PEFT) methods have superior training efficiency but inferior performance. On the other hand, Full Fine-tuning (Full FT) is capable but lacks efficiency. Sparse fine-tuning, which achieved notable success prior to the advent of LLMs by fine-tuning only a small fraction of parameters, has the potential to balance effectiveness and efficiency. Yet it has fallen behind in the era of modern LLMs, as it struggles to identify parameters truly critical to fine-tuning. In this work, we show that weights with the largest magnitude after low-rank approximation are critical weights for fine-tuning, which we call Principal Weights. Surprisingly, while magnitude-based sparse fine-tuning performs poorly as a baseline on LLMs, our empirical results show it becomes the most effective approach for PEFT after applying rank reduction. Furthermore, by storing only the optimizer states of Principal Weights, we can significantly reduce the memory overhead, resulting in notable memory savings. These findings lead us to Low-rank Informed Sparse Fine-Tuning (LIFT), a sparse fine-tuning approach that achieves superior fine-tuning performance over Full FT while maintaining memory efficiency comparable to common PEFT methods. In addition, compared to full fine-tuning and LoRA, LIFT not only demonstrates better learning on target domains such as arithmetic reasoning but also performs up to 20\% better at retaining source-domain knowledge in areas like commonsense reasoning. Code is submitted.
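As an illustrative sketch only (not the paper's implementation; the rank, density, and function name are assumptions), the Principal-Weight idea, selecting the largest-magnitude entries of a low-rank approximation of a weight matrix as the sparse fine-tuning mask, could look like:

```python
import numpy as np

def principal_weight_mask(W, rank=4, density=0.05):
    """Hypothetical sketch: boolean mask of the largest-magnitude entries
    of a rank-r approximation of W, used to pick weights for sparse FT."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W_lr = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]  # rank-r approximation
    k = max(1, int(density * W.size))               # number of weights to keep
    # threshold at the k-th largest magnitude of the low-rank approximation
    thresh = np.partition(np.abs(W_lr), W.size - k, axis=None)[W.size - k]
    return np.abs(W_lr) >= thresh

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
mask = principal_weight_mask(W, rank=4, density=0.05)
print(mask.sum())  # roughly 5% of the 4096 entries selected
```

Only the masked weights (and their optimizer states) would then be updated during fine-tuning, which is where the memory saving comes from.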
Paperid:1623
Authors:Junwei Su · Chuan Wu
Abstract:
This paper investigates the convergence behavior of score-based graph generative models (SGGMs). Unlike common score-based generative models (SGMs) that are governed by a single stochastic differential equation (SDE), SGGMs utilize a system of dependent SDEs, where the graph structure and node features are modeled separately, while accounting for their inherent dependencies. This distinction makes existing convergence analyses from SGMs inapplicable for SGGMs. In this work, we present the first convergence analysis for SGGMs, focusing on the convergence bound (the risk of generative error) across three key graph generation paradigms: (1) feature generation with a fixed graph structure, (2) graph structure generation with fixed node features, and (3) joint generation of both graph structure and node features. Our analysis reveals several unique factors specific to SGGMs (e.g., the topological properties of the graph structure) which significantly affect the convergence bound. Additionally, we offer theoretical insights into the selection of hyperparameters (e.g., sampling steps and diffusion length) and advocate for techniques like normalization to improve convergence. To validate our theoretical findings, we conduct a controlled empirical study using a synthetic graph model. The results in this paper contribute to a deeper theoretical understanding of SGGMs and offer practical guidance for designing more efficient and effective SGGMs.
Paperid:1624
Authors:Peiyan Hu · Xiaowei Qian · Wenhao Deng · Rui Wang · Haodong Feng · Ruiqi Feng · Tao Zhang · Long Wei · Yue Wang · Zhi-Ming Ma · Tailin Wu
Abstract:
The application of deep learning for partial differential equation (PDE)-constrained control is gaining increasing attention. However, existing methods rarely consider safety requirements crucial in real-world applications. To address this limitation, we propose Safe Diffusion Models for PDE Control (SafeDiffCon), which introduce the uncertainty quantile as a model uncertainty quantification to achieve optimal control under safety constraints through both post-training and inference phases. Firstly, our approach post-trains a pre-trained diffusion model to generate control sequences that better satisfy safety constraints while achieving improved control objectives via a reweighted diffusion loss, which incorporates the uncertainty quantile estimated using conformal prediction. Secondly, during inference, the diffusion model dynamically adjusts both its generation process and parameters through iterative guidance and fine-tuning, conditioned on control targets while simultaneously integrating the estimated uncertainty quantile. We evaluate SafeDiffCon on three control tasks: the 1D Burgers' equation, 2D incompressible fluid, and a controlled nuclear fusion problem. Results demonstrate that SafeDiffCon is the only method that satisfies all safety constraints, whereas other classical and deep learning baselines fail. Furthermore, while adhering to safety constraints, SafeDiffCon achieves the best control performance.
Paperid:1625
Authors:Junbo Li · Zhangyang “Atlas” Wang · qiang liu
Abstract:
Offline preference alignment for language models such as Direct Preference Optimization (DPO) is favored for its effectiveness and simplicity, eliminating the need for costly reinforcement learning. Various offline algorithms have been developed for different data settings, yet they lack a unified understanding. In this study, we introduce Prior-Informed Preference Alignment (PIPA), a unified, RL-free probabilistic framework that formulates language model preference alignment as a Maximum Likelihood Estimation (MLE) problem with prior constraints. This method effectively accommodates both paired and unpaired data, as well as answer- and step-level annotations. We illustrate that DPO and KTO are special cases with different prior constraints within our framework. By integrating different types of prior information, we develop two variations of PIPA: PIPA-M and PIPA-N. Both algorithms demonstrate a $3\sim10\%$ performance enhancement on the GSM8K and MATH benchmarks across all configurations, achieving these gains without additional training or computational costs compared to existing algorithms.
Paperid:1626
Authors:Xingyu Zhou · Yulian Wu · Francesco Orabona
Abstract:
In this paper, we theoretically investigate the effects of noisy labels in offline alignment, with a focus on the interplay between privacy and robustness against adversarial corruption. Specifically, under linear modeling assumptions, we present a unified analysis covering both reinforcement learning from human feedback and direct preference optimization under different privacy-corruption scenarios, such as Local differential privacy-then-Corruption (LTC), where human preference labels are privatized before being corrupted by an adversary, and Corruption-then-Local differential privacy (CTL), where labels are corrupted before privacy protection. Our analysis leverages a reduction framework that reduces the offline alignment problem under linear modeling assumptions to parameter estimation in logistic regression. This framework allows us to establish an interesting separation result between LTC and CTL, demonstrating that LTC presents a greater challenge than CTL in offline alignment, even under linear models. As important by-products, our findings also advance the state-of-the-art theoretical results in offline alignment under privacy-only or corruption-only scenarios.
Paperid:1627
Authors:Alessandro Montenegro · Marco Mussi · Matteo Papini · Alberto Maria Metelli
Abstract:
*Policy gradient* (PG) methods are effective *reinforcement learning* (RL) approaches, particularly for continuous problems. While they optimize stochastic (hyper)policies via action- or parameter-space exploration, real-world applications often require deterministic policies. Existing PG convergence guarantees to deterministic policies assume a fixed stochasticity in the (hyper)policy, tuned according to the desired final suboptimality, whereas practitioners commonly use a dynamic stochasticity level. This work provides the theoretical foundations for this practice. We introduce PES, a phase-based method that reduces stochasticity via a deterministic schedule while running PG subroutines with fixed stochasticity in each phase. Under gradient domination assumptions, PES achieves last-iterate convergence to the optimal deterministic policy with a sample complexity of $\widetilde{\mathcal{O}}(\varepsilon^{-5})$. Additionally, we analyze the common practice, termed SL-PG, of jointly learning stochasticity (via an appropriate parameterization) and (hyper)policy parameters. We show that SL-PG also ensures last-iterate convergence with a rate $\widetilde{\mathcal{O}}(\varepsilon^{-3})$, but only to the optimal stochastic (hyper)policy, requiring stronger assumptions compared to PES.
Paperid:1628
Authors:Xinting Liao · Weiming Liu · Jiaming Qian · Pengyang Zhou · Jiahe Xu · Wenjie Wang · Chaochao Chen · Xiaolin Zheng · Tat-Seng Chua
Abstract:
Federated prompt learning (FPL) for vision-language models is a powerful approach to collaboratively adapt models across distributed clients while preserving data privacy. However, existing FPL approaches suffer from a trade-off between performance and robustness, particularly under out-of-distribution (OOD) shifts, limiting their reliability in real-world scenarios. The inherent in-distribution (ID) data heterogeneity among different clients makes it more challenging to maintain this trade-off. To fill this gap, we introduce a Federated OOD-aware Context Optimization (FOCoOp) framework, which captures diverse distributions among clients using ID global prompts, local prompts, and OOD prompts. Specifically, FOCoOp leverages the three sets of prompts to create both class-level and distribution-level separations, which adapt to OOD shifts through bi-level distribution-robustness optimization. Additionally, FOCoOp improves the discrimination between global prompts and OOD prompts using semi-unbalanced optimal transport, while effectively calibrating seemingly-OOD prompts that are aligned with data unseen in individual clients but present in other clients. Extensive experiments on real-world datasets demonstrate that FOCoOp effectively captures decentralized heterogeneous distributions and enhances robustness to different OOD shifts.
Paperid:1629
Authors:Ziyao Wang · Muneeza Azmat · Ang Li · Raya Horesh · Mikhail Yurochkin
Abstract:
Large Language Models (LLMs) often excel in specific domains but fall short in others due to the limitations of their training. Thus, enabling LLMs to solve problems collaboratively by integrating their complementary knowledge promises to improve their performance across domains. To realize this potential, we introduce a novel Collaborative Speculative Decoding (CoSD) algorithm that enables efficient LLM knowledge fusion at test time without requiring additional model training. CoSD employs a draft model to generate initial sequences and an easy-to-learn rule or decision tree to decide when to invoke an assistant model to improve these drafts. CoSD not only enhances knowledge fusion but also improves inference efficiency, is transferable across domains, and offers greater explainability. Experimental results demonstrate that CoSD improves accuracy by up to 10% across benchmarks compared to existing methods, providing a scalable and effective solution for LLM-based applications.
Paperid:1630
Authors:Yuxiang Zhao · zhuomin chai · Xun Jiang · Qiang Xu · Runsheng Wang · Yibo Lin
Abstract:
Recent advancements have integrated various deep-learning methodologies into physical design, aiming to accelerate workflows and surpass human-devised solutions. However, prior research has primarily concentrated on developing task-specific networks, which necessitates a significant investment of time to construct large, specialized datasets and results in the unintended isolation of models across different tasks. In this paper, we introduce DeepLayout, the first general representation learning framework specifically designed for backend circuit design. To address the distinct characteristics of post-placement circuits, including topological connectivity and geometric distribution, we propose a hybrid encoding architecture that integrates GNNs with spatial transformers. Additionally, the framework includes a flexible decoder module that accommodates a variety of task types, supporting multiple hierarchical outputs such as nets and layouts. To mitigate the high annotation costs associated with layout data, we introduce a mask-based self-supervised learning approach designed explicitly for layout representation. This strategy involves a carefully devised masking approach tailored to layout features, precise reconstruction guidance, and, most critically, two key supervised learning tasks. We conduct extensive experiments on large-scale industrial datasets, demonstrating that DeepLayout surpasses state-of-the-art (SOTA) methods specialized for individual tasks on two crucial layout quality assessment benchmarks. The experimental results underscore the framework's robust capability to learn the intrinsic properties of circuits.
Paperid:1631
Authors:Michael Kirchhof · James Thornton · Pierre Ablin · Louis Béthune · Eugene Ndiaye · Marco Cuturi
Abstract:
The adoption of text-to-image diffusion models raises concerns over reliability, drawing scrutiny under the lens of various metrics like calibration, fairness, or compute efficiency. We focus in this work on two issues that arise when deploying these models: a lack of diversity when prompting images, and a tendency to recreate images from the training set. To solve both problems, we propose a method that coaxes the sampled trajectories of pretrained diffusion models to land on images that fall outside of a reference set. We achieve this by adding repellency terms to the diffusion SDE throughout the generation trajectory, which are triggered whenever the path is expected to land too closely to an image in the shielded reference set. Our method is sparse in the sense that these repellency terms are zero and inactive most of the time, and even more so towards the end of the generation trajectory. Our method, named SPELL for sparse repellency, can be used either with a static reference set that contains protected images, or dynamically, by updating the set at each timestep with the expected images concurrently generated within a batch, and with the images of previously generated batches. We show that adding SPELL to popular diffusion models improves their diversity while impacting their FID only marginally, and performs comparatively better than other recent training-free diversity methods. We also demonstrate how SPELL can ensure a shielded generation away from a very large set of protected images by considering all 1.2M images from ImageNet as the protected set.
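A minimal sketch of the repellency idea (illustrative only; the trigger radius, strength, and function name are assumptions, and the actual method operates inside the diffusion SDE rather than on raw vectors): a term that is zero far from the shielded set and pushes the predicted endpoint away when it lands too close.

```python
import numpy as np

def spell_repellency(x_pred, protected, radius=1.0, strength=5.0):
    """Hypothetical sketch of a sparse repellency term: zero unless the
    predicted denoised sample x_pred falls within `radius` of a protected
    example, in which case it points away from that example."""
    term = np.zeros_like(x_pred)
    for p in protected:
        diff = x_pred - p
        d = np.linalg.norm(diff)
        if d < radius:  # triggered only near the shielded set (hence "sparse")
            term += strength * (radius - d) * diff / (d + 1e-8)
    return term  # would be added to the diffusion drift

protected = [np.zeros(8)]
far = np.full(8, 10.0)   # well outside the repellency radius
near = np.full(8, 0.1)   # inside the repellency radius
print(np.allclose(spell_repellency(far, protected), 0))          # True
print(np.linalg.norm(spell_repellency(near, protected)) > 0)     # True
```

The sparsity is what keeps the guidance cheap: for most of the trajectory the term is identically zero and the sampler behaves as the unmodified model.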
Paperid:1632
Authors:Chenyi yang · Wenjie Nie · Yuxin Zhang · Yuhang Wu · Xiawu Zheng · GUANNAN JIANG · Rongrong Ji
Abstract:
N:M sparsity stands as a progressively important tool for DNN compression, achieving practical speedups by stipulating at most N nonzero components within M sequential weights. Unfortunately, most existing works identify the N:M sparse mask through dense backward propagation to update all weights, which incurs exorbitant training costs. In this paper, we introduce BAME, a method that maintains consistent sparsity throughout the N:M sparse training process. BAME perpetually keeps both forward and backward propagation sparse, while iteratively performing weight pruning-and-regrowing within designated weight blocks to tailor the N:M mask. These blocks are selected through a joint assessment based on accumulated mask oscillation frequency and the expected loss reduction of mask adaptation, thereby ensuring stable and efficient identification of the optimal N:M mask. Our empirical results substantiate the effectiveness of BAME, illustrating that it performs comparably to or better than previous works that fully maintain dense backward propagation during training. For instance, BAME attains a 72.0% top-1 accuracy while training a 1:16 sparse ResNet-50 on ImageNet, eclipsing SR-STE by 0.5%, despite a 2.37× reduction in training FLOPs. Code will be released.
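For intuition, the N:M constraint itself (at most N nonzero components within every M sequential weights) can be sketched as a simple magnitude-based projection. This is purely illustrative: BAME's actual mask selection uses oscillation frequency and expected loss reduction, not this one-shot projection, and the function name is an assumption.

```python
import numpy as np

def project_nm(w, n=1, m=16):
    """Sketch: project a flat weight vector onto N:M sparsity by keeping the
    n largest-magnitude entries in each block of m consecutive weights."""
    w = w.reshape(-1, m).copy()
    # indices of the (m - n) smallest-magnitude entries per block, to be zeroed
    idx = np.argsort(np.abs(w), axis=1)[:, :-n]
    np.put_along_axis(w, idx, 0.0, axis=1)
    return w.reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(64)
sparse = project_nm(w, n=1, m=16)
print((sparse != 0).sum())  # 4 blocks of 16 weights, 1 nonzero each
```

Hardware that supports structured sparsity can skip the zeroed positions, which is where the practical speedup in the abstract comes from.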
Paperid:1633
Authors:Jindong Tong · Hongcheng Liu · Johannes Royset
Abstract:
This paper focuses on non-asymptotic confidence bounds (CBs) for the optimal values of stochastic optimization (SO) problems. Existing approaches often rely on two conditions that may be restrictive: the need for a global Lipschitz constant and the assumption of light-tailed distributions. Beyond either of these conditions, it remains largely unknown whether computable CBs can be constructed. In view of this literature gap, we provide three key findings: (i) Based on the conventional formulation of sample average approximation (SAA), we derive non-Lipschitzian CBs for convex SO problems under heavy tails. (ii) We explore diametrical risk minimization (DRM), a recently introduced modification of SAA, and attain non-Lipschitzian CBs for nonconvex SO problems in light-tailed settings. (iii) We extend our analysis of DRM to handle heavy-tailed randomness by utilizing properties of formulations for training over-parameterized classification models.
Paperid:1634
Authors:Wenbin Wang · Yongcheng Jing · Liang Ding · Yingjie Wang · Li Shen · Yong Luo · Bo Du · Dacheng Tao
Abstract:
High-resolution (HR) image perception remains a key challenge in multimodal large language models (MLLMs). To overcome the limitations of existing methods, this paper shifts away from prior dedicated heuristic approaches and revisits the most fundamental route to HR perception: enhancing the long-context capability of MLLMs, driven by recent advances in long-context techniques like retrieval-augmented generation (RAG) for general LLMs. Towards this end, this paper presents the first study exploring the use of RAG to address HR perception challenges. Specifically, we propose Retrieval-Augmented Perception (RAP), a training-free framework that retrieves and fuses relevant image crops while preserving spatial context using the proposed Spatial-Awareness Layout. To accommodate different tasks, the proposed Retrieved-Exploration Search (RE-Search) dynamically selects the optimal number of crops based on model confidence and retrieval scores. Experimental results on HR benchmarks demonstrate the significant effectiveness of RAP, with LLaVA-v1.5-13B achieving a 43\% improvement on $V^*$ Bench and 19\% on HR-Bench. Code can be found in the supplementary material.
Paperid:1635
Authors:Wangzhi Zhan · Chen Jianpeng · Dongqi Fu · Dawei Zhou
Abstract:
Metamaterials are artificial materials designed to exhibit properties unseen in nature, such as ultra-stiffness and negative material indices. In mechanical metamaterial design, three key modalities are typically involved, i.e., 3D topology, density condition, and mechanical property. Real-world complex application scenarios place demanding requirements on machine learning models to consider all three modalities together. However, a comprehensive literature review indicates that most existing works consider only two modalities, e.g., predicting mechanical properties given the 3D topology or generating 3D topology given the required properties. There thus remains a significant gap in state-of-the-art machine learning models' ability to capture all three modalities. Hence, we propose a unified model named UniMate, which consists of a modality alignment module and a synergetic diffusion generation module. Experiments indicate that UniMate outperforms baseline models on the topology generation, property prediction, and condition confirmation tasks by 80.2%, 5.1%, and 50.2%, respectively. We open-source our proposed UniMate model and corresponding results at https://anonymous.4open.science/status/UniMate-47F2.
Paperid:1636
Authors:Florence Regol · Leo Schwinn · Kyle Sprague · Mark Coates · Thomas L Markovich
Abstract:
A significant challenge in maintaining real-world machine learning models is responding to the continuous and unpredictable evolution of data. Most practitioners are faced with the difficult question: when should I retrain or update my machine learning model? This seemingly straightforward problem is particularly challenging for three reasons: 1) decisions must be made based on very limited information - we usually have access to only a few examples, 2) the nature, extent, and impact of the distribution shift are unknown, and 3) it involves specifying a cost ratio between retraining and poor performance, which can be hard to characterize. Existing works address certain aspects of this problem, but none offer a comprehensive solution. Distribution shift detection falls short as it cannot account for the cost trade-off; the scarcity of the data, paired with its unusual structure, makes it a poor fit for existing offline reinforcement learning methods, and the online learning formulation overlooks key practical considerations. To address this, we present a principled formulation of the retraining problem and propose an uncertainty-based method that makes decisions by continually forecasting the evolution of model performance evaluated with a bounded metric. Our experiments, addressing classification tasks, show that the method consistently outperforms existing baselines on 7 datasets. We thoroughly assess its robustness to varying cost trade-off values and mis-specified cost trade-offs.
Paperid:1637
Authors:Alexandre Capone · Ryan Cosner · Aaron Ames · Sandra Hirche
Abstract:
Control tasks with safety requirements under high levels of model uncertainty are increasingly common. Machine learning techniques are frequently used to address such tasks, typically by leveraging model error bounds to specify robust constraint-based safety filters. However, if the learned model uncertainty is very high, the corresponding filters are potentially invalid, meaning no control input satisfies the constraints imposed by the safety filter. While most works address this issue by assuming some form of safe backup controller, ours tackles it by collecting additional data on the fly using a Gaussian process bandit-type algorithm. We combine a control barrier function with a learned model to specify a robust certificate that ensures safety if feasible. Whenever infeasibility occurs, we leverage the control barrier function to guide exploration, ensuring the collected data contributes toward the closed-loop system safety. By combining a safety filter with exploration in this manner, our method provably achieves safety in a general setting that does not require any prior model or backup controller, provided that the true system lies in a reproducing kernel Hilbert space. To the best of our knowledge, it is the first safe learning-based control method that achieves this.
Paperid:1638
Authors:Míriam Barrabés · Daniel Mas Montserrat · Kapal Dev · Alexander Ioannidis
Abstract:
Feature shifts between data sources are present in healthcare, biomedical, socioeconomic, financial, and multi-sensor data, among others, as a consequence of heterogeneous data files, noisy measurements, or inconsistent processing and standardization pipelines. Localizing shifted features is important to address the underlying cause of the shift and correct or filter the data to avoid degrading downstream analysis. While many techniques can detect distribution shifts, localizing the features causing them is still challenging, with current solutions being either inaccurate or not scalable to large and high-dimensional datasets. In this work, we introduce the Feature Shift Localization Network (FSL-Net), a neural network that can localize feature shifts in large and high-dimensional datasets in a fast and accurate manner. The network, trained on a large number of datasets, learns to extract the statistical properties of the datasets and can localize feature shifts from previously unseen datasets and shifts without the need for re-training. The code and ready-to-use trained model are available at \url{hidden-for-submission}.
Paperid:1639
Authors:Wei Liu · Junlong Li · Xiwen Zhang · Fan Zhou · Yu Cheng · Junxian He
Abstract:
Self-evolving training, where models iteratively learn from their own outputs, has emerged as a key approach for complex reasoning tasks, addressing the scarcity of high-quality chain-of-thought data. However, its effectiveness in multimodal reasoning, a domain more intricate than text-only reasoning, remains underexplored, and the understanding of critical factors in this training paradigm remains limited. Furthermore, a central challenge for this training method is performance saturation, which impedes further improvements and scalability. Inspired by reinforcement learning (RL), in this paper, we reframe self-evolving training for multimodal reasoning through the lens of RL, identifying three pivotal factors: $\textit{Training Method}$, $\textit{Reward Model}$, and $\textit{Prompt Variation}$. Through systematic analysis, we establish relatively optimal design principles that significantly enhance multimodal reasoning capabilities. Moreover, delving deeper into the training dynamics, we uncover the roots of saturation and propose a new automatic balancing mechanism to mitigate this limitation. Building on these insights, we propose M-STaR (**M**ultimodal **S**elf-evolving **T**r**a**ining for **R**easoning), a framework that achieves consistent performance gains across models of varying sizes and diverse benchmarks. All resources will be made publicly available.
Paperid:1640
Authors:Vincent Herrmann · Róbert Csordás · Jürgen Schmidhuber
Abstract:
Detecting when a neural sequence model does "interesting" computation is an open problem. The next token prediction loss is a poor indicator: low loss can stem from trivially predictable sequences that are uninteresting, while high loss may reflect unpredictable but also irrelevant information that can be ignored by the model. We propose a better metric: measuring the model's ability to predict its own future hidden states. We show empirically that this metric, in contrast to the next token prediction loss, correlates with the intuitive interestingness of the task. To measure predictability, we introduce the architecture-agnostic "prediction of hidden states" (PHi) layer that serves as an information bottleneck on the main pathway of the network (e.g., the residual stream in Transformers). We propose a novel learned predictive prior that enables us to measure the novel information gained in each computation step, which serves as our metric. We show empirically that our metric predicts the description length of formal languages learned in-context, the complexity of mathematical reasoning problems, and the correctness of self-generated reasoning chains.
Paperid:1641
Authors:Zecheng Hao · Qichao Ma · Kang Chen · Yi Zhang · Zhaofei Yu · Tiejun Huang
Abstract:
The Spiking Neural Network (SNN), a brain-inspired and energy-efficient network, currently faces the pivotal challenge of finding a suitable and efficient learning framework. The predominant training methodologies, namely Spatial-Temporal Back-propagation (STBP) and ANN-SNN Conversion, are encumbered by substantial training overhead or pronounced inference latency, which impedes the advancement of SNNs in scaling to larger networks and navigating intricate application domains. In this work, we propose a novel parallel conversion learning framework, which establishes a mathematical mapping between each time-step of the parallel spiking neurons and the cumulative spike firing rate. We theoretically validate the lossless and sorting properties of the conversion process, and point out the optimal shifting distance for each step. Furthermore, by integrating the above framework with a distribution-aware error calibration technique, we can achieve efficient conversion towards more general activation functions or training-free circumstances. Extensive experiments have confirmed the significant performance advantages of our method for various conversion cases under ultra-low time latency. To the best of our knowledge, this is the first work that jointly utilizes parallel spiking calculation and ANN-SNN Conversion, providing a highly promising approach for SNN supervised training.
Paperid:1642
Authors:Abdulkadir Gokce · Martin Schrimpf
Abstract:
When trained on large-scale object classification datasets, certain artificial neural network models begin to approximate core object recognition (COR) behaviors and neural response patterns in the primate visual ventral stream (VVS). While recent machine learning advances suggest that scaling model size, dataset size, and compute resources improve task performance, the impact of scaling on brain alignment remains unclear. In this study, we explore scaling laws for modeling the primate VVS by systematically evaluating over 600 models trained under controlled conditions on benchmarks spanning V1, V2, V4, IT and COR behaviors. We observe that while behavioral alignment continues to scale with larger models, neural alignment saturates. This observation remains true across model architectures and training datasets, even though models with stronger inductive bias and datasets with higher-quality images are more compute-efficient. Increased scaling is especially beneficial for higher-level visual areas, where small models trained on few samples exhibit only poor alignment. Our results suggest that while scaling alone might suffice for alignment with human core object recognition behavior, it will not yield improved models of the brain's visual ventral stream with current architectures and datasets, highlighting the need for novel strategies in building brain-like models.
Paperid:1643
Authors:Jongha (Jon) Ryu · Abhin Shah · Gregory Wornell
Abstract:
This paper studies a family of estimators based on noise-contrastive estimation (NCE) for learning unnormalized distributions. The main contribution of this work is to provide a unified perspective on various methods for learning unnormalized distributions, which have been independently proposed and studied in separate research communities, through the lens of NCE. This unified view offers new insights into existing estimators. Specifically, for exponential families, we establish the finite-sample convergence rates of the proposed estimators under a set of regularity assumptions, most of which are new.
Paperid:1644
Authors:Gholamali Aminian · Amir R. Asadi · Tian Li · Ahmad Beirami · Gesine Reinert · Samuel Cohen
Abstract:
The generalization error (risk) of a supervised statistical learning algorithm quantifies its prediction ability on previously unseen data. Inspired by exponential tilting, \citet{li2020tilted} proposed the {\it tilted empirical risk} (TER) as a nonlinear risk metric for machine learning applications such as classification and regression problems. In this work, we examine the generalization error of the tilted empirical risk in the robustness regime under \textit{negative tilt}. Our first contribution is to provide uniform and information-theoretic bounds on the {\it tilted generalization error}, defined as the difference between the population risk and the tilted empirical risk, under negative tilt for unbounded loss functions with bounded $(1+\epsilon)$-th moment for some $\epsilon\in(0,1]$, with a convergence rate of $O(n^{-\epsilon/(1+\epsilon)})$, where $n$ is the number of training samples, revealing a novel application of TER under no distribution shift. Secondly, we study the robustness of the tilted empirical risk with respect to noisy outliers at training time and provide theoretical guarantees under distribution shift for the tilted empirical risk. We empirically corroborate our findings in simple experimental setups where we evaluate our bounds to select the value of the tilt in a data-driven manner.
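The tilted empirical risk of \citet{li2020tilted} has a closed form that makes the effect of negative tilt easy to see; the sketch below uses made-up loss values for illustration.

```python
import math

def tilted_empirical_risk(losses, t):
    """Tilted empirical risk: (1/t) * log(mean(exp(t * loss_i))).
    Negative t down-weights large (outlier) losses; positive t emphasizes them."""
    n = len(losses)
    return (1.0 / t) * math.log(sum(math.exp(t * l) for l in losses) / n)

losses = [0.1, 0.2, 0.15, 5.0]  # one noisy outlier among small losses
erm = sum(losses) / len(losses)           # ordinary empirical risk
ter_neg = tilted_empirical_risk(losses, t=-2.0)
print(erm)       # dominated by the outlier
print(ter_neg)   # closer to the bulk of the losses
```

As $t \to 0$ the tilted risk recovers the ordinary empirical risk, which is why the tilt value can be selected in a data-driven manner as the abstract describes.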
Paperid:1645
Authors:Ke Zhu · Shu Yang · Xiaofei Wang
Abstract:
External controls from historical trials or observational data can augment randomized controlled trials when large-scale randomization is impractical or unethical, such as in drug evaluation for rare diseases. However, non-randomized external controls can introduce biases, and existing Bayesian and frequentist methods may inflate the type I error rate, particularly in small-sample trials where external data borrowing is most critical. To address these challenges, we propose a randomization inference framework that ensures finite-sample exact and model-free type I error rate control, adhering to the “analyze as you randomize” principle to safeguard against hidden biases. Recognizing that biased external controls reduce the power of randomization tests, we leverage conformal inference to develop an individualized test-then-pool procedure that selectively borrows comparable external controls to improve power. Our approach incorporates selection uncertainty into randomization tests, providing valid post-selection inference. Additionally, we propose an adaptive procedure to optimize the selection threshold by minimizing the mean squared error across a class of estimators encompassing both no-borrowing and full-borrowing approaches. The proposed methods are supported by non-asymptotic theoretical analysis, validated through simulations, and applied to a randomized lung cancer trial that integrates external controls from the National Cancer Database.
Paperid:1646
Authors:Guanglong Sun · Hongwei Yan · Liyuan Wang · Qian Li · Bo Lei · Yi Zhong
Abstract:
Knowledge distillation (KD) is a powerful strategy for training deep neural networks (DNNs). While it was originally proposed to train a more compact “student” model from a large “teacher” model, many recent efforts have focused on adapting it as an effective way to promote generalization of the model itself, such as online KD and self KD. Here, we propose an easy-to-use and compatible strategy named Spaced KD to improve the effectiveness of both online KD and self KD, in which the student model distills knowledge from a teacher model trained with a space interval ahead. This strategy is inspired by a prominent theory named the spacing effect in the field of biological learning and memory, positing that appropriate intervals between learning trials can significantly enhance learning performance. We provide an in-depth theoretical and empirical analysis showing that the benefits of the proposed spacing effect in KD stem from seeking flat minima during stochastic gradient descent (SGD). We perform extensive experiments to demonstrate the effectiveness of our Spaced KD in improving the learning performance of DNNs (e.g., the additional performance gain is up to 2.31% and 3.34% on Tiny-ImageNet over online KD and self KD, respectively).
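The spacing determines which teacher snapshot supplies the soft targets; the distillation loss itself is the standard temperature-softened KL divergence, sketched below (temperature value illustrative, not from the paper):

```python
import math

def softmax(logits, temperature=1.0):
    # temperature-scaled softmax over a list of logits
    exps = [math.exp(l / temperature) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Standard distillation loss: T^2 * KL(teacher || student)
    computed on temperature-softened distributions. In Spaced KD,
    teacher_logits would come from a checkpoint trained a space
    interval ahead of the student."""
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```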
Paperid:1647
Authors:Anqi Lu · Junchi Yan
Abstract:
For the fundamental linear programming (LP) problems, the simplex method remains popular, which usually requires an appropriate initial basis as a warm start to accelerate the solving process. Predicting an initial basis close to an optimal one can often accelerate the solver, but a closer initial basis does not always result in greater acceleration. To achieve better acceleration, we propose a GNN model based on a tripartite graph representation inspired by LP duality. This approach enables more effective feature extraction for general LP problems and enhances the expressiveness of GNNs. Additionally, we introduce novel loss functions targeting basic variable selection and basis feasibility, along with data preprocessing schemes, to further improve learning capability. In addition to achieving high prediction accuracy, we enhance the quality of the initial basis for practical use. Experimental results show that our approach greatly surpasses the state-of-the-art method in predicting the initial basis with greater accuracy and in reducing the number of iterations and solving time of the LP solver.
Paperid:1648
Authors:Teng Huang · Bin-Bin Jia · Min-Ling Zhang
Abstract:
In multidimensional classification (MDC), the semantics of objects are characterized by multiple class variables from different dimensions. Existing MDC approaches focus on designing effective class dependency modeling strategies to enhance classification performance. However, the intercoupling of multiple class variables poses a significant challenge to the precise modeling of class dependencies. In this paper, we make the first attempt towards escaping from class dependency modeling for addressing MDC problems. Accordingly, a novel MDC approach named DCOM is proposed by decoupling the interactions of different dimensions in MDC. Specifically, DCOM endeavors to identify a latent factor that encapsulates the most salient and critical feature information. This factor will facilitate partial conditional independence among class variables conditioned on both the original feature vector and the learned latent embedding. Once the conditional independence is established, classification models can be readily induced by employing simple neural networks on each dimension. Extensive experiments conducted on benchmark data sets demonstrate that DCOM outperforms other state-of-the-art MDC approaches.
Paperid:1649
Authors:Haoye Lu · Qifan Wu · Yaoliang Yu
Abstract:
Recent diffusion-based generative models achieve remarkable results by training on massive datasets, yet this practice raises concerns about memorization and copyright infringement. A proposed remedy is to train exclusively on noisy data with potential copyright issues, ensuring the model never observes original content. However, through the lens of deconvolution theory, we show that although it is theoretically feasible to learn the data distribution from noisy samples, the practical challenge of collecting sufficient samples makes successful learning nearly unattainable. To overcome this limitation, we propose to pretrain the model with a small fraction of clean data to guide the deconvolution process. Combined with our Stochastic Forward--Backward Deconvolution (SFBD) method, we attain an FID of $6.31$ on CIFAR-10 with just $4\%$ clean images (and $3.58$ with $10\%$). Theoretically, we prove that SFBD guides the model to learn the true data distribution. The result also highlights the importance of pretraining on limited but clean data or the alternative from similar datasets. Empirical studies further support these findings and offer additional insights.
Paperid:1650
Authors:Rajat Rasal · Avinash Kori · Fabio De Sousa Ribeiro · Tian Xia · Ben Glocker
Abstract:
Counterfactual image generation presents significant challenges, including preserving identity, maintaining perceptual quality, and ensuring faithfulness to an underlying causal model. While existing autoencoding frameworks admit semantic latent spaces which can be manipulated for causal control, they struggle with scalability and fidelity. Advancements in diffusion models present opportunities for improving counterfactual image editing, having demonstrated state-of-the-art visual quality, human-aligned perception and representation learning capabilities. Here, we present a suite of diffusion-based causal mechanisms, introducing the notions of spatial, semantic and dynamic abduction. We propose a general framework that integrates semantic representations into diffusion models through the lens of Pearlian causality to edit images via a counterfactual reasoning process. To the best of our knowledge, ours is the first work to consider high-level semantic identity preservation for diffusion counterfactuals and to demonstrate how semantic control enables principled trade-offs between faithful causal control and identity preservation.
Paperid:1651
Authors:Xun Wang · Jing Xu · Franziska Boenisch · Michael Backes · Christopher A. Choquette Choo · Adam Dziedzic
Abstract:
Prompting has become a dominant paradigm for adapting large language models (LLMs). While discrete (textual) prompts are widely used for their interpretability, soft (parameter) prompts have recently gained traction in APIs. This is because they can encode information from more training samples while minimizing the user's token usage, leaving more space in the context window for task-specific input. However, soft prompts are tightly coupled to the LLM they are tuned on, limiting their generalization to other LLMs. This constraint is particularly problematic for efficiency and privacy: (1) tuning prompts on each LLM incurs high computational costs, especially as LLMs continue to grow in size. Additionally, (2) when the LLM is hosted externally, soft prompt tuning often requires sharing private data with the LLM provider. For instance, this is the case with the NVIDIA NeMo API. To address these issues, we propose POST (Privacy Of Soft prompt Transfer), a framework that enables private tuning of soft prompts on a small model and subsequently transfers these prompts to a larger LLM. POST uses knowledge distillation to derive a small model directly from the large LLM to improve prompt transferability, tunes the soft prompt locally---optionally with differential privacy guarantees---and transfers it back to the larger LLM using a small public dataset. Our experiments show that POST reduces computational costs, preserves privacy, and effectively transfers high-utility soft prompts.
Paperid:1652
Authors:Boqi Li · Youjun Wang · Weiwei Liu
Abstract:
Continual learning (CL) focuses on the ability of models to learn sequentially from a stream of tasks. A major challenge in CL is catastrophic forgetting (CF). CF is a phenomenon where the model experiences significant performance degradation on previously learned tasks after training on new tasks. Although CF is commonly observed in convolutional neural networks (CNNs), theoretical understanding of CF in CNNs remains limited. To fill the gap, we present a theoretical analysis of CF in a two-layer CNN. By employing a multi-view data model, we analyze the learning dynamics of different features throughout CL and derive theoretical insights. The findings are supported by empirical results from both simulated and real-world datasets.
Paperid:1653
Authors:Seungyul Han · Jeongmo Kim · Yisak Park · Minung Kim
Abstract:
Meta reinforcement learning aims to develop policies that generalize to unseen tasks sampled from a task distribution. While context-based meta-RL methods improve task representation using task latents, they often struggle with out-of-distribution (OOD) tasks. To address this, we propose Task-Aware Virtual Training (TAVT), a novel algorithm that accurately captures task characteristics for both training and OOD scenarios using metric-based representation learning. Our method successfully preserves task characteristics in virtual tasks and employs a state regularization technique to mitigate overestimation errors in state-varying environments. Numerical results demonstrate that TAVT significantly enhances generalization to OOD tasks across various MuJoCo and MetaWorld environments.
Paperid:1654
Authors:Hilal Asi · Vinod Raman · Aadirupa Saha
Abstract:
We design differentially private algorithms for the problem of prediction with expert advice under dynamic regret, also known as tracking the best expert. Our work addresses three natural types of adversaries, stochastic with shifting distributions, oblivious, and adaptive, and designs algorithms with sublinear regret for all three cases. In particular, under a shifting stochastic adversary where the distribution may shift $S$ times, we provide an $\epsilon$-differentially private algorithm whose expected dynamic regret is at most $O\left( \sqrt{S T \log (NT)} + \frac{S \log (NT)}{\epsilon}\right)$, where $T$ and $N$ are the time horizon and number of experts, respectively. For oblivious adversaries, we give a reduction from dynamic regret minimization to static regret minimization, resulting in an upper bound of $O\left(\sqrt{S T \log(NT)} + \frac{S T^{1/3}\log(T/\delta) \log(NT)}{\epsilon ^{2/3}}\right)$ on the expected dynamic regret, where $S$ now denotes the allowable number of switches of the best expert. Finally, similar to static regret, we establish a fundamental separation between oblivious and adaptive adversaries for the dynamic setting: while our algorithms show that sub-linear regret is achievable for oblivious adversaries in the high-privacy regime $\epsilon \le \sqrt{S/T}$, we show that any $(\epsilon, \delta)$-differentially private algorithm must suffer linear dynamic regret under adaptive adversaries for $\epsilon \le \sqrt{S/T}$. To complement this lower bound, we give an $\epsilon$-differentially private algorithm that attains sub-linear dynamic regret under adaptive adversaries whenever $\epsilon \gg \sqrt{S/T}$.
Paperid:1655
Authors:Guiping Cao · Tao Wang · Wenjian Huang · Xiangyuan Lan · Jianguo Zhang · Dongmei Jiang
Abstract:
Open-Ended object Detection (OED) is a novel and challenging task that detects objects and generates their category names in a free-form manner, eliminating the need for additional vocabularies during inference compared to Open Vocabulary Detection. However, the existing OED models, such as GenerateU, require large-scale datasets for training, suffer from slow convergence, and exhibit limited performance. To address these issues, we present a novel and efficient Open-Det framework, consisting of four collaborative parts. Specifically, Open-Det accelerates model training in both the bounding box and object name generation process by reconstructing the Object Detector and the Object Name Generator. To bridge the semantic gap between Vision and Language modalities, we propose a Vision-Language Aligner with V-to-L and L-to-V alignment mechanisms, together with the Prompts Distiller to transfer knowledge from the VLM into VL-prompts, enabling accurate object name generation for the LLM. In addition, we design a Masked Alignment Loss to eliminate contradictory supervision and introduce a Joint Loss to enhance classification, resulting in more efficient training. Compared to GenerateU, Open-Det, using only 1.5% of the training data (0.077M vs. 5.077M), 20.8% of the training epochs (31 vs. 149), and fewer GPU resources (4 V100 vs. 16 A100), achieves even higher performance (+1.0% in APr). The source codes will be released upon acceptance.
Paperid:1656
Authors:Hongshen Xu · Zichen Zhu · Lei Pan · Zihan Wang · Su Zhu · Da Ma · Ruisheng Cao · Lu Chen · Kai Yu
Abstract:
Large Language Models (LLMs) have expanded their capabilities beyond language generation to interact with external tools, enabling automation and real-world applications. However, tool hallucinations—where models either select inappropriate tools or misuse them—pose significant challenges, leading to erroneous task execution, increased computational costs, and reduced system reliability. To systematically address this issue, we define and categorize tool hallucinations into two main types: tool selection hallucination and tool usage hallucination. To evaluate and mitigate these issues, we introduce RelyToolBench, which integrates specialized test cases and novel metrics to assess hallucination-aware task success and efficiency. Finally, we propose Relign, a reliability alignment framework that expands the tool-use action space to include indecisive actions, allowing LLMs to defer tool use, seek clarification, or adjust tool selection dynamically. Through extensive experiments, we demonstrate that Relign significantly reduces tool hallucinations, improves task reliability, and enhances the efficiency of LLM tool interactions. The code and data will be publicly available.
Paperid:1657
Authors:Mehrdad Moghimi · Hyejin Ku
Abstract:
In domains such as finance, healthcare, and robotics, managing worst-case scenarios is critical, as failure to do so can lead to catastrophic outcomes. Distributional Reinforcement Learning (DRL) provides a natural framework to incorporate risk sensitivity into decision-making processes. However, existing approaches face two key limitations: (1) the use of fixed risk measures at each decision step often results in overly conservative policies, and (2) the interpretation and theoretical properties of the learned policies remain unclear. While optimizing a static risk measure addresses these issues, its use in the DRL framework has been limited to the simple static CVaR risk measure. In this paper, we present a novel DRL algorithm with convergence guarantees that optimizes for a broader class of static Spectral Risk Measures (SRM). Additionally, we provide a clear interpretation of the learned policy by leveraging the distribution of returns in DRL and the decomposition of static coherent risk measures. Extensive experiments demonstrate that our model learns policies aligned with the SRM objective, and outperforms existing risk-neutral and risk-sensitive DRL models in various settings.
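For intuition, the empirical form of a spectral risk measure is a weighted average of sorted outcomes, with CVaR recovered by a particular weight profile. A schematic sketch of these static risk objects only (not the paper's DRL algorithm; the loss-based sign convention here is one common choice):

```python
import math

def spectral_risk(losses, weights):
    """Empirical spectral risk: a weighted average of sorted losses,
    with non-increasing weights placed on the worst (largest) losses.
    weights must be non-negative and sum to 1."""
    ordered = sorted(losses, reverse=True)  # worst outcomes first
    return sum(w * l for w, l in zip(weights, ordered))

def cvar_weights(n, alpha):
    """CVaR_alpha as a special case of the spectrum: uniform weight
    on the worst ceil(alpha*n) outcomes, zero on the rest."""
    k = max(1, math.ceil(alpha * n))
    return [1.0 / k] * k + [0.0] * (n - k)
```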
Paperid:1658
Authors:Jaeheun Jung · Jaehyuk Lee · ChangHae Jung · Hanyoung Kim · Bosung Jung · Donghun Lee
Abstract:
Shock waves caused by earthquakes can be devastating. Generating realistic earthquake-caused ground motion waveforms helps reduce losses in lives and properties, yet generative models for the task tend to generate subpar waveforms. We present the High-fidelity Earthquake Ground-motion Generation System (HEGGS) and demonstrate its superior performance using earthquakes from North American, East Asian, and European regions. HEGGS exploits the intrinsic characteristics of earthquake datasets and learns the waveforms using an end-to-end differentiable generator containing a conditional latent diffusion model and a high-fidelity waveform construction model. We show the learning efficiency of HEGGS by training it on a single GPU machine and validate its performance using earthquake databases from North America, East Asia, and Europe, using diverse criteria from waveform generation tasks and seismology. Once trained, HEGGS can generate three-dimensional E-N-Z seismic waveforms with accurate P/S phase arrivals, envelope correlation, signal-to-noise ratio, GMPE analysis, frequency content analysis, and section plot analysis.
Paperid:1659
Authors:Shuqi Lu · Xiaohong Ji · Lin Yao · Siyuan Liu · Bohang Zhang · Zhifeng Gao · Linfeng Zhang · Guolin Ke
Abstract:
Molecular pretrained representations (MPR) have emerged as a powerful approach for addressing the challenge of limited supervised data in applications such as drug discovery and material design. While early MPR methods relied on 1D sequences and 2D graphs, recent advancements have incorporated 3D conformational information to capture rich atomic interactions. However, these prior models treat molecules merely as discrete atom sets, overlooking the space surrounding them. We argue from a physical perspective that modeling only these discrete points is insufficient. We first present a simple yet insightful observation: naively adding randomly sampled virtual points beyond atoms can surprisingly enhance MPR performance. In light of this, we propose a principled framework that incorporates the entire 3D space spanned by molecules. We implement the framework via a novel Transformer-based architecture, dubbed SpaceFormer, with three key components: (1) grid-based space discretization; (2) grid sampling/merging; and (3) efficient 3D positional encoding. Extensive experiments show that SpaceFormer significantly outperforms previous 3D MPR models across various downstream tasks with limited data, validating the benefit of leveraging the additional 3D space beyond atoms in MPR models.
Paperid:1660
Authors:Sanghoon Yu · Min-hwan Oh
Abstract:
We study the linear bandit problem under limited adaptivity, known as the batched linear bandit. While existing approaches can achieve near-optimal regret in theory, they are often computationally prohibitive or underperform in practice. We propose \texttt{BLAE}, a novel batched algorithm that integrates arm elimination with regularized G-optimal design, achieving (up to logarithmic factors in $T$) the minimax optimal regret in both large-$K$ and small-$K$ regimes for the first time, while using only $O(\log\log T)$ batches. Our analysis introduces new techniques for batch-wise optimal design and refined concentration bounds. Crucially, \texttt{BLAE} demonstrates low computational overhead and strong empirical performance, outperforming state-of-the-art methods in extensive numerical evaluations. Thus, \texttt{BLAE} is the first algorithm to combine provable near-optimality in all regimes and practical superiority in batched linear bandits.
Paperid:1661
Authors:Hengrui Zhang · Liancheng Fang · Qitian Wu · Philip Yu
Abstract:
While autoregressive models dominate natural language generation, their application to tabular data remains limited due to two challenges: 1) tabular data contains heterogeneous types, whereas autoregressive next-token (distribution) prediction is designed for discrete data, and 2) tabular data is column permutation-invariant, requiring flexible generation orders. Traditional autoregressive models, with their fixed generation order, struggle with tasks like missing data imputation, where the target and conditioning columns vary. To address these issues, we propose Diffusion-nested Non-autoregressive Transformer (TabNAT), a hybrid model combining diffusion processes and masked generative modeling. For continuous columns, TabNAT uses a diffusion model to parameterize their conditional distributions, while for discrete columns, it employs next-token prediction with KL divergence minimization. A masked Transformer with bi-directional attention enables order-agnostic generation, allowing it to learn the distribution of target columns conditioned on arbitrary observed columns. Extensive experiments on ten datasets with diverse properties demonstrate TabNAT's superiority in both unconditional tabular data generation and conditional missing data imputation tasks.
Paperid:1662
Authors:Tianwei Lin · Wenqiao Zhang · Sijing Li · Yuqian Yuan · Binhe Yu · Haoyuan Li · Wanggui He · Hao Jiang · Mengze Li · Song xiaohui · Siliang Tang · Jun Xiao · Hui Lin · Yueting Zhuang · Beng Chin Ooi
Abstract:
We present HealthGPT, a powerful Medical Large Vision-Language Model (Med-LVLM) that integrates medical visual comprehension and generation capabilities within a unified autoregressive paradigm. Our bootstrapping philosophy is to progressively adapt heterogeneous comprehension and generation knowledge to pre-trained large language models (LLMs). This is achieved through a novel heterogeneous low-rank adaptation (H-LoRA) technique, which is complemented by a tailored hierarchical visual perception approach and a three-stage learning strategy. To effectively learn the HealthGPT, we devise a comprehensive medical domain-specific comprehension and generation dataset called VL-Health. Experimental results demonstrate exceptional performance and scalability of HealthGPT in medical visual unified tasks. Our project can be accessed at https://anonymous.4open.science/r/HealthGPT-9533. We will open source the key techniques and dataset to support further research in this direction.
Paperid:1663
Authors:Xiang Zhang · Jiaqi Wei · Sheng Xu · Zijie Qiu · Nanqing Dong · ZhiQiang Gao · Siqi Sun
Abstract:
Peptide sequencing—the process of identifying amino acid sequences from mass spectrometry data—is a fundamental task in proteomics. Non-Autoregressive Transformers (NATs) have proven highly effective for this task, outperforming traditional methods. Unlike autoregressive models, which generate tokens sequentially, NATs predict all positions simultaneously, leveraging bidirectional context through unmasked self-attention. However, existing NAT approaches often rely on Connectionist Temporal Classification (CTC) loss, which presents significant optimization challenges due to CTC's complexity and increases the risk of training failures. To address these issues, we propose an improved non-autoregressive peptide sequencing model that incorporates a structured protein sequence curriculum learning strategy. This approach adjusts the learning difficulty based on the model's estimated generation capability, assessed through a sampling process, progressively learning peptide generation from simple to complex sequences. Additionally, we introduce a self-refining inference-time module that iteratively enhances predictions using learned NAT token embeddings, improving sequence accuracy at a fine-grained level. Our curriculum learning strategy reduces the frequency of NAT training failures by more than 90% across various data distributions. Evaluations on nine benchmark species demonstrate that our approach outperforms all previous methods across multiple metrics and species, from the amino acid to the peptide level. All results are fully reproducible, with code and trained model weights available in an anonymous repository.
Paperid:1664
Authors:Zachary Brown · David Carlson
Abstract:
The field of hypothesis generation promises to reduce costs in neuroscience by narrowing the range of interventional studies needed to study various phenomena. Existing machine learning methods can generate scientific hypotheses from complex datasets, but many approaches assume causal relationships are static over time, limiting their applicability to systems with dynamic, statedependent behavior, such as the brain. While some techniques attempt dynamic causal discovery through factor models, they often restrict relationships to linear patterns or impose other simplifying assumptions. We propose a novel method that models dynamic graphs as a conditionally weighted superposition of static graphs, where each static graph can capture nonlinear relationships. This approach enables the detection of complex, time-varying interactions between variables beyond linear limitations. Our method improves f1-scores of predicted dynamic causal patterns by roughly 22-28% on average over baselines in some of our experiments, with some improvements reaching well over 60%. A case study on real brain data demonstrates our method's ability to uncover relationships linked to specific behavioral states, offering valuable insights into neural dynamics.
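The superposition step itself can be sketched directly; the nonlinearity enters through how the state-conditional weights are produced, which is omitted here. Function and variable names are hypothetical, not from the paper:

```python
def dynamic_adjacency(static_graphs, weights):
    """Dynamic graph as a weighted superposition of static graphs:
    A(t) = sum_k w_k(t) * A_k, where the w_k(t) are state-conditional
    mixing weights and each A_k is an adjacency matrix given as a
    list of lists."""
    n = len(static_graphs[0])
    out = [[0.0] * n for _ in range(n)]
    for w, A in zip(weights, static_graphs):
        for i in range(n):
            for j in range(n):
                out[i][j] += w * A[i][j]
    return out

# Two static causal patterns mixed by (hypothetical) state-dependent weights
A1 = [[0.0, 1.0], [0.0, 0.0]]  # variable 0 drives variable 1
A2 = [[0.0, 0.0], [1.0, 0.0]]  # variable 1 drives variable 0
A_t = dynamic_adjacency([A1, A2], [0.25, 0.75])
```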
Paperid:1665
Authors:Thomas Chen · Zhiyuan Li · Tengyu Ma
Abstract:
Length generalization is the ability of a learner to learn a hypothesis that generalizes to inputs that are longer than the inputs in the training set. Length generalization is important for training systems to perform well on mathematical and algorithmic tasks where instances of large length are scarce in the training data, but where interesting instances have long length. In this paper, we propose a theoretical explanation for when length generalization is possible and when we can get non-asymptotic guarantees on the required length of training data to length generalize. We study a learning setup where the learner is a minimum-complexity interpolator, for a natural complexity measure of functions in the hypothesis class. Within this setup, we first prove a non-asymptotic upper bound of $2n - 2$ for the input-length required to perfectly length generalize on the class of $n$-state Deterministic Finite Automata (DFA). Next, we prove that there is no computable upper bound for the input-length required to perfectly length generalize on the class of Context-Free Grammars (CFGs). The sudden jump in the difficulty of length generalization for CFGs motivated us to look for other realistic function classes with more amenable length generalization guarantees, such as C-RASP: a class of functions $:\{ 0,1\}^* \to \{ 0,1\}$ that is a superset of functions expressible by finite-precision transformers. As our main result, we prove a non-asymptotic upper bound of $O((KT)^{O(K^2)})$ on the input-length required to perfectly length generalize on the class of $2$-layer, $K$-head, $T$-precision C-RASP functions. Our length generalization results apply to the Dyck-$1$ language and its higher-precision variants, showing that inputs of length at most $\mathrm{poly}(T)$ are sufficient to perfectly length generalize.
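For reference, Dyck-1 membership reduces to a single running counter, the kind of bounded-counting computation at issue in these results. An illustrative checker, not taken from the paper:

```python
def is_dyck1(s):
    """Membership test for Dyck-1 (balanced strings over '(' and ')'):
    a single running depth counter suffices; the string is in the
    language iff the depth never goes negative and ends at zero."""
    depth = 0
    for c in s:
        depth += 1 if c == '(' else -1
        if depth < 0:  # a ')' with no matching '(' so far
            return False
    return depth == 0
```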
Paperid:1666
Authors:Qi Xu · Jie Deng · Jiangrong Shen · Biwu Chen · Huajin Tang · Gang Pan
Abstract:
Event-based object detection has gradually drawn attention for its high time resolution, high dynamic range, and asynchronous address event representation. Spiking Neural Networks (SNNs) offer distinct advantages, including low energy consumption and rich spatiotemporal dynamics. Therefore, in this study, a novel hybrid spike vision Transformer (HsVT) model is proposed, which combines spatial and temporal feature extraction components. The spatial feature extraction module extracts the local and global features in the input data, and the temporal feature extraction module captures the time dependencies and long-term patterns in the input sequence. With this combination, HsVT models can better capture spatial and temporal features when processing sequence data, thus improving the modeling ability for complex event-based object detection tasks. In addition, we collect a fall detection dataset and release it publicly as a benchmark for event-based object detection. Recorded with an event-based camera, the fall detection dataset preserves facial privacy and saves memory storage thanks to the event representation. We evaluated the performance of HsVT at different model sizes and compared performance on the GEN1 and fall detection datasets. The experimental results show that the HsVT method achieves remarkable performance improvements on the event detection task with fewer parameters, and provides an effective solution for event-based object detection.
Paperid:1667
Authors:Nico Pelleriti · Max Zimmer · Elias Wirth · Sebastian Pokutta
Abstract:
Deep neural networks have reshaped modern machine learning by learning powerful latent representations that often align with the manifold hypothesis: high-dimensional data lie on lower-dimensional manifolds. In this paper, we establish a connection between manifold learning and computational algebra by demonstrating how vanishing ideals can characterize the latent manifolds of deep networks. To that end, we propose a new neural architecture that (i) truncates a pretrained network at an intermediate layer, (ii) approximates each class manifold via polynomial generators of the vanishing ideal, and (iii) transforms the resulting latent space into linearly separable features through a single polynomial layer. The resulting models have significantly fewer layers than their pretrained baselines, while maintaining comparable accuracy, achieving higher throughput, and utilizing fewer parameters. Furthermore, drawing on spectral complexity analysis, we derive sharper theoretical guarantees for generalization, showing that our approach can in principle offer tighter bounds than standard deep networks. Numerical experiments confirm the effectiveness and efficiency of the proposed approach.
Paperid:1668
Authors:Itay Yona · Jamie Hayes · Ilia Shumailov · Yossi Gandelsman
Abstract:
Large Language Models (LLMs), despite their impressive capabilities, often fail to accurately repeat a single word when instructed to do so (Barbero et al., 2024). This study links this unexpected behavior to ``attention sinks'', a mechanism previously identified as crucial for fluency in LLMs, where the initial token in a sequence receives disproportionately high attention. We pinpoint the specific neural circuit responsible for attention sinks: the first attention layer marks the initial token, and a later MLP neuron adds high-magnitude values to its hidden state, creating the sink. This circuit, consistently observed across various LLMs, is disrupted when processing long repetitions, inducing undesired sinks for each repetition, eventually causing the model to drift from its given instructions. In some cases, this divergence leads to the unintended extraction of training data (Nasr et al., 2023). Based on this understanding, we identify a broader set of non-repeating sequences that trigger the same divergent behavior. Furthermore, we develop a targeted patch that effectively rectifies the issue without impacting overall model performance on other tasks. Our findings highlight the importance of understanding the mechanisms behind well-known issues with LLMs. By leveraging mechanistic interpretability techniques, we are able to track down the root cause of the repeated token phenomenon, and ultimately design a fix for this issue. As far as we are aware, this is the first use of mechanistic interpretability to understand and fix a well-known privacy issue in LLMs.
Paperid:1669
Authors:Zijing Hu · Fengda Zhang · Kun Kuang
Abstract:
The practical applications of diffusion models have been limited by the misalignment between generated images and corresponding text prompts. Recent studies have introduced direct preference optimization (DPO) to enhance the alignment of these models. However, the effectiveness of DPO is constrained by the issue of visual inconsistency, where the significant visual disparity between well-aligned and poorly-aligned images prevents diffusion models from identifying which factors contribute positively to alignment during fine-tuning. To address this issue, this paper introduces D-Fusion, a method to construct DPO-trainable visually consistent samples. On the one hand, by performing mask-guided self-attention fusion, the resulting images are not only well-aligned, but also visually consistent with given poorly-aligned images. On the other hand, D-Fusion can retain the denoising trajectories of the resulting images, which are essential for DPO training. Extensive experiments demonstrate the effectiveness of D-Fusion in improving prompt-image alignment when applied to different reinforcement learning algorithms.
Paperid:1670
Authors:Jing-Jing Li · Valentina Pyatkin · Max Kleiman-Weiner · Liwei Jiang · Nouha Dziri · Anne Collins · Jana Schaich Borg · Maarten Sap · Yejin Choi · Sydney Levine
Abstract:
The ideal AI safety moderation system would be both structurally interpretable (so its decisions can be reliably explained) and steerable (to align to safety standards and reflect a community's values), which current systems fall short on. To address this gap, we present SafetyAnalyst, a novel AI safety moderation framework. Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences by creating a structured "harm-benefit tree," which enumerates harmful and beneficial actions and effects the AI behavior may lead to, along with likelihood, severity, and immediacy labels that describe potential impact on any stakeholders. SafetyAnalyst then aggregates all harmful and beneficial effects into a harmfulness score using fully interpretable weight parameters, which can be aligned to particular safety preferences. We applied this conceptual framework to develop, test, and release an open-source LLM prompt safety classification system, distilled from 18.5 million harm-benefit features generated by frontier LLMs on 19k prompts. On a comprehensive set of prompt safety benchmarks, we show that SafetyAnalyst (average F1=0.81) outperforms existing LLM safety moderation systems (average F1$<$0.72) on prompt safety classification, while offering the additional advantages of interpretability, transparency, and steerability.
Paperid:1671
Authors:ning wang · Zekun Li · Man Yao · Zhen Qin · Tongxin Bai · Guoqi Li
Abstract:
Linear attention offers the advantages of linear inference time and fixed memory usage compared to Softmax attention. However, training large-scale language models with linear attention from scratch remains prohibitively expensive and exhibits significant performance gaps compared to Softmax-based models. To address these challenges, we focus on transforming pre-trained Softmax-based language models into linear attention models. We unify mainstream linear attention methods using a \textbf{high-order QK integration theory} and a \textbf{multi-level vocabulary decomposition}. Specifically, the QK integration theory explains the efficacy of combining linear and sparse attention from the perspective of information collection across different frequency bands. The multi-level vocabulary decomposition exponentially expands memory capacity by recursively exploiting compression loss from compressed states. Through detailed error analysis, we demonstrate the superior approximation of Softmax attention achieved by our approach. To further improve performance and reduce training costs, we adopt a \textbf{soft integration strategy} with attention scores, effectively combining a sliding window mechanism. With less than 100M tokens, our method fine-tunes models to achieve linear complexity while retaining 99\% of their original performance. Compared to state-of-the-art linear attention models and methods, our approach improves MMLU scores by 1.2 percentage points with minimal fine-tuning. Furthermore, even without the sliding window mechanism, our method achieves state-of-the-art performance on all test sets with 10B tokens.
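The complexity claim in the first sentence comes from the standard kernelized form of linear attention: with a nonnegative feature map, the key-value summary is built once and reused by every query. A hedged numpy sketch (non-causal, with an arbitrary ReLU feature map; this is the generic formulation, not the paper's QK integration scheme):

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Kernelized attention: O(T * d * d_v) time, fixed-size state,
    versus the O(T^2) pairwise scores of Softmax attention."""
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                 # (d, d_v) summary, built once for all queries
    Z = Kf.sum(axis=0)            # normalizer, also query-independent
    return (Qf @ KV) / (Qf @ Z)[:, None]

rng = np.random.default_rng(0)
T, d, dv = 12, 4, 3
Q, K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=(T, dv))
out = linear_attention(Q, K, V)
```

Because the feature map is strictly positive, each output row is a convex combination of value rows, mirroring what the softmax normalization guarantees.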
Paperid:1672
Authors:Yalan Qin · Guorui Feng · Xinpeng Zhang
Abstract:
Multi-view clustering aims to improve the final performance by taking advantage of the complementary and consistent information of all views. In the real world, data samples with partially available information are common, and the issue of clustering incomplete multi-view data inevitably arises. To deal with partial data at large scales, some fast clustering approaches for incomplete multi-view data have been presented. Despite their significant success, few of these methods pay attention to learning high-quality anchors in a unified framework for incomplete multi-view clustering while ensuring scalability to large-scale incomplete datasets. In addition, most existing approaches for incomplete multi-view clustering neglect to build the relation between the anchor graph and the similarity matrix in symmetric nonnegative matrix factorization, and instead directly conduct graph partition based on the anchor graph to reduce space and time consumption. In this paper, we propose a novel fast incomplete multi-view clustering method for large-scale data, termed Fast Incomplete Multi-view clustering by flexible anchor Learning (FIML), where graph construction, anchor learning and graph partition are simultaneously integrated into a unified framework for fast incomplete multi-view clustering. To be specific, we learn a shared anchor graph to guarantee consistency among multiple views and employ an adaptive weight coefficient to balance the impact of each view. The relation between the anchor graph and the similarity matrix in symmetric nonnegative matrix factorization can also be built, i.e., each entry in the anchor graph can characterize the similarity between the anchor and the original data sample. We then adopt an alternating algorithm for solving the formulated problem. Experiments conducted on different datasets confirm the superiority of FIML compared with other clustering methods for incomplete multi-view data.
Paperid:1673
Authors:Junhao Dong · Piotr Koniusz · Yifei Zhang · Hao Zhu · Weiming Liu · Xinghua Qu · Yew Soon ONG
Abstract:
Vision-Language Models (VLMs), e.g., CLIP, excel at zero-shot classification but are vulnerable to adversarial examples, posing security risks. Adversarial fine-tuning robustifies zero-shot models by aligning prediction scores $g\_\theta(\cdot)$ of individual adversaries $x+\delta\_x$ with their clean counterparts $x$ by minimizing $\min\limits\_{\theta}\lVert g\_\theta(x+\delta_x)-g\_\theta(x)\rVert_2$ w.r.t. parameters $\theta$, which overlooks intermediate adversarial samples along the adversarial trajectory, the path crossing the decision boundary. Wang and Lu showed that intermediate adversaries and their vicinity offer quality adversaries. Gao et al. suggested sampling adversarial candidates from 2D simplices formed by joining vertex $x$ with consecutive vertices $x+\delta\_{x,i}$ and $x+\delta\_{x,{i+1}}$ on the adversarial trajectory. However, sampling 2D simplices is prohibitively costly, and each such sample has to be fed into $\min\limits\_{\theta}\underbrace{{\scriptstyle\frac{1}{\kappa}}\sum\nolimits\_{\delta\_x\in\Delta\_X}\lVert g\_\theta(x+\delta\_x)-g\_\theta(x)\rVert\_2}\_{\Omega}$ where cardinality $\kappa=\lvert\Delta_X\rvert$ means the number of samples grows $\kappa$-fold. To train a robust VLM, we overcome these limitations by a Taylor expansion of $g\_\theta(x+\delta\_x)$ around $x$ and formulating an upper bound of the approximated alignment loss $\Omega$ that depends on the closed-form $\Sigma\_x$ instead of a costly second-order matrix $\hat{\Sigma}\_x={\scriptstyle\frac{1}{\kappa}}\sum\nolimits\_{\delta\_x\in\Delta_X}\delta\_x\delta\_x^T$, and the Jacobian/Hessian $\big(J\_{g}(x), (H\_{g}(x))\_c\big)$ of $g\_\theta(\cdot)$. As regions between clean and intermediate adversarial samples capture a larger decision landscape, we robustify the VLM by plausible adversaries on simplices via our closed-form formulations, equivalent to infinite uniform sampling. We obtain state-of-the-art results with a $10\times$ speed-up on 15 datasets.
Paperid:1674
Authors:Kun Wang · Sumanth Varambally · Duncan Watson-Parris · Yian Ma · Rose Yu
Abstract:
Many important phenomena in scientific fields such as climate, neuroscience, and epidemiology are naturally represented as spatiotemporal gridded data with complex interactions. Inferring causal relationships from these data is a challenging problem compounded by the high dimensionality of such data and the correlations between spatially proximate points. We present SPACY (SPAtiotemporal Causal discoverY), a novel framework based on variational inference, designed to model latent time series and their causal relationships from spatiotemporal data. SPACY alleviates the high-dimensional challenge by discovering causal structures in the latent space. To aggregate spatially proximate, correlated grid points, we use spatial kernels to map observational time series to latent representations. Theoretically, we generalize the problem to a continuous spatial domain, and establish the identifiability of our model under minimal assumptions about the latent generation process. Notably, we do not assume the absence of instantaneous effects or require conditions of sufficient variability, which are hard to verify in practice. Empirically, SPACY outperforms state-of-the-art baselines on synthetic data, even in challenging settings where existing methods struggle, while remaining scalable for large grids. SPACY also identifies key known phenomena from real-world climate data.
Paperid:1675
Authors:Shuowei Jin · Xueshen Liu · Qingzhao Zhang · Zhuoqing Morley Mao
Abstract:
Large Language Models (LLMs) are increasingly deployed in large-scale online services, enabling sophisticated applications. However, the computational overhead of generating key-value (KV) caches in the prefill stage presents a major bottleneck, particularly for long-context inputs. Prefix caching mitigates this issue by storing KV caches for reuse, reducing redundant computation. Despite its advantages, prefix caching suffers from high latency due to the limited I/O bandwidth of storage devices, constraining inference efficiency. To address this challenge, we introduce Cake, a novel KV cache loading system that optimally utilizes both computational and I/O resources in parallel. Cake employs a bidirectional scheduling strategy that dynamically balances KV cache computation and loading, ensuring efficient resource utilization. Additionally, Cake incorporates an adaptive scheduling mechanism that seamlessly integrates with non-prefix caching requests, improving system throughput and adapting to fluctuating resource availability. Through extensive evaluations across various hardware configurations, datasets, and storage conditions, Cake achieves an average 2.6× reduction in Time to First Token (TTFT) compared to compute-only and I/O-only methods. Our findings highlight Cake as an effective and practical solution for optimizing long-context LLM inference, bridging the gap between computation and I/O efficiency in large-scale AI deployments.
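The bidirectional idea, storage streaming cached KV chunks while the GPU recomputes others in parallel, can be caricatured as choosing the split that balances the two streams. A toy planner under the simplifying assumption of uniform per-chunk costs (the function and its cost model are hypothetical illustrations, not Cake's actual scheduler):

```python
def plan_chunks(n_chunks, t_compute, t_load):
    """Load chunks [0, split) from storage while recomputing the rest on GPU;
    the two streams run in parallel, so finish time is the max of the two."""
    best_split, best_finish = 0, n_chunks * t_compute
    for split in range(n_chunks + 1):
        finish = max(split * t_load, (n_chunks - split) * t_compute)
        if finish < best_finish:
            best_split, best_finish = split, finish
    return best_split, best_finish
```

With equal costs the optimum splits the work in half; when loading is twice as slow, more chunks go to the compute stream, which is the kind of dynamic balance the abstract describes.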
Paperid:1676
Authors:Gabriele Visentin · Patrick Cheridito
Abstract:
We present a novel method for efficiently computing optimal transport maps and Wasserstein barycenters in high-dimensional spaces. Our approach uses conditional normalizing flows to approximate the input distributions as invertible pushforward transformations from a common latent space. This makes it possible to directly solve the primal problem using gradient-based minimization of the transport cost, unlike previous methods that rely on dual formulations and complex adversarial optimization. We show how this approach can be extended to compute Wasserstein barycenters by solving a conditional variance minimization problem. A key advantage of our conditional architecture is that it enables the computation of barycenters for hundreds of input distributions, which was computationally infeasible with previous methods. Our numerical experiments illustrate that our approach yields accurate results across various high-dimensional tasks and compares favorably to previous state-of-the-art methods.
Paperid:1677
Authors:Aadirupa Saha · Tomer Koren · Yishay Mansour
Abstract:
We address the problem of convex optimization with dueling feedback, where the goal is to minimize a convex function given a weaker form of \emph{dueling} feedback. Each query consists of two points and the dueling feedback returns a (noisy) single-bit binary comparison of the function values of the two queried points. The translation of the function values to the single comparison bit is through a \emph{transfer function}. This problem has been addressed previously for some restricted classes of transfer functions, but here we consider a very general transfer function class which includes all functions that admit a series expansion about the origin. Our main contribution is an efficient algorithm with a convergence rate of $\smash{O}(\epsilon^{-4p})$ for smooth convex functions, and an optimal rate of $\smash{\widetilde O}(\epsilon^{-2p})$ when the objective is both smooth and strongly convex, where $p$ is the minimal degree (with a non-zero coefficient) in the transfer's series expansion about the origin.
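The feedback model is easy to state concretely: each query reveals only a noisy bit whose bias is the transfer function applied to the value gap. A small simulation with a logistic transfer (one illustrative choice; the abstract covers any transfer admitting a series expansion about the origin, and the toy objective is an arbitrary assumption):

```python
import math
import random

def dueling_oracle(f, transfer, x, y, rng):
    """One noisy bit: returns 1 with probability transfer(f(x) - f(y))."""
    return 1 if rng.random() < transfer(f(x) - f(y)) else 0

logistic = lambda t: 1.0 / (1.0 + math.exp(-t))
f = lambda x: (x - 2.0) ** 2          # toy smooth, strongly convex objective

rng = random.Random(0)
bits = [dueling_oracle(f, logistic, 0.0, 2.0, rng) for _ in range(2000)]
freq = sum(bits) / len(bits)          # estimates logistic(f(0) - f(2)) = logistic(4)
```

Repeating the same query lets a learner estimate the comparison bias, which is how a single bit per query can still drive a convergent optimization procedure.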
Paperid:1678
Authors:Xingxing Yang · Jie Chen · Zaifeng Yang
Abstract:
This paper presents a novel approach to hyperspectral image (HSI) reconstruction from RGB images, addressing fundamental limitations in existing learning-based methods from a physical perspective. We discuss and aim to address the ``colorimetric dilemma": failure to consistently reproduce ground-truth RGB from predicted HSI, thereby compromising physical integrity and reliability in practical applications. To tackle this issue, we propose PhySpec, a physically consistent framework for robust HSI reconstruction. Our approach fundamentally exploits the intrinsic physical relationship between HSIs and corresponding RGBs by employing orthogonal subspace decomposition, which enables explicit estimation of camera spectral sensitivity (CSS). This ensures that our reconstructed spectra align with well-established physical principles, enhancing their reliability and fidelity. Moreover, to efficiently use internal information from test samples, we propose a self-supervised meta-auxiliary learning (MAXL) strategy that rapidly adapts the trained parameters to unseen samples using only a few gradient descent steps at test time, while simultaneously constraining the generated HSIs to accurately recover ground-truth RGB values. Thus, MAXL reinforces the physical integrity of the reconstruction process. Extensive qualitative and quantitative evaluations validate the efficacy of our proposed framework, showing superior performance compared to SOTA methods.
Paperid:1679
Authors:Zhenglin Zhou · Xiaobo Xia · Fan Ma · Hehe Fan · Yi Yang · Tat-Seng Chua
Abstract:
Text-to-3D generation automates 3D content creation from textual descriptions, which offers transformative potential across various fields. However, existing methods often struggle to align generated content with human preferences, limiting their applicability and flexibility. To address these limitations, in this paper, we propose DreamDPO, an optimization-based framework that integrates human preferences into the 3D generation process, through direct preference optimization. Practically, DreamDPO first constructs pairwise examples, then validates their alignment with human preferences using reward or large multimodal models, and lastly optimizes the 3D representation with a preference-driven loss function. By leveraging relative preferences, DreamDPO reduces reliance on precise quality evaluations while enabling fine-grained controllability through preference-guided optimization. Experiments demonstrate that DreamDPO achieves state-of-the-art results, and provides higher-quality and more controllable 3D content compared to existing methods. The code and models will be open-sourced.
Paperid:1680
Authors:Martino Bernasconi · Andrea Celli · Riccardo Colini Baldeschi · Federico Fusco · Stefano Leonardi · Matteo Russo
Abstract:
In the random-order model for online learning, the sequence of losses is chosen upfront by an adversary and presented to the learner after a random permutation. Any random-order input is asymptotically equivalent to a stochastic i.i.d.~one, but, for finite times, it may exhibit significant non-stationarity, which can hinder the performance of stochastic learning algorithms. While algorithms for adversarial inputs naturally maintain their regret guarantees in random order, simple no-regret algorithms exist for the stochastic model that fail against random-order instances. In this paper, we propose a general procedure to adapt stochastic learning algorithms to the random-order model without substantially affecting their regret guarantees. This allows us to recover improved regret bounds for prediction with delays, bandits with switching costs, and online learning with constraints. Finally, we investigate online classification and prove that, in random order, learnability is characterized by the VC dimension rather than by the Littlestone dimension, thus providing a further separation from the general adversarial model.
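The model itself is simple to simulate: the adversary commits to a loss multiset upfront, and the learner receives a uniformly random permutation of it. A sketch of why this behaves i.i.d.-like on average (the specific loss values are arbitrary assumptions for illustration):

```python
import random

def random_order_stream(losses, seed):
    """Adversary fixes the multiset upfront; learner sees a random order."""
    order = list(losses)
    random.Random(seed).shuffle(order)
    return order

losses = [0.0] * 50 + [1.0] * 50        # highly non-stationary if presented sorted
overall_mean = sum(losses) / len(losses)

# Each position of a random-order stream is a uniform draw from the multiset,
# so e.g. the first loss, averaged over many permutations, matches the overall mean.
first_avg = sum(random_order_stream(losses, s)[0] for s in range(400)) / 400
```

Any single permutation, however, can still place all the 1.0 losses early, which is the finite-time non-stationarity the abstract warns can break purely stochastic algorithms.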
Paperid:1681
Authors:Zhengming Chen · Yewei Xia · Feng Xie · Jie Qiao · Zhifeng Hao · Ruichu Cai · Kun Zhang
Abstract:
We study the problem of learning discrete latent variable causal structures from mixed-type observational data. Traditional methods, such as those based on the tensor rank condition, are designed to identify discrete latent structure models and provide robust identification bounds for discrete causal models. However, when observed variables—specifically, those representing the children of latent variables—are collected at various levels with continuous data types, the tensor rank condition is not applicable, limiting further causal structure learning for latent variables. In this paper, we consider a more general case where observed variables can be either continuous or discrete, and further allow for scenarios where multiple latent parents cause the same set of observed variables. We show that, under the completeness condition, it is possible to discretize the data in a way that satisfies the full-rank assumption required by the tensor rank condition. This enables the identifiability of discrete latent structure models within mixed-type observational data. Moreover, we introduce the two-sufficient measurement condition, a more general structural assumption under which the tensor rank condition holds and the underlying latent causal structure is identifiable by a proposed two-stage identification algorithm. Extensive experiments on both simulated and real-world data validate the effectiveness of our method.
Paperid:1682
Authors:Chen Hao Xia · Manasa Kaniselvan · Alexandros Nikolaos Ziogas · Marko Mladenovic · Rayen Mahjoub · Alexander Maeder · Mathieu Luisier
Abstract:
Graph neural networks (GNNs) have shown promise in learning the ground-state electronic properties of materials, subverting *ab initio* density functional theory (DFT) calculations when the underlying lattices can be represented as small and/or repeatable unit cells (i.e., molecules and periodic crystals). Realistic systems are, however, non-ideal and generally characterized by higher structural complexity. As such, they require large (10+ Å) unit cells and thousands of atoms to be accurately described. At these scales, DFT becomes computationally prohibitive, making GNNs especially attractive. In this work, we present a strictly local equivariant GNN capable of learning the electronic Hamiltonian (**H**) of realistically extended materials. It incorporates an *augmented partitioning* approach that enables training on arbitrarily large structures while preserving local atomic environments beyond boundaries. We demonstrate its capabilities by predicting the electronic Hamiltonian of various systems with up to 3,000 nodes (atoms), 500,000+ edges, ~ 28 million orbital interactions (nonzero entries of **H**), and $\leq$0.55\% error in the eigenvalue spectra. Our work expands the applicability of current electronic property prediction methods to some of the most challenging cases encountered in computational materials science, namely systems with disorder, interfaces, and defects.
Paperid:1683
Authors:Oscar Skean · Md Rifat Arefin · Dan Zhao · Niket Patel · Jalal Naghiyev · Yann LeCun · Ravid Shwartz-Ziv
Abstract:
From extracting features to generating text, the outputs of large language models (LLMs) typically rely on their final layers, following the conventional wisdom that earlier layers capture only low-level cues. However, our analysis shows that intermediate layers can encode even richer representations, often improving performance on a wide range of downstream tasks. To explain and quantify these hidden-layer properties, we propose a unified framework of representation quality metrics based on information theory, geometry, and invariance to input perturbations. Our framework highlights how each model layer balances information compression and signal preservation, revealing why mid-depth embeddings can exceed the last layer’s performance. Through extensive experiments on 32 text-embedding tasks and comparisons across model architectures (transformers, state-space models) and domains (language, vision), we demonstrate that intermediate layers consistently provide stronger features. These findings challenge the standard focus on final-layer embeddings and open new directions for model analysis and optimization, including strategic use of mid-layer representations for more robust and accurate AI systems.
Paperid:1684
Authors:Yujun Shi · Jun Hao Liew · Hanshu YAN · Vincent Tan · Jiashi Feng
Abstract:
Accuracy and speed are critical in image editing tasks. Pan et al. introduced a drag-based framework using Generative Adversarial Networks, and subsequent studies have leveraged large-scale diffusion models. However, these methods often require over a minute per edit and exhibit low success rates. We present LightningDrag, which achieves high-quality drag-based editing in about one second on general images. By redefining drag-based editing as a conditional generation task, we eliminate the need for time-consuming latent optimization or gradient-based guidance. Our model is trained on large-scale paired video frames, capturing diverse motion (object translations, pose shifts, zooming, etc.) to significantly improve accuracy and consistency. Despite being trained only on videos, our model generalizes to local deformations beyond the training data (e.g., lengthening hair, twisting rainbows). Extensive evaluations confirm the superiority of our approach, and we will release both code and model.
Paperid:1685
Authors:Mohamad Chehade · Wenting Li · Brian Bell · Russell Bent · Saif Kazi · Hao Zhu
Abstract:
The robustness of neural networks is crucial in safety-critical applications, where identifying a reliable input space is essential for effective model selection, robustness evaluation, and the development of reliable control strategies. Most existing robustness verification methods assess the worst-case output under the assumption that the input space is known. However, precisely identifying a verifiable input space $ \mathcal{C} $, where no adversarial examples exist, is challenging due to the possible high dimensionality, discontinuity, and non-convex nature of the input space. To address this challenge, we propose a novel framework, **LEVIS**, comprising **LEVIS-$\alpha$** and **LEVIS-$\beta$**. **LEVIS-$\alpha$** identifies a single, large verifiable ball that intersects at least two boundaries of a bounded region $ \mathcal{C} $, while **LEVIS-$\beta$** systematically captures the entirety of the verifiable space by integrating multiple verifiable balls. Our contributions are fourfold: we introduce a verification framework, **LEVIS**, incorporating two optimization techniques for computing nearest and directional adversarial points based on mixed-integer programming (MIP); to enhance scalability, we integrate complementary constrained (CC) optimization with a reduced MIP formulation, achieving up to a 17-fold reduction in runtime by approximating the verifiable region in a principled way; we provide a theoretical analysis characterizing the properties of the verifiable balls obtained through **LEVIS-$\alpha$**; and we validate our approach across diverse applications, including electrical power flow regression and image classification, demonstrating performance improvements and visualizing the geometric properties of the verifiable region.
Paperid:1686
Authors:Kevin Xia · Elias Bareinboim
Abstract:
The study of causal abstractions bridges two integral components of human intelligence: the ability to determine cause and effect, and the ability to interpret complex patterns into abstract concepts. Formally, causal abstraction frameworks define connections between complicated low-level causal models and simple high-level ones. One major limitation of most existing definitions is that they are not well-defined when considering lossy abstraction functions in which multiple low-level interventions can have different effects while mapping to the same high-level intervention (an assumption called the abstract invariance condition). In this paper, we introduce a new type of abstractions called projected abstractions that generalize existing definitions to accommodate lossy representations. We show how to construct a projected abstraction from the low-level model and how it translates equivalent observational, interventional, and counterfactual causal queries from low- to high-level. Given that the true model is rarely available in practice, we prove a new graphical criterion for identifying and estimating high-level causal queries from limited low-level data. Finally, we experimentally show the effectiveness of projected abstraction models in high-dimensional image settings.
Paperid:1687
Authors:Chen Zeno · Hila Manor · Gregory Ongie · Nir Weinberger · Tomer Michaeli · Daniel Soudry
Abstract:
While diffusion models generate high-quality images via probability flow, the theoretical understanding of this process remains incomplete. A key question is when probability flow converges to training samples or more general points on the data manifold. We analyze this by studying the probability flow of shallow ReLU neural network denoisers trained with minimal $\ell^2$ norm. For intuition, we introduce a simpler score flow and show that for orthogonal datasets, both flows follow similar trajectories, converging to a training point or a sum of training points. However, early stopping by the scheduler allows probability flow to reach more general manifold points. This aligns with observations that diffusion models memorize and reproduce training samples while also combining elements to form a “semantic sum.” We extend these results to obtuse simplex data and, through simulations in the orthogonal case, confirm that probability flow converges to a training point, a sum of training points, or a manifold point. Moreover, memorization decreases when the number of training samples grows, as fewer samples accumulate near training points.
Paperid:1688
Authors:Chenrui Wang · Yixuan Qiu
Abstract:
The entropic-regularized optimal transport (OT) has gained massive attention in machine learning due to its ability to provide scalable solutions for OT-based tasks. However, most of the existing algorithms, including the Sinkhorn algorithm and its extensions, behave like first-order optimization methods, thus suffering from relatively slow convergence. More recently, some second-order methods have been proposed based on the idea of Hessian sparsification. Despite their promising results, they have two major issues: first, there is limited theoretical understanding on the effect of sparsification; second, in cases where the transport plan is dense, Hessian sparsification does not perform well. In this paper, we propose a new quasi-Newton method to address these problems. First, we develop new theoretical analyses to understand the benefits of Hessian sparsification, which lays the foundation for highly flexible sparsification schemes. Then we introduce an additional low-rank term in the approximate Hessian to better handle the dense case. Finally, convergence properties of the proposed algorithm are rigorously analyzed, and various numerical experiments are conducted to demonstrate its improved performance in solving large-scale OT problems.
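For context, the first-order baseline this abstract improves on is the Sinkhorn iteration, which alternately rescales the rows and columns of the Gibbs kernel until both marginals match. A minimal textbook implementation (a standard version, not the proposed quasi-Newton method; the problem instance is arbitrary):

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.5, iters=500):
    """Entropic OT via alternating scalings of K = exp(-C / eps).
    Each sweep only does matrix-vector products, hence first-order behavior."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]       # approximate transport plan

rng = np.random.default_rng(0)
n = 5
C = rng.random((n, n))                       # toy cost matrix
a = b = np.full(n, 1.0 / n)                  # uniform marginals
P = sinkhorn(C, a, b)
```

The returned plan is nonnegative with (approximately) the prescribed row and column sums, and the slow, first-order convergence of these alternating scalings is exactly what motivates the second-order methods the paper studies.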
Paperid:1689
Authors:Benson Chen · Tomasz Danel · Gabriel Dreiman · Patrick McEnaney · Nikhil Jain · Kirill Novikov · Spurti Akki · Joshua L. Turnbull · Virja Pandya · Boris Belotserkovskii · Jared Weaver · Ankita Biswas · Dat Nguyen · Kent Gorday · Mohammad M Sultan · Nathaniel Stanley · Daniel Whalen · Divya Kanichar · Christoph Klein · Emily Fox · R. Watts
Abstract:
DNA-Encoded Libraries (DELs) represent a transformative technology in drug discovery, facilitating the high-throughput exploration of vast chemical spaces. Despite their potential, the scarcity of publicly available DEL datasets presents a bottleneck for the advancement of machine learning methodologies in this domain. To address this gap, we introduce KinDEL, one of the largest publicly accessible DEL datasets and the first one that includes binding poses from molecular docking experiments. Focused on two kinases, Mitogen-Activated Protein Kinase 14 (MAPK14) and Discoidin Domain Receptor Tyrosine Kinase 1 (DDR1), KinDEL includes 81 million compounds, offering a rich resource for computational exploration. Additionally, we provide comprehensive biophysical assay validation data, encompassing both on-DNA and off-DNA measurements, which we use to evaluate a suite of machine learning techniques, including novel structure-based probabilistic models. We hope that our benchmark, encompassing both 2D and 3D structures, will help advance the development of machine learning models for data-driven hit identification using DELs.
Paperid:1690
Authors:Rogerio Bonatti · Dan Zhao · Francesco Bonacci · Dillon Dupont · Sara Abdali · Yinheng Li · Yadong Lu · Justin Wagle · Kazuhito Koishida · Arthur Bucker · Lawrence Jang · Zheng Hui
Abstract:
Capable large language models (LLMs) show potential as computer agents, enhancing human productivity and software accessibility in multimodal tasks. However, measuring agent performance in sufficiently realistic and complex environments becomes increasingly challenging as: (i) most benchmarks are limited to specific modalities/domains (e.g., text-only, web navigation, Q\&A) and (ii) full benchmark evaluations are slow (on the order of magnitude of multiple hours or days) given the multi-step sequential nature of tasks. To address these challenges, we introduce Windows Agent Arena: a reproducible, general environment focusing exclusively on the Windows operating system (OS) where agents can operate freely within a real Windows OS and use the same wide range of applications, tools, and web browsers available to human users when solving tasks. We create 150+ diverse tasks across representative domains that require agent abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized for a full benchmark evaluation in as little as $20$ minutes. Our work not only speeds up the development and evaluation of multi-modal agents, but also highlights and analyzes existing shortfalls in the agentic abilities of several multimodal LLMs as agents within the Windows computing environment---with the best achieving only a 19.5\% success rate compared to a human success rate of 74.5\%.
Paperid:1691
Authors:Haoyue Dai · Yiwen Qiu · Ignavier Ng · Xinshuai Dong · Peter Spirtes · Kun Zhang
Abstract:
Addressing selection bias in latent variable causal discovery is important yet underexplored, largely due to a lack of suitable statistical tools: While various tools beyond basic conditional independencies have been developed to handle latent variables, none have been adapted for selection bias. We make an attempt by studying rank constraints, which, as a generalization of conditional independence constraints, exploit the ranks of covariance submatrices in linear Gaussian models. We show that although selection can significantly complicate the joint distribution, interestingly, the ranks in the biased covariance matrices still preserve meaningful information about both causal structures and selection mechanisms. We provide a graph-theoretic characterization of such rank constraints. Using this tool, we demonstrate that the one-factor model, a classical latent variable model, can be identified under selection bias. Simulations and real-world experiments confirm the effectiveness of using our rank constraints.
Paperid:1692
Authors:Hanglei Hu · Yingying Guo · Zhikang Chen · Sen Cui · Fei Wu · Kun Kuang · Min Zhang · Bo Jiang
Abstract:
Personalized learning, especially data-based methods, has garnered widespread attention in recent years, aiming to meet individual student needs. However, many works rely on the implicit assumption that benchmarks are high-quality and well-annotated, which limits their practical applicability. In real-world scenarios, these benchmarks often exhibit long-tail distributions, significantly impacting model performance. To address this challenge, we propose a novel method called Neural-Collapse-Advanced personalized Learning (NCAL), designed to learn features that conform to the same simplex equiangular tight frame (ETF) structure. NCAL introduces Text-modality Collapse (TC) regularization to optimize the distribution of text embeddings within the large language model (LLM) representation space. Notably, NCAL is model-agnostic, making it compatible with various architectures and approaches, thereby ensuring broad applicability. Extensive experiments demonstrate that NCAL effectively enhances existing works, achieving new state-of-the-art performance. Additionally, NCAL mitigates class imbalance, significantly improving the model’s generalization ability.
Paperid:1693
Authors:Liyao Lyu · Jingrun Chen
Abstract:
We propose a gradient-free deep reinforcement learning algorithm to solve high-dimensional, finite-horizon stochastic control problems. Although the recently developed deep reinforcement learning framework has achieved great success in solving these problems, direct estimation of policy gradients from Monte Carlo sampling often suffers from high variance. To address this, we introduce the Momentum Consensus-Based Optimization (M-CBO) and Adaptive Momentum Consensus-Based Optimization (Adam-CBO) frameworks. These methods optimize policies using Monte Carlo estimates of the value function, rather than its gradients. Adjustable Gaussian noise supports efficient exploration, helping the algorithm converge to optimal policies in complex, nonconvex environments. Numerical results confirm the accuracy and scalability of our approach across various problem dimensions and show the potential for extension to mean-field control problems. Theoretically, we prove that M-CBO can converge to the optimal policy under some assumptions.
Paperid:1694
Authors:Vignav Ramesh · Kenneth Li
Abstract:
Communication between multiple language model (LM) agents has been shown to scale up the reasoning ability of LMs. While natural language has been the dominant medium for inter-LM communication, it is not obvious this should be the standard: not only does natural language communication incur high inference costs that scale quickly with the number of both agents and messages, but also the decoding process abstracts away too much rich information that could otherwise be accessed from the internal activations. In this work, we propose a simple technique whereby LMs communicate via *activations*; concretely, we pause an LM $B$'s computation at an intermediate layer, combine its current activation with another LM $A$'s intermediate activation via some function $f$, then pass $f$'s output into the next layer of $B$ and continue the forward pass until decoding is complete. This approach scales up LMs on new tasks with *zero* additional parameters and data, and saves a *substantial amount of compute* over natural language communication. We test our method with various functional forms $f$ on two experimental setups—multi-player coordination games and reasoning benchmarks—and find that it achieves up to $27$% improvement over natural language communication across datasets with $<$$1/4$ the compute, illustrating the superiority and robustness of activations as an alternative "language" for communication between LMs.
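The activation-combination step described above can be sketched with toy models. The snippet below is a minimal illustration under assumed details (random linear "layers" stand in for the two LMs, and the element-wise mean is one possible choice of $f$); it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for two LMs: each "model" is a stack of linear layers.
# All names (layers_a, layers_b, combine) are illustrative, not from the paper.
DIM = 8
layers_a = [rng.standard_normal((DIM, DIM)) * 0.1 for _ in range(4)]
layers_b = [rng.standard_normal((DIM, DIM)) * 0.1 for _ in range(4)]

def run_layers(layers, h):
    for w in layers:
        h = np.tanh(h @ w)
    return h

def combine(h_b, h_a):
    # One possible choice of f: the element-wise mean of the two activations.
    return 0.5 * (h_b + h_a)

x = rng.standard_normal(DIM)
k = 2  # intermediate layer at which B's forward pass is paused

h_a = run_layers(layers_a[:k], x)      # A's intermediate activation
h_b = run_layers(layers_b[:k], x)      # B's computation, paused at layer k
h_b = combine(h_b, h_a)                # inject A's activation via f
out = run_layers(layers_b[k:], h_b)    # resume B's forward pass

print(out.shape)  # (8,)
```

In a real transformer setting, $f$ would act on hidden states captured at matching layers, e.g., via forward hooks.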
Paperid:1695
Authors:Ziyang Yu · Wenbing Huang · Yang Liu
Abstract:
Molecular Dynamics (MD) simulations are essential for understanding the atomic-level behavior of molecular systems, giving insights into their transitions and interactions. However, classical MD techniques are limited by the trade-off between accuracy and efficiency, while recent deep learning-based improvements have mostly focused on single-domain molecules, lacking transferability to unfamiliar molecular systems. Therefore, we propose \textbf{Uni}fied \textbf{Sim}ulator (UniSim), which leverages cross-domain knowledge to enhance the understanding of atomic interactions. First, we employ a multi-task pretraining approach to learn a unified atomic representation model from a large and diverse set of molecular data. Then, based on the stochastic interpolant framework, we learn the state transition patterns over long timesteps from MD trajectories, and introduce a force guidance module for rapidly adapting to different chemical environments. Our experiments demonstrate that UniSim achieves highly competitive performance across small molecules, peptides, and proteins.
Paperid:1696
Authors:Max Hopkins · Shay Moran
Abstract:
Stability is a central property in learning and statistics, promising that the output of an algorithm $\mathcal{A}$ does not change substantially when applied to similar datasets $S$ and $S'$. It is an elementary fact that any sufficiently stable algorithm (e.g. one returning the same result with high probability, satisfying privacy guarantees, etc.) must be randomized. This raises a natural question: can we quantify *how much* randomness is needed for algorithmic stability? We study the randomness complexity of two influential notions of stability in learning: *replicability*, which promises $\mathcal{A}$ usually outputs the same result when run over samples from the same distribution (and shared random coins), and *differential privacy*, which promises the output distribution of $\mathcal{A}$ remains similar under neighboring datasets. The randomness complexity of these notions was studied recently in (Dixon, Pavan, Vander Woude, and Vinodchandran ICML 2024) and (Cannone, Su, and Vadhan ITCS 2024) for basic $d$-dimensional tasks (e.g. estimating the bias of $d$ coins), but little is known about the measures more generally or in complex settings like classification. Toward this end, we prove a "weak-to-strong" boosting theorem for stability: the randomness complexity of a task $\mathcal{M}$ (either under replicability or DP) is tightly controlled by the best replication probability of any *deterministic* algorithm solving $\mathcal{M}$, a weak measure called $\mathcal{M}$'s "global stability" that is universally capped at $\frac{1}{2}$ (Chase, Moran, Yehudayoff FOCS 2023). Using this connection, we characterize the randomness complexity of PAC Learning: a class has bounded randomness complexity iff it has finite Littlestone dimension, and moreover it scales at worst logarithmically in the excess error of the learner.
This resolves a question of (Chase, Chornomaz, Moran, and Yehudayoff STOC 2024) who asked for such a characterization in the equivalent language of `list-replicability'.
Paperid:1697
Authors:Jiahai Feng · Stuart Russell · Jacob Steinhardt
Abstract:
Pretrained language models (LMs) can generalize to implications of facts that they are finetuned on. For example, if finetuned on "John Doe lives in Tokyo," LMs correctly answer "What language do the people in John Doe's city speak?'' with "Japanese''. However, little is known about the mechanisms that enable this generalization or how they are learned during pretraining. We introduce extractive structures as a framework for describing how components in LMs (e.g., MLPs or attention heads) coordinate to enable this generalization. The structures consist of informative components that store training facts as weight changes, and upstream and downstream extractive components that query and process the stored information to produce the correct implication. We hypothesize that extractive structures are learned during pretraining when encountering implications of previously known facts. This yields two predictions: a data ordering effect, where extractive structures can be learned only if facts precede their implications, and a weight grafting effect, where extractive structures can be grafted to predict counterfactual implications. We empirically show these effects in the OLMo-7b, Llama 3-8b, Gemma 2-9b, and Qwen 2-7b models. Of independent interest, our results also indicate that fact learning can occur at both early and late layers, leading to different forms of generalization.
Paperid:1698
Authors:Haike Xu · Zongyu Lin · Kai-Wei Chang · Yizhou Sun · Piotr Indyk
Abstract:
Contradiction retrieval refers to identifying and extracting documents that explicitly disagree with or refute the content of a query, which is important to many downstream applications like fact checking and data cleaning. To retrieve arguments that contradict a query from large document corpora, existing methods such as similarity search and cross-encoder models exhibit different limitations. To address these challenges, we introduce a novel approach: SparseCL, which leverages specially trained sentence embeddings designed to preserve subtle, contradictory nuances between sentences. Our method utilizes a combined metric of cosine similarity and a sparsity function to efficiently identify and retrieve documents that contradict a given query. This approach dramatically enhances the speed of contradiction detection by reducing the need for exhaustive document comparisons to simple vector calculations. We conduct contradiction retrieval experiments on Arguana, MSMARCO, and HotpotQA, where our method produces an average improvement of $11.0\%$ across different models. We also validate our method on downstream tasks like natural language inference and cleaning corrupted corpora. This paper outlines a promising direction for non-similarity-based information retrieval, which is currently underexplored.
Paperid:1699
Authors:Jiashun Liu · Johan Obando Ceron · Pablo Samuel Castro · Aaron Courville · Ling Pan
Abstract:
Off-policy deep reinforcement learning (RL) agents typically leverage replay buffers for reusing past experiences during learning. This can help sample efficiency when the collected data is informative and aligned with the learning objectives; when that is not the case, it has the effect of ``polluting'' the replay buffer with data that can exacerbate optimization challenges, in addition to wasting environment interactions due to redundant sampling. We argue that sampling these uninformative and wasteful transitions can be avoided by addressing the **sunk cost fallacy**, which, in the context of deep RL, is the tendency towards continuing an episode until termination. To address this, we propose the *learn to stop* (**LEAST**) mechanism, which uses statistics based on $Q$-values and gradients to help agents recognize when to terminate unproductive episodes early. We demonstrate that our method improves learning efficiency on a variety of RL algorithms, evaluated on both the MuJoCo and DeepMind Control Suite benchmarks.
Paperid:1700
Authors:Ilias Diakonikolas · Daniel Kane · Sushrut Karmalkar · Sihan Liu · Thanasis Pittas
Abstract:
We study the task of list-decodable linear regression using batches, recently introduced by Das et al. (2023). In this setting, we are given $m$ batches with each batch containing $n$ points in $\mathbb R^d$. A batch is called clean if the points it contains are i.i.d. samples from an unknown linear regression distribution. For a parameter $\alpha \in (0, 1/2)$, an unknown $\alpha$-fraction of the batches are clean and no assumptions are made on the remaining batches. The goal is to output a small list of vectors at least one of which is close to the true regressor vector in $\ell_2$-norm. Das et al. (2023) gave an efficient algorithm for this task, under natural distributional assumptions, with the following guarantee. Under the assumption that the batch size satisfies $n \geq \tilde{\Omega}(\alpha^{-1})$ and the total number of batches is $m = \text{poly}(d, n, 1/\alpha)$, their algorithm runs in polynomial time and outputs a list of $O(1/\alpha^2)$ vectors at least one of which is $\tilde{O}(\alpha^{-1/2}/\sqrt{n})$ close to the target regressor. Here we design a new polynomial-time algorithm for this task with significantly stronger guarantees under the assumption that the low-degree moments of the covariates distribution are Sum-of-Squares (SoS) certifiably bounded. Specifically, for any constant $\delta>0$, as long as the batch size is $n \geq \Omega_{\delta}(\alpha^{-\delta})$ and the degree-$\Theta(1/\delta)$ moments of the covariates are SoS certifiably bounded, our algorithm uses $m = \text{poly}((dn)^{1/\delta}, 1/\alpha)$ batches, runs in polynomial time, and outputs an $O(1/\alpha)$-sized list of vectors one of which is $O(\alpha^{-\delta/2}/\sqrt{n})$ close to the target. That is, our algorithm substantially improves both the minimum batch size and the final error guarantee, while achieving the optimal list size.
Our approach leverages higher-order moment information by carefully combining the SoS paradigm interleaved with an iterative method and a novel list pruning procedure for this setting. In the process, we give an SoS proof of the Marcinkiewicz-Zygmund inequality that may be of broader applicability.
Paperid:1701
Authors:Nolan Koblischke · Hyunseok Jang · Kristen Menou · Mohamad Ali-Dib
Abstract:
Modern science emerged from reasoning over repeatedly observed planetary motions. We present Gravity-Bench-v1, an environment-based benchmark that challenges AI agents on tasks that parallel this historical development. Gravity-Bench-v1 evaluates agents on the discovery of physics concealed within a dynamic environment, using rigorous gravitational dynamics simulations. Gravity-Bench includes out-of-distribution cases, i.e. with physics that deviates from the real world, to evaluate true scientific generalization capabilities. Agents must plan to collect data within an experimental budget and must perform a dynamic form of data analysis and reasoning to solve tasks efficiently. Our benchmark admits an open-ended space of solutions. PhD-level solutions for each task are provided, to calibrate AI performance against human expertise. Technically at an upper-undergraduate level, our benchmark proves challenging to baseline AI agents. Gravity-Bench-v1 and planned extensions should help map out AI progress towards scientific discovery capabilities.
Paperid:1702
Authors:Hao Fei · Yuan Zhou · Juncheng Li · Xiangtai Li · Qingshan Xu · Bobo Li · Shengqiong Wu · Yaoting Wang · Junbao Zhou · Jiahao Meng · Qingyu Shi · Zhiyuan Zhou · Liangtao Shi · Minghe Gao · Daoan Zhang · Zhiqi Ge · Siliang Tang · Kaihang Pan · Yaobo Ye · Haobo Yuan · Tao Zhang · Weiming Wu · Tianjie Ju · Zixiang Meng · Shilin Xu · Liyu Jia · Wentao Hu · Meng Luo · Jiebo Luo · Tat-Seng Chua · Hanwang Zhang · Shuicheng YAN
Abstract:
The current community evaluates large multimodal foundation models straightforwardly by following the belief that higher performance across tasks indicates a stronger capability. In this work, we argue that this evaluation approach might hinder the development of more general and powerful multimodal generalists in Multimodal Language Models (MLLMs). To address this, we propose a General-Level framework, inspired by the five-level capability grading mechanisms in the autonomous driving industry, to assess the performance and generality of MLLMs across five levels. Central to this framework is the concept of Synergy, which categorizes capabilities based on whether MLLMs preserve synergy across comprehension, generation, and multimodal interactions. To evaluate the comprehensive abilities of various generalists, we present General-Bench, a massive multimodal benchmark encompassing a broader spectrum of skills, modalities, formats, and capabilities, with over 700 tasks and 325,800 instances. The evaluation results, involving over 100 state-of-the-art MLLMs, reveal the capability rankings of generalists, highlighting the challenges in achieving genuine AI.
Paperid:1703
Authors:Yuanhong Zhang · Muyao Yuan · Weizhan Zhang · Tieliang Gong · Wen Wen · Jiangyong Ying · Weijie Shi
Abstract:
The Segment Anything Model (SAM), a vision foundation model, exhibits impressive zero-shot capabilities in general tasks but struggles in specialized domains. Parameter-efficient fine-tuning (PEFT) is a promising approach to unleash the potential of SAM in novel scenarios. However, existing PEFT methods for SAM neglect the domain-invariant relations encoded in the pre-trained model. To bridge this gap, we propose InfoSAM, an information-theoretic approach that enhances SAM fine-tuning by distilling and preserving its pre-trained segmentation knowledge. Specifically, we formulate the knowledge transfer process as two novel mutual information-based objectives: (i) to compress the domain-invariant relation extracted from pre-trained SAM, excluding pseudo-invariant information as much as possible, and (ii) to maximize mutual information between the relational knowledge learned by the teacher (pre-trained SAM) and the student (fine-tuned model). The proposed InfoSAM establishes a robust distillation framework for PEFT of SAM. Extensive experiments across diverse benchmarks validate InfoSAM’s effectiveness in improving the SAM family's performance on real-world tasks, demonstrating its adaptability and superiority in handling specialized scenarios.
Paperid:1704
Authors:Zhepei Wei · Wei-Lin Chen · Xinyu Zhu · Yu Meng
Abstract:
Autoregressive decoding is inherently limited by its sequential token generation process, where each token must be generated before the next can be processed. This sequential dependency restricts the ability to fully leverage modern hardware's parallel processing capabilities. Existing methods like speculative decoding and layer skipping offer potential speedups but have notable drawbacks: speculative decoding relies on an auxiliary "drafter" model, which can be challenging to acquire and increases memory overhead, while layer skipping may introduce discrepancies in the generated outputs due to the missing key-value (KV) cache at skipped layers. In this work, we propose AdaDecode, a fast and accurate decoding method that accelerates autoregressive decoding without requiring auxiliary models or modifications to the original model parameters, while guaranteeing output consistency. AdaDecode leverages the insight that many tokens---particularly simple or highly predictable ones---can be generated accurately using intermediate layer representations, as further layers often do not significantly alter predictions once the model reaches a certain confidence level. By adaptively generating tokens at intermediate layers when confidence is high, AdaDecode enables the next token’s computation to begin immediately. The remaining layer computations for early-predicted tokens are deferred and executed in parallel with subsequent tokens when needed, maximizing hardware utilization and reducing decoding latency. A final verification step ensures that early-predicted tokens match the results of standard autoregressive decoding, preserving output parity.
Experiments across diverse generation tasks, including text summarization, code generation, and mathematical reasoning, demonstrate that AdaDecode consistently achieves superior decoding throughput compared to baselines with up to ${\bf 1.73 \times}$ speedup, while guaranteeing output parity with standard autoregressive decoding.
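The confidence-based early exit described above can be sketched in a toy snippet. It assumes per-layer next-token logits are available (the `layer_logits` list is illustrative), and it omits the deferred-computation and verification machinery; this is not the authors' code.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def adaptive_exit(layer_logits, threshold=0.9):
    """Return (token, exit_layer): emit a token at the first intermediate
    layer whose top softmax probability clears `threshold`; otherwise fall
    back to the final layer. A toy illustration of confidence-based exit."""
    for i, logits in enumerate(layer_logits):
        p = softmax(logits)
        if p.max() >= threshold:
            return int(p.argmax()), i
    p = softmax(layer_logits[-1])
    return int(p.argmax()), len(layer_logits) - 1

# A "confident" token: layer 1 already concentrates mass on token 2,
# so decoding can exit two layers early.
layer_logits = [np.array([0.1, 0.2, 0.3, 0.1]),
                np.array([0.0, 0.1, 5.0, 0.2]),
                np.array([0.0, 0.0, 6.0, 0.1])]
print(adaptive_exit(layer_logits))  # (2, 1)
```

Raising the threshold forces more tokens through all layers, trading speed for caution.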
Paperid:1705
Authors:Jiaxuan Zhang · Emadeldeen Eldele · Fuyuan CAO · Yang Wang · Xiaoli Li · Jiye Liang
Abstract:
Estimating Individual Treatment Effects (ITE) from observational data is challenging due to covariate shift and the absence of counterfactuals. While existing methods attempt to balance distributions globally, they often lack fine-grained sample-level alignment, especially in scenarios with significant individual heterogeneity. To address these issues, we reconsider the counterfactual as a proxy to emulate balanced randomization. Furthermore, we derive a theoretical bound that links the expected ITE estimation error to both factual prediction errors and representation distances between factuals and counterfactuals. Building on this theoretical foundation, we propose FCCL, a novel method designed to effectively capture the nuances of potential outcomes under different treatments by (i) generating diffeomorphic counterfactuals that adhere to the data manifold while maintaining high semantic similarity to their factual counterparts, and (ii) mitigating distribution shift via sample-level alignment grounded in our derived generalization-error bound, which considers factual-counterfactual similarity and category consistency. Extensive evaluations on benchmark datasets demonstrate that FCCL outperforms 12 state-of-the-art methods, particularly in capturing individual-level heterogeneity and handling sparse boundary samples.
Paperid:1706
Authors:Xingjin Wang · Howe Tissue · Lu Wang · Linjing Li · Daniel Zeng
Abstract:
Continual Pre-Training (CPT) has become a popular and effective method to apply strong foundation models to specific downstream tasks. In this work, we explore the learning dynamics throughout the CPT process for large language models (LLMs). We specifically focus on how general and downstream domain performance evolves at each training step, with domain performance measured via validation losses. We have observed that the CPT loss curve fundamentally characterizes the transition from one curve to another hidden curve, and can be described by decoupling the effects of distribution shift and learning rate (LR) annealing. We derive a CPT scaling law that combines the two factors, enabling the prediction of loss at any (continual) training step and across learning rate schedules (LRS) in CPT. Our formulation presents a comprehensive understanding of several critical factors in CPT, including the learning rate, the training steps, and the distribution distance between PT and CPT datasets. Moreover, our approach can be adapted to customize training hyper-parameters for different CPT goals, such as balancing general and domain-specific performance. Extensive experiments demonstrate that our scaling law holds across various CPT datasets and training hyper-parameters.
Paperid:1707
Authors:Sunghwan Hong · Jaewoo Jung · Heeseong Shin · Jisang Han · Jiaolong Yang · Chong Luo · Seungryong Kim
Abstract:
We consider the problem of novel view synthesis from unposed images in a single feed-forward pass. Our framework capitalizes on the fast speed, scalability, and high-quality 3D reconstruction and view synthesis capabilities of 3DGS, which we further extend to offer a practical solution that relaxes common assumptions such as dense image views, accurate camera poses, and substantial image overlaps. We achieve this by identifying and addressing unique challenges arising from the use of pixel-aligned 3DGS: misaligned 3D Gaussians across different views induce noisy or sparse gradients that destabilize training and hinder convergence, especially when the above assumptions are not met. To mitigate this, we employ pre-trained monocular depth estimation and visual correspondence models to achieve coarse alignments of 3D Gaussians. We then introduce lightweight, learnable modules to refine depth and pose estimates from the coarse alignments, improving the quality of 3D reconstruction and novel view synthesis. Furthermore, the refined estimates are leveraged to estimate geometry confidence scores, which assess the reliability of 3D Gaussian centers and condition the prediction of Gaussian parameters accordingly. Extensive evaluations on large-scale real-world datasets demonstrate that PF3plat sets a new state-of-the-art across all benchmarks, supported by comprehensive ablation studies validating our design choices. We will make the code and weights publicly available.
Paperid:1708
Authors:Yuliang Liu · Junjie Lu · Chaofeng Qu · Zhaoling Chen · Zefan Cai · Jason Liu · Chonghan Liu · Yunhui Xia · Li Zhao · Jiang Bian · Chuheng Zhang · Wei Shen · Zhouhan Lin
Abstract:
Current approaches for training Process Reward Models (PRMs) often involve breaking down responses into multiple reasoning steps using rule-based techniques, such as using predefined placeholder tokens or fixing the reasoning step length to a predetermined size. These approaches overlook the fact that true decision points are not typically marked by specific words in a text. To address this, we propose AdaptiveStep, a method that divides reasoning steps based on the model's confidence in predicting the next word. This division provides more decision-making insights at each step, enhancing downstream tasks such as reward model processing. Moreover, our method does not require manual annotation. We demonstrate its effectiveness through experiments with AdaptiveStep-trained PRMs in mathematical reasoning and code generation tasks. Experimental results indicate that the AdaptiveStep division achieves state-of-the-art Best-of-N performance, surpassing greedy search strategies with token-level value-guided decoding, while also reducing construction costs by over 30\% compared to existing open-source PRMs. In addition, we provide a thorough analysis and case study on the PRM's performance, transferability, and generalization capabilities. Our code is available on GitHub.
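The confidence-based step division described above can be sketched in a few lines. This toy example assumes per-token confidences are given (e.g., top-1 next-token probabilities); the names and the threshold `tau` are illustrative, not from the paper.

```python
def divide_steps(tokens, confidences, tau=0.8):
    """Split a token sequence into reasoning steps at low-confidence
    positions: a new step begins after any token whose next-token
    confidence falls below `tau`. Toy sketch of confidence-based division."""
    steps, current = [], []
    for tok, conf in zip(tokens, confidences):
        current.append(tok)
        if conf < tau:          # model is uncertain here -> decision point
            steps.append(current)
            current = []
    if current:
        steps.append(current)
    return steps

tokens = ["2", "+", "3", "=", "5", ",", "so", "the", "answer", "is", "5"]
confs  = [0.99, 0.97, 0.95, 0.6, 0.98, 0.99, 0.5, 0.96, 0.97, 0.98, 0.99]
print(divide_steps(tokens, confs))
# [['2', '+', '3', '='], ['5', ',', 'so'], ['the', 'answer', 'is', '5']]
```

Note how the breaks land after "=" and "so", where the model hesitates, rather than at fixed lengths or placeholder tokens.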
Paperid:1709
Authors:Zilin Kang · Chenyuan Hu · Yu Luo · Zhecheng Yuan · Ruijie Zheng · Huazhe Xu
Abstract:
Deep reinforcement learning for continuous control has recently achieved impressive progress. However, existing methods often suffer from primacy bias—a tendency to overfit early experiences stored in the replay buffer—which limits an RL agent’s sample efficiency and generalizability. A common existing approach to mitigate this issue is periodically resetting the agent during training. Yet, even after multiple resets, RL agents can still be impacted by early experiences. In contrast, humans are less susceptible to such bias, partly due to infantile amnesia, where the formation of new neurons disrupts early memory traces, leading to the forgetting of initial experiences. Inspired by these dual processes of forgetting and growing in neuroscience, in this paper we propose Forget and Grow (FoG), a new deep RL algorithm with two mechanisms introduced. First, Experience Replay Decay (ER Decay)—"forgetting early experience''—which balances memory by gradually reducing the influence of early experiences. Second, Network Expansion—"growing neural capacity''—which enhances agents' capability to exploit the patterns of existing data by dynamically adding new parameters during training. Empirical results on four major continuous control benchmarks with more than 40 tasks demonstrate the superior performance of FoG against existing SoTA deep RL algorithms, including BRO, SimBa and TD-MPC2.
Paperid:1710
Authors:Lihe Li · lei yuan · Pengsen Liu · Tao Jiang · Yang Yu
Abstract:
Training with diverse teammates is the key to learning generalizable agents. Typical approaches aim to generate diverse teammates by utilizing techniques like randomization, designing regularization terms, or reducing policy compatibility. However, such teammates lack semantic information, resulting in inefficient teammate generation and poor adaptability of the agents. To tackle these challenges, we propose Semantically Diverse Teammates Generation (SemDiv), a novel framework leveraging the capabilities of large language models (LLMs) to discover and learn diverse coordination behaviors at the semantic level. In each iteration, SemDiv first generates a novel coordination behavior described in natural language, then translates it into a reward function to train a teammate policy. Once the policy is verified to be meaningful, novel, and aligned with the behavior, the agents train a policy for coordination. Through this iterative process, SemDiv efficiently generates a diverse set of semantically grounded teammates, enabling agents to develop specialized policies and select the most suitable ones through language-based reasoning to adapt to unseen teammates. Experiments across four MARL environments show that SemDiv generates teammates covering a wide range of coordination behaviors, including those unreachable by baseline methods. Evaluation with twenty unseen representative teammates demonstrates SemDiv's superior coordination and adaptability.
Paperid:1711
Authors:Marta Skreta · Tara Akhound-Sadegh · Viktor Ohanesian · Roberto Bondesan · Alan Aspuru-Guzik · Arnaud Doucet · Rob Brekelmans · Alexander Tong · Kirill Neklyudov
Abstract:
While score-based generative models are the model of choice across diverse domains, there are limited tools available for controlling inference-time behavior in a principled manner, e.g. for composing multiple pretrained models. Existing classifier-free guidance methods use a simple heuristic to mix conditional and unconditional scores to approximately sample from conditional distributions. However, such methods do not approximate the intermediate distributions, necessitating additional `corrector' steps. In this work, we provide an efficient and principled method for sampling from a sequence of annealed, geometric-averaged, or product distributions derived from pretrained score-based models. We derive a weighted simulation scheme which we call Feynman-Kac Correctors (FKCs) based on the celebrated Feynman-Kac formula by carefully accounting for terms in the appropriate partial differential equations (PDEs). To simulate these PDEs, we propose Sequential Monte Carlo (SMC) resampling algorithms that leverage inference-time scaling to improve sampling quality. We empirically demonstrate the utility of our methods by proposing amortized sampling via inference-time temperature annealing, improving multi-objective molecule generation using pretrained models, and improving classifier-free guidance for text-to-image generation.
Paperid:1712
Authors:Tam Le · Truyen Nguyen · Hideitsu Hino · Kenji Fukumizu
Abstract:
We investigate the Sobolev IPM problem for probability measures supported on a graph metric space. Sobolev IPM is an important instance of integral probability metrics (IPM), and is obtained by constraining a critic function within a unit ball defined by the Sobolev norm. In particular, it has been used to compare probability measures and is crucial for several theoretical works in machine learning. However, to our knowledge, there are no efficient algorithmic approaches to compute Sobolev IPM effectively, which hinders its practical applications. In this work, we establish a relation between the Sobolev norm and a weighted $L^p$-norm, and leverage it to propose a *novel regularization* for Sobolev IPM. By exploiting the graph structure, we demonstrate that the regularized Sobolev IPM provides a *closed-form* expression for fast computation. This advancement addresses long-standing computational challenges, and paves the way to apply Sobolev IPM in practical applications, even in large-scale settings. Additionally, the regularized Sobolev IPM is negative definite. Utilizing this property, we design positive-definite kernels upon the regularized Sobolev IPM, and provide preliminary evidence of their advantages on document classification and topological data analysis for measures on a graph.
Paperid:1713
Authors:Ziyu Gong · Jim Lim · David I. Inouye
Abstract:
Distribution matching (DM) is a versatile domain-invariant representation learning technique that has been applied to tasks such as fair classification, domain adaptation, and domain translation. Existing DM methods can struggle with scalability (non-parametric methods), instability and mode collapse (adversarial methods), and likelihood-based methods often impose unnecessary biases through fixed priors or require learning complex prior distributions. We address a critical limitation: the absence of expressive yet learnable prior distributions that align with geometry-preserving regularization. Our key insight leverages the fact that gradient-based DM training only requires the prior's score function – not its density – enabling us to model the prior via denoising score matching. This approach eliminates biases from fixed priors (common in VAEs) and avoids the computational overhead of learning full prior densities (as in normalizing flows). Compared to other diffusion-based priors (e.g., LSGM), our method demonstrates better stability and computational efficiency. Furthermore, experiments demonstrate superior performance across benchmarks, establishing a new paradigm for efficient and flexible distribution matching.
Paperid:1714
Authors:Wenzhe Niu · Zongxia Xie · Yanru Sun · Wei He · Man Xu · Chao Hao
Abstract:
Recent research has shown an increasing interest in utilizing pre-trained large language models (LLMs) for a variety of time series applications. However, there are three main challenges when using LLMs as foundational models for time series forecasting: (1) Cross-domain generalization. (2) Cross-modality alignment. (3) Error accumulation in autoregressive frameworks. To address these challenges, we propose LangTime, a language-guided unified model for time series forecasting that incorporates cross-domain pre-training with reinforcement learning-based fine-tuning. Specifically, LangTime constructs Temporal Comprehension Prompts (TCPs), which include dataset-wise and channel-wise instructions, to facilitate domain adaptation and condense time series into a single token, enabling LLMs to better understand and align temporal data. To improve autoregressive forecasting, we introduce TimePPO, a reinforcement learning-based fine-tuning algorithm. TimePPO mitigates error accumulation by leveraging a multidimensional reward function tailored for time series and a repeat-based value estimation strategy. Extensive experiments demonstrate that LangTime achieves state-of-the-art cross-domain forecasting performance, while TimePPO fine-tuning effectively enhances the stability and accuracy of autoregressive forecasting.
Paperid:1715
Authors:Abhra Chaudhuri · Anjan Dutta · Tu Bui · Serban Georgescu
Abstract:
We aim to develop a fundamental understanding of modality collapse, a recently observed empirical phenomenon wherein models trained for multimodal fusion tend to rely only on a subset of the modalities, ignoring the rest. We show that modality collapse happens when noisy features from one modality are entangled, via a shared set of neurons in the fusion head, with predictive features from another, effectively masking out positive contributions from the predictive features of the former modality and leading to its collapse. We further prove that cross-modal knowledge distillation implicitly disentangles such representations by freeing up rank bottlenecks in the student encoder, denoising the fusion-head outputs without negatively impacting the predictive features from either modality. Based on the above findings, we propose an algorithm that prevents modality collapse through explicit basis reallocation, with applications in dealing with missing modalities. Extensive experiments on multiple multimodal benchmarks validate our theoretical claims.
Paperid:1716
Authors:Junjie Wen · Yichen Zhu · Minjie Zhu · Zhibin Tang · Jinming Li · Zhongyi Zhou · Xiaoyu Liu · Chaomin Shen · Yaxin Peng · Feifei Feng
Abstract:
In this paper, we present DiffusionVLA, a novel framework that integrates autoregressive reasoning with diffusion policies to address the limitations of existing methods: while autoregressive Vision-Language-Action (VLA) models lack precise and robust action generation, diffusion-based policies inherently lack reasoning capabilities. Central to our approach is autoregressive reasoning — a task decomposition and explanation process enabled by a pre-trained VLM — to guide diffusion-based action policies. To tightly couple reasoning with action generation, we introduce a reasoning injection module that directly embeds self-generated reasoning phrases into the policy learning process. The framework is simple, flexible, and efficient, enabling seamless deployment across diverse robotic platforms. We conduct extensive experiments using multiple real robots to validate the effectiveness of DiVLA. Our tests include a challenging factory sorting task, where DiVLA successfully categorizes objects, including those not seen during training. The reasoning injection module enhances interpretability, enabling explicit failure diagnosis by visualizing the model’s decision process. Additionally, we test DiVLA on a zero-shot bin-picking task, achieving \textbf{63.7\% accuracy on 102 previously unseen objects}. Our method demonstrates robustness to visual changes, such as distractors and new backgrounds, and easily adapts to new embodiments. Furthermore, DiVLA can follow novel instructions and retain conversational ability. Notably, DiVLA is data-efficient and fast at inference; our smallest DiVLA-2B runs at 82 Hz on a single A6000 GPU. Finally, we scale the model from 2B to 72B parameters, showcasing improved generalization capabilities with increased model size.
Paperid:1717
Authors:Yaming Guo · Siyang Guo · Hengshu Zhu · Ying Sun
Abstract:
Model editing plays a crucial role in the cost-effective development of large language models, and the challenge of evolving knowledge facilitates its sequential extension, namely lifelong model editing. However, progress on standard and lifelong editing has historically followed separate tracks, overlooking the potential of generalizing standard methods to lifelong scenarios. By establishing this bridge, we can provide robust baselines in lifelong scenarios and ensure that lifelong editing benefits from the ongoing advancements in standard editing technologies. In response, this paper proposes a general framework, Simulating Ideal Editor (SimIE), which restores the strong performance of parameter-modifying methods from standard model editing in a lifelong context. SimIE formulates the ideal parameter shift as the minimum-norm solution to a linear system, constructed using the Moore-Penrose inverse, and subsequently enables recursive updates by truncating the limiting expression of the Moore-Penrose inverse under two mild assumptions. Theoretically, we demonstrate that if either assumption is not met, the solution provided by SimIE remains near-optimal in a statistical sense or stable against perturbations introduced by the sequential editing, but a trade-off between optimality and stability arises when both assumptions fail. Extensive experiments validate the effectiveness of SimIE, which allows standard algorithms to achieve performance comparable to specialized lifelong model editing methods.
Paperid:1718
Authors:Zeyu Yang · Wesley Armour
Abstract:
Deep learning models in computer vision have achieved significant success but pose increasing concerns about energy consumption and sustainability. Despite these concerns, there is a lack of comprehensive understanding of their energy efficiency during inference. In this study, we conduct a comprehensive analysis of the inference energy consumption of 1,200 ImageNet classification models—the largest evaluation of its kind to date. Our findings reveal a steep diminishing return in accuracy gains relative to the increase in energy usage, highlighting sustainability concerns in the pursuit of marginal improvements. We identify key factors contributing to energy consumption and demonstrate methods to improve energy efficiency. To promote more sustainable AI practices, we introduce an energy efficiency scoring system and develop an interactive web application that allows users to compare models based on accuracy and energy consumption. By providing extensive empirical data and practical tools, we aim to facilitate informed decision-making and encourage collaborative efforts in developing energy-efficient AI technologies.
Paperid:1719
Authors:Pierrick Leroy · Antonio Mastropietro · Marco Nurisso · Francesco Vaccarino
Abstract:
Face Recognition (FR) tasks have significantly progressed with the advent of Deep Neural Networks, particularly through margin-based triplet losses that embed facial images into high-dimensional feature spaces. During training, these contrastive losses focus exclusively on identity information as labels. However, we observe a multiscale geometric structure emerging in the embedding space, influenced by interpretable facial (e.g., hair color) and image attributes (e.g., contrast). We propose a geometric approach to describe the dependence or invariance of FR models to these attributes and introduce a physics-inspired alignment metric. We evaluate the proposed metric on controlled simplified models and widely used FR models fine-tuned with synthetic data for targeted attribute augmentation. Our findings reveal that the models exhibit varying degrees of invariance across different attributes, providing insight into their strengths and weaknesses and enabling deeper interpretability.
Paperid:1720
Authors:Yongxiang Tang · Yanhua Cheng · Xiaocheng Liu · chenchen Jiao · Yanxiang Zeng · Ning Luo · Pengjia Yuan · Xialong Liu · Peng Jiang
Abstract:
In many machine learning tasks, it is often necessary for the relationship between input and output variables to be monotonic, including both strictly monotonic and implicitly monotonic relationships. Traditional methods for maintaining monotonicity mainly rely on construction or regularization techniques. This paper shows that the issue of strict monotonic probability can be viewed as a partial order between an observable revenue variable and a latent cost variable. This perspective enables us to reformulate the monotonicity challenge into modeling the latent cost variable. To tackle this, we introduce a generative network for the latent cost variable, referred to as the Generative Cost Model (GCM), which naturally addresses the strict monotonic problem. Additionally, we propose the Implicit Generative Cost Model (IGCM) to address the implicit monotonic problem. We further validate our approach with a numerical simulation of quantile regression and conduct multiple experiments on public datasets, showing that our method significantly outperforms existing monotonic modeling techniques. The code for our experiments can be found at https://github.com/icml259162/GCM.
Paperid:1721
Authors:Harsh Goel · Mohammad Omama · Behdad Chalaki · Vaishnav Tadiparthi · Ehsan Moradi Pari · Sandeep Chinchali
Abstract:
Multi-agent reinforcement learning (MARL) has achieved significant progress in large-scale traffic control, autonomous vehicles, and robotics. Drawing inspiration from biological systems where roles naturally emerge to enable coordination, role-based MARL methods have been proposed to enhance cooperation learning for complex tasks. However, existing methods exclusively derive roles from an agent's past experience during training, neglecting their influence on its future trajectories. This paper introduces a key insight: an agent’s role should shape its future behavior to enable effective coordination. Hence, we propose Role Discovery and Diversity through Dynamics Models (R3DM), a novel role-based MARL framework that learns emergent roles by maximizing the mutual information between agents' roles, observed trajectories, and expected future behaviors. R3DM optimizes the proposed objective through contrastive learning on past trajectories to first derive intermediate roles that shape intrinsic rewards to promote diversity in future behaviors across different roles through a learned dynamics model. Benchmarking on SMAC and SMACv2 environments demonstrates that R3DM outperforms state-of-the-art MARL approaches, improving multi-agent coordination to increase win rates by up to 20%.
Paperid:1722
Authors:Geyu Liang · Senne Michielssen · Salar Fattahi
Abstract:
The tradeoff between accuracy and interpretability has long been a challenge in machine learning (ML). This tension is particularly significant for emerging interpretable-by-design methods, which aim to redesign ML algorithms for trustworthy interpretability but often sacrifice accuracy in the process. In this paper, we address this gap by investigating the impact of deviations in concept representations—an essential component of interpretable models—on prediction performance and propose a novel framework to mitigate these effects. The framework builds on the principle of optimizing concept embeddings under constraints that preserve interpretability. Using a generative model as a test-bed, we rigorously prove that our algorithm achieves zero loss while progressively enhancing the interpretability of the resulting model. Additionally, we evaluate the practical performance of our proposed framework in generating explainable predictions for image classification tasks across various benchmarks. Compared to existing explainable methods, our approach not only improves prediction accuracy while preserving model interpretability across various large-scale benchmarks but also achieves this with significantly lower computational cost.
Paperid:1723
Authors:Jeet Mohapatra · Nima Dehmamy · Csaba Both · Subhro Das · Tommi Jaakkola
Abstract:
We introduce a novel approach for discovering effective degrees of freedom (DOF) in molecular dynamics simulations by mapping the DOF to approximate symmetries of the energy landscape. Unlike most existing methods, we do not require data and rely on knowledge of the forcefield (energy function) and the initial state. We present a scalable symmetry loss function compatible with existing forcefield frameworks and a Hessian-based method efficient for smaller systems. Our approach enables systematic exploration of conformational space by connecting structural dynamics to energy landscape symmetries. We apply our method to two systems, Alanine dipeptide and Chignolin, recovering their known important conformations. Our approach can prove useful for efficient exploration in molecular simulations with potential applications in protein folding and drug discovery.
Paperid:1724
Authors:Anirudh Sundara Rajan · Yong Jae Lee
Abstract:
Detecting AI-generated images is a challenging yet essential task. A primary difficulty arises from the detector’s tendency to rely on spurious patterns, such as compression artifacts, which can influence its decisions. These issues often stem from specific patterns that the detector associates with the real data distribution, making it difficult to isolate the actual generative traces. We argue that an image should be classified as fake if and only if it contains artifacts introduced by the generative model. Based on this premise, we propose Stay-Positive, an algorithm designed to constrain the detector’s focus to generative artifacts while disregarding those associated with real data. Experimental results demonstrate that detectors trained with Stay-Positive exhibit reduced susceptibility to spurious correlations, leading to improved generalization and robustness to post-processing. Additionally, unlike detectors that associate artifacts with real images, those that focus purely on fake artifacts are better at detecting inpainted real images.
Paperid:1725
Authors:Ziheng Qin · Hailun Xu · Wei Yew · Qi Jia · Yang Luo · Kanchan Sarkar · Danhui Guan · Kai Wang · Yang You
Abstract:
Machine learning relies heavily on data, yet the continuous growth of real-world data poses challenges for efficient dataset construction and training. Conventional approaches retain all available data, leading to non-optimal data and training efficiency. Active learning aims to reduce data redundancy by selecting a subset of samples to annotate, while it increases pipeline complexity and introduces bias. In this work, we propose Info-Coevolution, a new framework that efficiently enables models and data to coevolve through online selective annotation with no bias. Leveraging task-specific models (and open-source models), it selectively annotates and integrates online and web data to improve datasets efficiently. For real-world datasets like ImageNet-1K, Info-Coevolution reduces annotation and training costs by 32\% without performance loss. It can further reduce the annotation ratio to 50\% with semi-supervised learning. We also explore retrieval-based dataset enhancement using unlabeled open-source data. Code will be released.
Paperid:1726
Authors:Ye Zhang · Yu Zhou · Yifeng Wang · Jun Xiao · Ziyue Wang · Yongbing Zhang · Jianxu Chen
Abstract:
Cell instance segmentation is fundamental to analyzing biomedical images, yet accurately distinguishing tightly touching cells remains a persistent challenge. Existing instance segmentation frameworks, including detection-based, contour-based, and distance mapping-based approaches, have made significant progress, but balancing model performance with computational efficiency remains an open problem. In this paper, we propose a novel cell instance segmentation method inspired by the four-color theorem. By conceptualizing cells as countries and tissues as oceans, we introduce a four-color encoding scheme that ensures adjacent instances receive distinct labels. This reformulation transforms instance segmentation into a constrained semantic segmentation problem with only four predicted classes, substantially simplifying the instance differentiation process. To solve the training instability caused by the non-uniqueness of four-color encoding, we design an asymptotic training strategy and encoding transformation method. Extensive experiments on different modes demonstrate that our approach achieves state-of-the-art performance. Our code is available at https://github.com/anonymity/FCIS.
Paperid:1727
Authors:Fang-Duo Tsai · Shih-Lun Wu · Weijaw Lee · Sheng-Ping Yang · Bo-Rui Chen · Hao-Chung Cheng · Yi-Hsuan Yang
Abstract:
We propose MuseControlLite, a lightweight mechanism with only 85M trainable parameters to fine-tune text-to-music generation models to gain precise controllability over various time-varying conditions of musical attributes and from reference audio signals. The key finding is that positional embeddings, which have been seldom used in the conditioner of text-to-music generation models for text conditions, are critical when the condition of interest is a function of time. Taking melody control as an example, our experiment shows that the simple idea of adding rotary positional embeddings to the so-called decoupled cross-attention layers greatly improves the control accuracy from 56.6% to 70.9%, with 6.75 times fewer trainable parameters than state-of-the-art fine-tuning mechanisms, using the same pre-trained diffusion Transformer model of Stable Audio Open. We provide evaluation of various musical attribute controls, audio inpainting, and audio outpainting, demonstrating improved controllability over Music ControlNet and Stable Audio Open ControlNet with a lower cost of fine-tuning. We will open-source the code and model checkpoint, along with demo examples, which can be found at: https://MuseControlLite.github.io/web/.
Paperid:1728
Authors:Ilias Diakonikolas · Vasilis Kontonis · Christos Tzamos · Nikos Zarifis
Abstract:
We study the task of online learning in the presence of Massart noise. Specifically, instead of assuming that the online adversary chooses an arbitrary sequence of labels, we assume that the context $\boldsymbol{x}$ is selected adversarially but the label $y$ presented to the learner disagrees with the ground-truth label of $\boldsymbol{x}$ with unknown probability {\em at most} $\eta$. We focus on the fundamental class of $\gamma$-margin linear classifiers and present the first computationally efficient algorithm that achieves mistake bound $\eta T + o(T)$. We point out that the mistake bound achieved by our algorithm is qualitatively tight for computationally efficient algorithms; this follows from the fact that, even in the offline setting, achieving 0-1 error better than $\eta$ requires super-polynomial time under standard complexity assumptions. We extend our online learning model to a $k$-arm contextual bandit setting where the rewards---instead of satisfying commonly used realizability assumptions---are consistent, in expectation, with some linear ranking function with weight vector $\boldsymbol{w}^\ast$. Given a list of contexts $\boldsymbol{x}_1,\ldots, \boldsymbol{x}_k$, if $\boldsymbol{w}^*\cdot \boldsymbol{x}_i > \boldsymbol{w}^* \cdot \boldsymbol{x}_j$, the expected reward of action $i$ must be larger than that of $j$ by at least $\Delta$. We use our Massart online learner to design an efficient bandit algorithm that obtains expected reward at least $(1-1/k)~\Delta T - o(T)$ bigger than choosing a random action at every round.
Paperid:1729
Authors:Evi Micha · Vasilis Varsamis
Abstract:
Aggregating preferences under incomplete or constrained feedback is a fundamental problem in social choice and related domains. While prior work has established strong impossibility results for pairwise comparisons, this paper extends the inquiry to improvement feedback, where voters express incremental adjustments rather than complete preferences. We provide a complete characterization of the positional scoring rules that can be computed given improvement feedback. Interestingly, while plurality is learnable under improvement feedback—unlike with pairwise feedback—strong impossibility results persist for many other positional scoring rules. Furthermore, we show that improvement feedback, unlike pairwise feedback, does not suffice for the computation of any Condorcet-consistent rule. We complement our theoretical findings with experimental results, providing further insights into the practical implications of improvement feedback for preference aggregation.
Paperid:1730
Authors:Dohoon Lee · Jaehyun Park · Hyunwoo Kim · Kyogu Lee
Abstract:
Flow- and diffusion-based generative modeling demonstrates remarkable performance but lacks two critical properties of simulation-based methods: freedom of dimensionality and adaptability to different starting points of inference trajectories. While simulation-based methods possess these properties, they suffer from poor quality and inefficiency. To address this, we propose the Multidimensional Adaptive Coefficient (MAC), which extends unidimensional coefficients to multidimensional ones and dynamically adapts to starting points of trajectories via parameterization with $\phi$. Our method pre-trains flow and diffusion models $H_\theta$ with a simulation-free objective, followed by adversarial refinement for $\theta$ and $\phi$ under a simulation-based objective. Experiments show that our method achieves performance enhancement across various flow and diffusion methodologies and datasets, while preserving training efficiency. Based on these results, we aim to diversify the attention on trajectory optimality and encourage further exploration in this field.
Paperid:1731
Authors:Amer Krivosija · Alexander Munteanu · André Nusser · Chris Schwiegelshohn
Abstract:
This paper introduces $k$-Dynamic Time Warping ($k$-DTW), a novel dissimilarity measure for polygonal curves. $k$-DTW has stronger metric properties than Dynamic Time Warping (DTW) and is more robust to outliers than the Fréchet distance, which are the two gold standards of dissimilarity measures for polygonal curves. We show interesting properties of $k$-DTW and give an exact algorithm as well as a $(1+\varepsilon)$-approximation algorithm for $k$-DTW by a parametric search for the $k$-th largest matched distance. We prove the first dimension-free learning bounds for curves and further learning theoretic results. $k$-DTW not only admits smaller sample size than DTW for the problem of learning the median of curves, where some factors depending on the curves' complexity $m$ are replaced by $k$, but we also show a surprising separation on the associated Rademacher and Gaussian complexities: $k$-DTW admits strictly smaller bounds than DTW, by a factor $\tilde\Omega(\sqrt{m})$ when $k\ll m$. We complement our theoretical findings with an experimental illustration of the benefits of using $k$-DTW for clustering and nearest neighbor classification.
Paperid:1732
Authors:Xinshuai Dong · Ignavier Ng · Boyang Sun · Haoyue Dai · Guangyuan Hao · Shunxing Fan · Peter Spirtes · Yumou Qiu · Kun Zhang
Abstract:
Recent advances have shown that statistical tests for the rank of cross-covariance matrices play an important role in causal discovery. These rank tests include partial correlation tests as special cases and provide further graphical information about latent variables. Existing rank tests typically assume that all the continuous variables can be perfectly measured, and yet, in practice many variables can only be measured after discretization. For example, in psychometric studies, the continuous level of certain personality dimensions of a person can only be measured after being discretized into order-preserving options such as disagree, neutral, and agree. Motivated by this, we propose Mixed data Permutation-based Rank Test (MPRT), which properly controls the statistical errors even when some or all variables are discretized. Theoretically, we establish the exchangeability and estimate the asymptotic null distribution by permutations; as a consequence, MPRT can effectively control the Type I error in the presence of discretization while previous methods cannot. Empirically, our method is validated by extensive experiments on synthetic data and real-world data to demonstrate its effectiveness as well as applicability in causal discovery.
Paperid:1733
Authors:Yuhao Liu · Yu Chen · Rui Hu · Longbo Huang
Abstract:
The stochastic interpolant framework offers a powerful approach for constructing generative models based on ordinary differential equations (ODEs) or stochastic differential equations (SDEs) to transform arbitrary data distributions. However, prior analyses of this framework have primarily focused on the continuous-time setting, assuming perfect solution of the underlying equations. In this work, we present the first discrete-time analysis of the stochastic interpolant framework, where we introduce an innovative discrete-time sampler and derive a finite-time upper bound on its distribution estimation error. Our result provides a novel quantification of how different factors, including the distance between source and target distributions and estimation accuracy, affect the convergence rate, and also offers a new principled way to design efficient schedules for convergence acceleration. Finally, numerical experiments are conducted on the discrete-time sampler to corroborate our theoretical findings.
Paperid:1734
Authors:Chenhui Xu · Dancheng Liu · Yuting Hu · Jiajie Li · Ruiyang Qin · Qingxiao Zheng · Jinjun Xiong
Abstract:
Physics-Informed Neural Networks (PINNs) are deep-learning-based numerical solvers for partial differential equations (PDEs). Existing PINNs often suffer from failure modes of being unable to propagate patterns of initial conditions. We discover that these failure modes are caused by the simplicity bias of neural networks and the mismatch between a PDE's continuity and a PINN's discrete sampling. We reveal that the State Space Model (SSM) can be a continuous-discrete articulation allowing initial condition propagation, and that simplicity bias can be eliminated by aligning a sequence of moderate granularity. Accordingly, we propose PINNMamba, a novel framework that introduces sub-sequence modeling with SSM. Experimental results show that PINNMamba can reduce errors by up to 86.3\% compared with state-of-the-art architectures. Our code is available in the Supplementary Material.
Paperid:1735
Authors:Jianke Zhang · Yanjiang Guo · Yucheng Hu · Xiaoyu Chen · Xiang Zhu · Jianyu Chen
Abstract:
Recent advancements in Vision-Language-Action (VLA) models have leveraged pre-trained Vision-Language Models (VLMs) to improve the generalization capabilities. VLMs, typically pre-trained on vision-language understanding tasks, provide rich semantic knowledge and reasoning abilities. However, prior research has shown that VLMs often focus on high-level semantic content and neglect low-level features, limiting their ability to capture detailed spatial information and understand physical dynamics. These aspects, which are crucial for embodied control tasks, remain underexplored in existing pre-training paradigms. In this paper, we investigate the training paradigm for VLAs and introduce \textbf{UP-VLA}, a \textbf{U}nified VLA model trained with both multi-modal \textbf{U}nderstanding and future \textbf{P}rediction objectives, enhancing both high-level semantic comprehension and low-level spatial understanding. Experimental results show that UP-VLA achieves a 33\% improvement on the Calvin ABC-D benchmark compared to the previous state-of-the-art method. Additionally, UP-VLA demonstrates improved success rates in real-world manipulation tasks, particularly those requiring precise spatial information.
Paperid:1736
Authors:Yingzhao Jian · Yue Zhang · Ying Wei · Hehe Fan · Yi Yang
Abstract:
Accurately modeling chemical reactions using Artificial Intelligence (AI) can accelerate discovery and development, especially in fields like drug design and material science. Although AI has made remarkable advancements in single molecule recognition, such as predicting molecular properties, the study of interactions between molecules, particularly chemical reactions, has been relatively overlooked. In this paper, we introduce Reaction Graph (RG), a unified graph representation that encapsulates the 3D molecular structures within chemical reactions. RG integrates the molecular graphs of reactants and products into a cohesive framework, effectively capturing the interatomic relationships pertinent to the reaction process. Additionally, it incorporates the 3D structure information of molecules in a simple yet effective manner. We conduct experiments on a range of tasks, including chemical reaction classification, condition prediction, and yield prediction. RG achieves the highest accuracy across six datasets, demonstrating its effectiveness. The code is in the supplementary materials and will be publicly available.
Paperid:1737
Authors:Jonas Zausinger · Lars Pennig · Anamarija Kozina · Sean Sdahl · Julian Sikora · Adrian Dendorfer · Timofey Kuznetsov · Mohamad Hagog · Nina Wiedemann · Kacper Chlodny · Vincent Limbach · Anna Ketteler · Thorben Prein · Vishwa Singh · Michael Danziger · Jannis Born
Abstract:
While language models have exceptional capabilities at text generation, they lack a natural inductive bias for emitting numbers and thus struggle in tasks involving quantitative reasoning, especially arithmetic. This is particularly relevant in scientific datasets where combinations of text and numbers are abundant. One fundamental limitation is the nature of the cross-entropy (CE) loss, which assumes a nominal scale and thus cannot convey proximity between generated number tokens. In response, we here present a regression-like loss that operates purely on the token level. Our proposed Number Token Loss (NTL) comes in two flavors and minimizes either the $L^p$ norm or the Wasserstein distance between the numerical values of the real and predicted number tokens. NTL can easily be added to any language model and extends the CE objective during training without runtime overhead. We evaluate the proposed scheme on various mathematical datasets and find that it consistently improves performance in math-related tasks. In a direct comparison on a regression task, we find that NTL can match the performance of a regression head, despite operating on the token level. Finally, we scale NTL up to 3B parameters and observe improved performance, demonstrating its potential for seamless integration into LLMs.
Paperid:1738
Authors:Yinhong Liu · Zhijiang Guo · Tianya Liang · Ehsan Shareghi · Ivan Vulić · Nigel Collier
Abstract:
Large Language Models (LLMs) are expected to be predictable and trustworthy to support reliable decision-making systems. Yet current LLMs often show inconsistencies in their judgments. In this work, we examine \textit{logical preference consistency} as a foundational requirement for building more dependable LLM systems, ensuring stable and coherent decision-making while minimizing erratic or contradictory outputs. To quantify logical preference consistency, we propose a universal evaluation framework based on three fundamental properties: transitivity, commutativity, and negation invariance. Through extensive experimentation across diverse LLMs, we demonstrate that these properties serve as strong indicators of judgment robustness. Furthermore, we introduce a data refinement and augmentation technique, REPAIR, that enhances logical consistency while maintaining alignment with human preferences. Finally, we show that improving consistency leads to better performance in LLM-driven logic-based algorithms, reinforcing stability and coherence in decision-making systems.
Paperid:1739
Authors:Hanyang Zhao · Haoxian Chen · Ji Zhang · David Yao · Wenpin Tang
Abstract:
Reinforcement learning from human feedback (RLHF), which aligns a diffusion model with an input prompt, has become a crucial step in building reliable generative AI models. Most works in this area use a discrete-time formulation, which is prone to induced errors and often not applicable to models with higher-order/black-box solvers. The objective of this study is to develop a disciplined approach to fine-tuning diffusion models using continuous-time RL, formulated as a stochastic control problem with a reward function that aligns the end result (terminal state) with the input prompt. The key idea is to treat score matching as controls or actions, thereby connecting to policy optimization and regularization in continuous-time RL. To carry out this idea, we lay out a new policy optimization framework for continuous-time RL, and illustrate its potential in enhancing the value network design space by leveraging the structural property of diffusion models. We validate the advantages of our method through experiments on downstream tasks of fine-tuning large-scale Text2Image models, namely Stable Diffusion v1.5.
Paperid:1740
Authors:Tyler Clark · Mark Towers · Christine Evers · Jonathon Hare
Abstract:
Rainbow Deep Q-Network (DQN) demonstrated that combining multiple independent enhancements could significantly boost a reinforcement learning (RL) agent’s performance. In this paper, we present “Beyond The Rainbow” (BTR), a novel algorithm that integrates six improvements from across the RL literature into Rainbow DQN, establishing a new state-of-the-art for RL using a desktop PC, with a human-normalized interquartile mean (IQM) of 7.6 on Atari-60. Beyond Atari, we demonstrate BTR’s capability to handle complex 3D games, successfully training agents to play Super Mario Galaxy, Mario Kart, and Mortal Kombat with minimal algorithmic changes. Designing BTR with computational efficiency in mind, agents can be trained using a high-end desktop PC on 200 million Atari frames within 12 hours. Additionally, we conduct detailed ablation studies of each component, analyzing the performance and impact using numerous measures.
Paperid:1741
Authors:Yifu Yuan · Zhenrui Zheng · Zibin Dong · Jianye Hao
Abstract:
Multi-objective Reinforcement Learning (MORL) seeks to develop policies that simultaneously optimize multiple conflicting objectives, but it requires extensive online interactions. Offline MORL provides a promising solution by training on pre-collected datasets to generalize to any preference upon deployment. However, real-world offline datasets are often conservatively and narrowly distributed, failing to comprehensively cover preferences, leading to the emergence of out-of-distribution (OOD) preference areas. Existing offline MORL algorithms exhibit poor generalization to OOD preferences, resulting in policies that do not align with preferences. Leveraging the excellent expressive and generalization capabilities of diffusion models, we propose MODULI (Multi-objective Diffusion Planner with Sliding Guidance), which employs a preference-conditioned diffusion model as a planner to generate trajectories that align with various preferences and derives actions for decision-making. To achieve accurate generation, MODULI introduces two return normalization methods under diverse preferences for refining guidance. To further enhance generalization to OOD preferences, MODULI proposes a novel sliding guidance mechanism, which involves training an additional slider adapter to capture the direction of preference changes. Incorporating the slider, it transitions from in-distribution (ID) preferences to generating OOD preferences, patching and extending the incomplete Pareto front. Extensive experiments on the D4MORL benchmark demonstrate that our algorithm outperforms state-of-the-art offline MORL baselines, exhibiting excellent generalization to OOD preferences.
Paperid:1742
Authors:Yu Liang · Yu Yang · Wenjie Wei · Ammar Belatreche · Shuai Wang · Malu Zhang · Yang Yang
Abstract:
Binary Spiking Neural Networks (BSNNs) offer promising efficiency advantages for resource-constrained computing. However, their training algorithms often require substantial memory overhead due to latent weight storage and temporal processing requirements. To address this, we propose Binary Spiking Online Optimization (BSO), a novel online training algorithm that significantly reduces training memory while maintaining energy efficiency during the forward pass. BSO achieves this through two key innovations: memory requirements independent of time steps and elimination of latent weight storage. BSO directly updates binary weights through flip signals using the online training framework. These signals are triggered when gradient momentum exceeds a threshold, eliminating the need for latent weights during training. To leverage the inherent temporal dynamics of BSNNs, we further introduce T-BSO, a temporal-aware variant that captures gradient information across time steps for adaptive threshold adjustment. Theoretical analysis establishes convergence guarantees for both BSO and T-BSO, with formal regret bounds characterizing their convergence rates. Extensive experiments demonstrate that both BSO and T-BSO achieve superior optimization performance compared to existing training methods for binary spiking neural networks while significantly reducing memory overhead during training. Our work presents the first effective integration of online training with BSNNs, advancing their practical applicability in resource-constrained scenarios.
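The flip-signal idea above can be sketched in a few lines: binary weights are stored directly as signs, a gradient momentum is tracked per weight, and a weight is flipped only when the momentum crosses a threshold in the direction that would reduce the loss. The function name, the default hyperparameters, and the momentum reset after a flip are our assumptions for illustration, not the paper's exact algorithm.

```python
def bso_step(weights, grads, momentum, beta=0.9, tau=0.1):
    """One illustrative BSO-style update (sketch; names/defaults assumed).

    weights: list of +1/-1 binary weights -- no latent real-valued copy.
    momentum: running gradient momentum, updated in place.
    A flip signal fires when the momentum exceeds the threshold tau in
    magnitude AND shares the weight's sign (so the loss decreases by
    moving the weight the other way).
    """
    for i, g in enumerate(grads):
        momentum[i] = beta * momentum[i] + (1 - beta) * g
        if abs(momentum[i]) > tau and momentum[i] * weights[i] > 0:
            weights[i] = -weights[i]
            momentum[i] = 0.0  # reset accumulated evidence after a flip
    return weights

w = [1, -1, 1]
m = [0.0, 0.0, 0.0]
# a strong positive gradient on the first (+1) weight triggers a flip
bso_step(w, [5.0, 0.0, 0.0], m)
assert w == [-1, -1, 1]
```

Because only signs and one momentum scalar per weight are kept, the memory footprint is independent of the number of time steps, matching the abstract's first claimed innovation.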
Paperid:1743
Authors:Vincent Roulet · Tianlin Liu · Nino Vieillard · Michael Sander · Mathieu Blondel
Abstract:
The logistic loss (a.k.a. cross-entropy loss) is one of the most popular loss functions used for multiclass classification. It is also the loss function of choice for next-token prediction in language modeling. It is associated with the Kullback-Leibler (KL) divergence and the softargmax operator. In this work, we propose to construct new convex loss functions based on $f$-divergences. Our loss functions generalize the logistic loss in two directions: i) by replacing the KL divergence with $f$-divergences and ii) by allowing non-uniform reference measures. We instantiate our framework for numerous $f$-divergences, recovering existing losses and creating new ones. By analogy with the logistic loss, the loss function generated by an $f$-divergence is associated with an operator that we dub the $f$-softargmax. We derive a novel parallelizable bisection algorithm for computing the $f$-softargmax associated with any $f$-divergence. On the empirical side, one of the goals of this paper is to determine the effectiveness of loss functions beyond the classical cross-entropy in a language model setting, including on pre-training, post-training (SFT) and distillation. We show that the loss function generated by the $\alpha$-divergence (which is equivalent to Tsallis $\alpha$-negentropy in the case of unit reference measures) with $\alpha=1.5$ performs well across several tasks.
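To make the bisection idea concrete, here is a generic sketch: given scores, find a threshold $\tau$ so that the induced probabilities sum to one. The interface (a user-supplied inverse link `f_prime_inv`, the bracketing interval, the iteration count) is our assumption; the paper's algorithm is more general and parallelized, and we only check the KL special case, where the inverse link $\exp$ recovers the ordinary softargmax.

```python
import math

def f_softargmax_bisect(scores, f_prime_inv, lo=-100.0, hi=100.0, iters=60):
    """Bisection sketch for a softargmax-style map (illustrative only).

    Finds tau such that p_i = f_prime_inv(s_i - tau) sums to 1, then
    returns the probabilities. f_prime_inv must be nonnegative and
    nondecreasing, so the total mass is decreasing in tau.
    """
    def mass(tau):
        return sum(f_prime_inv(s - tau) for s in scores)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mass(mid) > 1.0:
            lo = mid   # too much mass: raise the threshold
        else:
            hi = mid
    tau = (lo + hi) / 2.0
    return [f_prime_inv(s - tau) for s in scores]

# KL case: f_prime_inv(u) = exp(u) recovers the ordinary softargmax
probs = f_softargmax_bisect([1.0, 2.0, 3.0], math.exp)
assert abs(sum(probs) - 1.0) < 1e-6
```

Each bisection step only evaluates the scalar mass function, which is why the per-coordinate work parallelizes trivially.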
Paperid:1744
Authors:Xinyi Lu · Hao Zhang · Chenglin Li · Weijia Lu · ZHIFEI YANG · Wenrui Dai · xiaodong Zhang · Xiaofeng Ma · Can Zhang · Junni Zou · Hongkai Xiong
Abstract:
The significant communication overhead and client data heterogeneity have posed an important challenge to the current federated learning (FL) paradigm. Existing compression-based and optimization-based FL algorithms typically focus on addressing either the model compression challenge or the data heterogeneity issue individually, rather than tackling both of them. In this paper, we observe that by symbolizing the client model updates to be uploaded (i.e., normalizing the magnitude of each model parameter at local clients), the model heterogeneity that essentially stems from data heterogeneity can be mitigated, thereby helping improve the overall generalization performance of the globally aggregated model at the server. Inspired by this observation, and further motivated by the success of the Lion optimizer in achieving optimal performance on most tasks in centralized learning, we propose a new FL algorithm, called FedSMU, which simultaneously reduces the communication overhead and alleviates the data heterogeneity issue. Specifically, FedSMU splits the standard Lion optimizer into local updates and global execution, where only the symbols of client model updates are communicated between the client and server. We theoretically prove the convergence of FedSMU for general non-convex settings. Through extensive experimental evaluations on several benchmark datasets, we demonstrate that our FedSMU algorithm not only reduces the communication overhead, but also achieves better generalization performance than other compression-based and optimization-based baselines.
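A minimal sketch of the client/server split described above, under our own simplifying assumptions (one local step per round, Lion-style sign of an interpolated momentum as the "symbol", hypothetical function names): only ±1/0 symbols travel over the network, which simultaneously compresses communication and normalizes update magnitudes across heterogeneous clients.

```python
def client_symbolize(grad, momentum, beta1=0.9, beta2=0.99):
    """Client side (illustrative sketch, not the paper's exact FedSMU).

    Computes Lion's update direction -- the sign of an interpolation of
    the momentum and the fresh gradient -- and uploads only the symbols.
    """
    symbols = []
    for i, g in enumerate(grad):
        u = beta1 * momentum[i] + (1 - beta1) * g
        symbols.append(1 if u > 0 else (-1 if u < 0 else 0))
        momentum[i] = beta2 * momentum[i] + (1 - beta2) * g
    return symbols

def server_step(model, client_symbols, lr=0.1):
    """Server side: average the symbols across clients and take the step."""
    n = len(client_symbols)
    return [w - lr * sum(s[i] for s in client_symbols) / n
            for i, w in enumerate(model)]

m1, m2 = [0.0, 0.0], [0.0, 0.0]
s1 = client_symbolize([1.0, -2.0], m1)   # -> [1, -1]
s2 = client_symbolize([3.0, 4.0], m2)    # -> [1, 1]
new_model = server_step([0.0, 0.0], [s1, s2])
```

On the second coordinate the clients disagree, so their symbols cancel at the server; magnitudes of the local gradients never influence the aggregate.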
Paperid:1745
Authors:Angéline Pouget · Mohammad Yaghini · Stephan Rabanser · Nicolas Papernot
Abstract:
Deploying machine learning models in safety-critical domains poses a key challenge: ensuring reliable model performance on downstream user data without access to ground truth labels for direct validation. We propose the suitability filter, a novel framework designed to detect performance deterioration by utilizing suitability signals - model output features that are sensitive to covariate shifts and indicative of potential prediction errors. The suitability filter evaluates whether classifier accuracy on unlabeled user data shows significant degradation compared to the accuracy measured on the labeled test dataset. Specifically, it ensures that this degradation does not exceed a pre-specified margin, which represents the maximum acceptable drop in accuracy. To achieve reliable performance evaluation, we aggregate suitability signals for both test and user data and compare these empirical distributions using statistical hypothesis testing, thus providing insights into decision uncertainty. Our modular method adapts to various models and domains. Empirical evaluations across different classification tasks demonstrate that the suitability filter reliably detects performance deviations due to covariate shift. This enables proactive mitigation of potential failures in high-stakes applications.
Paperid:1746
Authors:Deyuan Liu · Zecheng Wang · Bingning Wang · Weipeng Chen · Chunshan Li · Zhiying Tu · Dianhui Chu · Dianbo Sui
Abstract:
The rapid proliferation of large language models (LLMs), such as GPT-4 and Gemini, underscores the intense demand for resources during their training processes, posing significant challenges due to substantial computational and environmental costs. In this paper, we introduce a novel checkpoint merging strategy aimed at making efficient use of intermediate checkpoints during LLM pretraining. This method utilizes intermediate checkpoints with shared training trajectories, and is rooted in an extensive search-space exploration for the best merging weight via Bayesian optimization. Through various experiments, we demonstrate that: (1) our proposed methodology can augment pretraining, offering substantial benefits at minimal cost; (2) despite requiring a given held-out dataset, our proposed methodology still demonstrates robust generalization capabilities across diverse domains, a pivotal aspect in pretraining.
Paperid:1747
Authors:Junhan Kim · Ho-young Kim · Eulrang Cho · Chungman Lee · Joonyoung Kim · Yongkweon Jeon
Abstract:
Quantization is a promising solution for deploying large language models (LLMs) on resource-constrained devices. Early methods developed for smaller networks like ResNet rely on gradient-based optimization, which becomes impractical for hyper-scale LLMs with billions of parameters. While recently proposed backpropagation-free or transformation-based methods alleviate this issue, their performance remains limited by either a lack of inter-layer dependency consideration or the use of naive nearest-rounding-based integer weight assignment to save the heavy computational cost of weight optimization. We thus introduce a novel backpropagation-free PTQ algorithm that optimizes integer weights by considering inter-layer dependencies. The key innovation is the development of attention-aware Hessian matrices that capture inter-layer interactions within the attention module. Extensive experiments demonstrate that our approach not only outperforms existing weight quantization methods but also shows good synergy with conventional methods to suppress activation outliers, leading to state-of-the-art weight-activation quantization performance.
Paperid:1748
Authors:Fei Zhang · Pei Zhang · Baosong Yang · Fei Huang · Ya Zhang · Yanfeng Wang
Abstract:
This paper presents the first study on adapting the visual in-context learning (V-ICL) paradigm to optical character recognition tasks, specifically focusing on text removal and segmentation. Most existing V-ICL generalists employ a reasoning-as-reconstruction approach: they use a straightforward image-label compositor as the prompt and query input, and then mask the query label to generate the desired output. This direct prompt confines the model to a challenging single-step reasoning process. To address this, we propose a task-chaining compositor in the form of image-removal-segmentation, providing an enhanced prompt that elicits reasoning with enriched intermediates. Additionally, we introduce context-aware aggregation, integrating the chained prompt pattern into the latent query representation, thereby strengthening the model's in-context understanding for improved reasoning. We also consider the issue of visual heterogeneity, which complicates the selection of homogeneous demonstrations in text recognition. This is effectively addressed through a simple self-prompting strategy, preventing the model's in-context learnability from devolving into specialist-like, context-free inference. Collectively, these insights culminate in our ConText model, which achieves new state-of-the-art results across both in- and out-of-domain benchmarks.
Paperid:1749
Authors:Elias Chaibub Neto
Abstract:
The development of deep generative models for tabular data is currently a very active research area in machine learning. These models, however, tend to be computationally heavy and require careful tuning of multiple model parameters. In this paper, we propose TabSDS, a lightweight, non-parametric, and model-free alternative to tabular deep generative models which leverages rank and data shuffling transformations for generating synthetic data that closely approximates the joint probability distribution of the real data. We evaluate TabSDS against multiple baselines implemented in the Synthcity Python library across several datasets. TabSDS shows very competitive performance against all baselines (including TabDDPM, the current state-of-the-art in tabular data generation). Importantly, the execution time of TabSDS is orders of magnitude faster than that of the deep generative baselines, and also considerably faster than other computationally efficient baselines such as adversarial random forests.
Paperid:1750
Authors:Deyu Bo · Xinchao Wang
Abstract:
This study introduces dataset distillation (DD) tailored for 3D data, particularly point clouds. DD aims to substitute large-scale real datasets with a small set of synthetic samples while preserving model performance. Existing methods mainly focus on structured data such as images. However, adapting DD for unstructured point clouds poses challenges due to their diverse orientations and resolutions in 3D space. To address these challenges, we theoretically demonstrate the importance of matching rotation-invariant features between real and synthetic data for 3D distillation. We further propose a plug-and-play point cloud rotator to align the point cloud to a canonical orientation, facilitating the learning of rotation-invariant features by all point cloud models. Furthermore, instead of optimizing fixed-size synthetic data directly, we devise a point-wise generator to produce point clouds at various resolutions based on the sampled noise amount. Compared to conventional DD methods, the proposed approach, termed DD3D, enables efficient training on low-resolution point clouds while generating high-resolution data for evaluation, thereby significantly reducing memory requirements and enhancing model scalability. Extensive experiments validate the effectiveness of DD3D in shape classification and part segmentation tasks across diverse scenarios, such as cross-architecture and cross-resolution settings.
Paperid:1751
Authors:Ashkan Shahbazi · Elaheh Akbari · Darian Salehi · XINRAN LIU · Navid NaderiAlizadeh · Soheil Kolouri
Abstract:
While self-attention has been instrumental in the success of Transformers, it can lead to over-concentration on a few tokens during training, resulting in suboptimal information flow. Enforcing doubly stochastic constraints in attention matrices has been shown to improve structure and balance in attention distributions. However, existing methods rely on iterative Sinkhorn normalization, which is computationally costly. In this paper, we introduce a novel, fully parallelizable doubly stochastic attention mechanism based on sliced optimal transport, leveraging Expected Sliced Transport Plans (ESP). Unlike prior approaches, our method enforces double stochasticity without iterative Sinkhorn normalization, significantly enhancing efficiency. To ensure differentiability, we incorporate a temperature-based soft sorting technique, enabling seamless integration into deep learning models. Experiments on multiple benchmarks spanning image classification, point cloud classification, sentiment analysis, and neural machine translation demonstrate enhanced attention regularization, resulting in superior performance across diverse applications.
Paperid:1752
Authors:Harrish Thasarathan · Julian Forsyth · Thomas Fel · Matthew Kowal · Konstantinos Derpanis
Abstract:
We present Universal Sparse Autoencoders (USAEs), a framework for uncovering and aligning interpretable concepts spanning multiple pretrained deep neural networks. Unlike existing concept-based interpretability methods, which focus on a single model, USAEs jointly learn a universal concept space that can reconstruct and interpret the internal activations of multiple models at once. Our core insight is to train a single, overcomplete sparse autoencoder (SAE) that ingests activations from any model and decodes them to approximate the activations of any other model under consideration. By optimizing a shared objective, the learned dictionary captures common factors of variation—concepts—across different tasks, architectures, and datasets. We show that USAEs discover semantically coherent and important universal concepts across vision models, ranging from low-level features (e.g., colors and textures) to higher-level structures (e.g., parts and objects). Overall, USAEs provide a powerful new method for interpretable cross-model analysis and offer novel applications—such as coordinated activation maximization—that open avenues for deeper insights in multi-model AI systems.
Paperid:1753
Authors:Xichen Ye · Yifan Wu · Weizhong Zhang · Cheng Jin · Yifan Chen
Abstract:
The Influence Function (IF) is a widely used technique for assessing the impact of individual training samples on model predictions. However, existing IF methods often fail to provide reliable influence estimates in deep neural networks, particularly when applied to noisy training data. This issue does not stem from inaccuracies in parameter change estimation, which has been the primary focus of prior research, but rather from deficiencies in loss change estimation, specifically due to the sharpness of validation risk. In this work, we establish a theoretical connection between influence estimation error, validation set risk, and its sharpness, underscoring the importance of flat validation minima for accurate influence estimation. Furthermore, we introduce a novel estimation form of the Influence Function specifically designed for flat validation minima. Experimental results across various tasks validate the superiority of our approach. The model implementation and the code are provided in the supplementary materials and scheduled to be open-sourced upon publication.
Paperid:1754
Authors:Tianyu Chen · Haoyi Zhou · Ying Li · Hao Wang · Chonghan Gao · Rongye Shi · Shanghang Zhang · Jianxin Li
Abstract:
Foundation models have revolutionized language modeling, but whether this success can be replicated in scientific computing remains unexplored. We present OmniArch, the first prototype aiming at solving multi-scale and multi-physics scientific computing problems with physical alignment. We address all three challenges with one unified architecture: the pre-training stage contains a Fourier encoder-decoder that fades out the disharmony across separated dimensions and a Transformer backbone that integrates quantities through temporal dynamics, while the novel PDE-Aligner performs physics-informed fine-tuning under flexible conditions. To the best of our knowledge, we are the first to conduct 1D-2D-3D unified pre-training on PDEBench; the resulting model not only sets new performance benchmarks for 1D, 2D, and 3D PDEs but also demonstrates exceptional adaptability to new physics via in-context and zero-shot learning, which supports realistic engineering applications and foresight physics discovery.
Paperid:1755
Authors:Jingwei Li · Jing Dong · Tianxing He · Jingzhao Zhang
Abstract:
Given the rising popularity of AI-generated art and the associated copyright concerns, identifying whether an artwork was used to train a diffusion model is an important research topic. This work approaches the problem from the membership inference attack (MIA) perspective. We first identify a limitation of applying existing MIA methods to proprietary diffusion models: the required access to internal U-Nets. To address this problem, we introduce a novel membership inference attack method that uses only the image-to-image variation API and operates without access to the model's internal U-Net. Our method is based on the intuition that the model can more easily obtain an unbiased noise prediction estimate for images from the training set. By applying the API multiple times to the target image, averaging the outputs, and comparing the result to the original image, our approach can classify whether a sample was part of the training set. We validate our method using DDIM and Stable Diffusion setups and further extend both our approach and existing algorithms to the Diffusion Transformer architecture. Our experimental results consistently outperform previous methods.
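The query-average-compare loop described above can be sketched as a single scoring function. Here `variation_api` is a hypothetical stand-in for an image-to-image variation endpoint, and images are flat lists of pixel values; both are our assumptions for illustration.

```python
def membership_score(image, variation_api, n_queries=8):
    """Illustrative membership-inference score (sketch, not the paper's code).

    Intuition from the abstract: for training-set images the model's noise
    prediction is closer to unbiased, so repeated variations average back
    toward the original. A LOW score therefore suggests membership.
    """
    # query the variation API several times and average pixel-wise
    outputs = [variation_api(image) for _ in range(n_queries)]
    avg = [sum(px) / n_queries for px in zip(*outputs)]
    # L2 distance between the averaged variation and the original image
    return sum((a - b) ** 2 for a, b in zip(avg, image)) ** 0.5

# toy check with a deterministic fake API that shifts every pixel by 1
fake_api = lambda img: [x + 1.0 for x in img]
score = membership_score([0.0, 0.0], fake_api, n_queries=4)
```

In practice one would calibrate a threshold on the score over images with known membership; everything here uses only black-box API access, which is the point of the method.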
Paperid:1756
Authors:Shayan Kiyani · George Pappas · Aaron Roth · Hamed Hassani
Abstract:
A fundamental question in data-driven decision making is how to quantify the uncertainty of predictions to inform risk-sensitive downstream actions, as often required in domains such as medicine. We develop a decision-theoretic foundation linking prediction sets to risk-averse decision-making, addressing three questions: (1) What is the correct notion of uncertainty quantification for risk-averse decision makers? We prove that prediction sets are optimal for decision makers who wish to optimize their value at risk. (2) What is the optimal policy that a risk-averse decision maker should use to map prediction sets to actions? We show that a simple max-min decision policy is optimal for risk-averse decision makers. Finally, (3) How can we derive prediction sets that are optimal for such decision makers? We provide an exact characterization in the population regime and a distribution-free finite-sample construction. These insights lead to Risk-Averse Calibration (RAC), a principled algorithm that is both practical (exploiting black-box predictions to enhance downstream utility) and safe (adhering to user-defined risk thresholds). We experimentally demonstrate RAC's advantages in medical diagnosis and recommendation systems, showing that it substantially improves the trade-off between safety and utility, delivering higher utility than existing methods while avoiding critical errors.
Paperid:1757
Authors:Kaiyu Guo · Zijian Wang · Tan Pan · Brian Lovell · Mahsa Baktashmotlagh
Abstract:
Out-of-Distribution (OOD) detection is essential for the trustworthiness of AI systems. Methods using prior information (i.e., subspace-based methods) have shown effective performance by extracting information geometry to detect OOD data with a more appropriate distance metric. However, these methods fail to address the geometry distorted by ill-distributed samples, due to the limitation of statically extracting information geometry from the training distribution. In this paper, we argue that the influence of ill-distributed samples can be corrected by dynamically adjusting the prior geometry in response to new data. Based on this insight, we propose a novel approach that dynamically updates the prior covariance matrix using real-time input features, refining its information. Specifically, we reduce the covariance along the direction of real-time input features and constrain adjustments to the residual space, thus preserving essential data characteristics and avoiding effects on unintended directions in the principal space. We evaluate our method on two pre-trained models for the CIFAR dataset and five pre-trained models for ImageNet-1k, including the self-supervised DINO model. Extensive experiments demonstrate that our approach significantly enhances OOD detection across various models. Our paper's code will be made available.
Paperid:1758
Authors:Yifan Zhang · Ge Zhang · Yue Wu · Kangping Xu · Quanquan Gu
Abstract:
Modeling human preferences is crucial for aligning foundation models with human values. Traditional reward modeling methods, such as the Bradley-Terry (BT) reward model, fall short in expressiveness, particularly in addressing intransitive preferences. In this paper, we introduce preference embedding, an approach that embeds responses into a latent space to capture intricate preference structures efficiently, achieving linear query complexity. Additionally, we propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback (RLHF). Experimental results show that our General Preference embedding Model (GPM) consistently outperforms the BT reward model on the RewardBench benchmark and effectively models cyclic preferences, where any BT reward model behaves like a random guess. Furthermore, evaluations on downstream tasks such as AlpacaEval 2.0, following language model post-training with GPO and our general preference model, reveal performance improvements over BT models. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values. The code is available at https://anonymous.4open.science/r/gpm-anonymous-1627.
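The claim that any Bradley-Terry reward model degenerates to a random guess on cyclic preferences follows from a one-line identity, sketched below (our own illustration, not code from the paper): the pairwise logits around any cycle telescope to zero, so no scalar reward assignment can make all three preferences in a cycle exceed one half.

```python
import math

def bt_prob(r_i, r_j):
    """Bradley-Terry win probability of i over j from scalar rewards."""
    return 1.0 / (1.0 + math.exp(r_j - r_i))

def cycle_logit_sum(ra, rb, rc):
    """Sum of the pairwise logits around the cycle a>b, b>c, c>a.

    The differences telescope, so this is identically zero for ANY
    scalar rewards -- the cycle cannot be represented, and the best
    BT fit to perfectly cyclic data is the 50/50 random guess.
    """
    return (ra - rb) + (rb - rc) + (rc - ra)

assert abs(cycle_logit_sum(0.3, -1.2, 5.0)) < 1e-12
assert bt_prob(1.0, 0.0) > 0.5 > bt_prob(0.0, 1.0)
```

A preference-embedding score, by contrast, is not constrained to a single scalar per response, which is what allows it to represent such cycles.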
Paperid:1759
Authors:peijie liu · Fengli Xu · Yong Li
Abstract:
Chain-of-Thought (CoT) reasoning has been shown to enhance the performance of large language models (LLMs) on complex tasks. However, the effectiveness of CoT is inconsistent: its performance varies both across tasks and within the same task when evaluated on different models. The mechanism behind the effectiveness of CoT remains a long-standing research question. In this work, we initially observe that the monotonic nature of the token probability distribution might be linked to CoT gain. Leveraging this insight, we propose two metrics based on the token probability distribution to assess CoT effectiveness across different benchmarks. By combining instance-level indicators with logistic regression models, we design Dynamic CoT to dynamically select between CoT and direct answers. Furthermore, leveraging an ensemble approach, we extend Dynamic CoT to closed-source models. The results show that our proposed indicators for assessing CoT effectiveness achieve an accuracy of 87.5%, and Dynamic CoT reduces token consumption by more than 35% while maintaining high accuracy. Overall, we understand CoT reasoning from a novel perspective and provide a framework for its more efficient deployment, with the potential to enhance LLM performance across a range of tasks.
Paperid:1760
Authors:Jianting Chen
Abstract:
Coreset selection (CS) for deep learning has become crucial for enhancing training efficiency and understanding datasets by identifying the most informative subsets. However, most existing methods rely on heuristics or complex optimization, struggling to balance efficiency and theoretical guarantees. To address this, we propose a novel CS objective that adaptively balances losses between core-set and non-core-set samples by minimizing the sum of squared losses across all samples. Building on this objective, we introduce the Maximum Reduction as Maximum Contribution criterion (MRMC), which identifies samples with the maximal reduction in loss as those making the maximal contribution to overall convergence. Additionally, a balance constraint is incorporated to ensure an even distribution of contributions from the core-set. Experimental results demonstrate that MRMC improves training efficiency significantly while preserving model performance with minimal cost.
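The selection criterion can be sketched as a simple ranking: samples whose loss dropped the most between two training checkpoints are taken as contributing the most to convergence. This is our own simplified illustration (the paper's criterion is derived from the squared-loss objective and adds a balance constraint we omit here).

```python
def mrmc_select(prev_losses, curr_losses, k):
    """Illustrative MRMC-style core-set selection (simplified sketch).

    prev_losses / curr_losses: per-sample losses at two checkpoints.
    Ranks samples by loss reduction (maximum reduction = maximum
    contribution) and keeps the indices of the top k.
    """
    reductions = [(p - c, i)
                  for i, (p, c) in enumerate(zip(prev_losses, curr_losses))]
    reductions.sort(reverse=True)  # largest reduction first
    return sorted(i for _, i in reductions[:k])

# sample 0 barely improved; samples 1 and 2 drove most of the progress
assert mrmc_select([2.0, 1.0, 3.0], [1.9, 0.2, 1.0], k=2) == [1, 2]
```

A full implementation would re-run this selection periodically and enforce the balance constraint so that no region of the data dominates the core-set.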
Paperid:1761
Authors:Fahim Tajwar · Yiding Jiang · Abitha Thankaraj · Sumaita Rahman · Zico Kolter · Jeff Schneider · Ruslan Salakhutdinov
Abstract:
Efficient exploration is essential for intelligent systems interacting with their environment, but existing language models often fall short in scenarios that require strategic information gathering. In this paper, we present PAPRIKA, a fine-tuning approach that enables language models to develop general decision-making capabilities that are not confined to particular environments. By training on synthetic interaction data from different tasks that require diverse strategies, PAPRIKA teaches models to explore and adapt their behavior based on the environment feedback in context, without gradient updates. Experimental results show that models fine-tuned with PAPRIKA can effectively transfer their learned decision-making capabilities to entirely unseen tasks without additional training. We also introduce a curriculum learning algorithm that improves PAPRIKA's sample efficiency. These results suggest a promising path towards AI systems that can autonomously solve novel sequential decision-making problems that require interactions with the external world.
Paperid:1762
Authors:Lucas Fabian Naumann · Jannik Irmai · Bjoern Andres
Abstract:
The Quantum Alternating Operator Ansatz (QAOA) is a hybrid quantum-classical variational algorithm for approximately solving combinatorial optimization problems on Noisy Intermediate-Scale Quantum devices. Although it has been successfully applied to a variety of problems, there is only limited work on correlation clustering due to the difficulty of modelling the problem constraints with the ansatz. Motivated by this, we present a generalization of QAOA that is more suitable for this problem. In particular, we modify QAOA in two ways: Firstly, we use nucleus sampling for the computation of the cost function. Secondly, we split the problem into sub-problems, solving each individually with QAOA. We show theoretically that this generalization, which we call Sub-Problem Quantum Alternating Operator Ansatz (SQAOA), can obtain optimal solutions for correlation clustering with certainty when $p \rightarrow \infty$. Further, we show experimentally that SQAOA achieves better approximation ratios than QAOA for correlation clustering, while at the same time using only one qubit per node of the respective problem instance and reducing the runtime (of simulations).
Paperid:1763
Authors:Yilong Song · Peijin Li · Bin Gao · Kun Yuan
Abstract:
Optimization problems on the Stiefel manifold, ranging from principal component analysis to enhancing neural network robustness, are ubiquitous in machine learning. The Landing algorithm avoids computationally expensive retraction operations, making it highly competitive for large-scale problems. This paper extends this method to distributed settings, introducing EF-Landing, the first retraction-free and communication-efficient algorithm for distributed stochastic optimization on the Stiefel manifold. By incorporating communication compression and error feedback, EF-Landing ensures convergence and constraint feasibility while significantly reducing communication overhead. We provide sharp convergence guarantees, demonstrating that EF-Landing achieves the same asymptotic linear speedup convergence rate as existing methods without communication compression. Furthermore, our analysis is highly versatile, applying to both deterministic and stochastic settings and encompassing algorithms based on gradient descent or momentum gradient descent. We also generalize EF-Landing to operate on block-wise Stiefel manifolds, enabling greater flexibility for structured constraints. Extensive numerical experiments validate our theoretical results.
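For context, a minimal single-machine sketch of the landing update that EF-Landing builds on (step size, penalty weight, and the zero-gradient demonstration are illustrative; the compression and error-feedback components are not shown):

```python
import numpy as np

def landing_step(X, grad, eta=0.1, lam=1.0):
    # Retraction-free update: a relative-gradient descent term plus a
    # penalty lam * X (X^T X - I) attracting X back to the Stiefel manifold.
    skew = grad @ X.T - X @ grad.T
    penalty = X @ (X.T @ X - np.eye(X.shape[1]))
    return X - eta * (skew @ X + lam * penalty)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 2)))
X = 1.1 * Q                                   # start off the manifold
for _ in range(100):
    X = landing_step(X, np.zeros_like(X))     # zero gradient: pure attraction
orth_err = np.linalg.norm(X.T @ X - np.eye(2))
```

With a zero gradient the iterates simply "land" back on the constraint set; in the distributed algorithm, each worker's (compressed) stochastic gradient replaces the zero placeholder.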
Paperid:1764
Authors:Ruichen Xu · Kexin Chen
Abstract:
Recent theoretical studies (Kou et al., 2023; Cao et al., 2022) have revealed a sharp phase transition from benign to harmful overfitting when the noise-to-feature ratio exceeds a threshold, a situation common in long-tailed data distributions where atypical data is prevalent. However, harmful overfitting rarely happens in overparameterized neural networks. Further experimental results suggested that memorization is necessary for achieving near-optimal generalization error in long-tailed data distributions (Feldman & Zhang, 2020). We argue that this discrepancy between theoretical predictions and empirical observations arises because previous feature-noise data models overlook the heterogeneous nature of noise across different data classes. In this paper, we refine the feature-noise data model by incorporating class-dependent heterogeneous noise and re-examine the overfitting phenomenon in neural networks. Through a comprehensive analysis of the training dynamics, we establish test loss bounds for the refined model. Our findings reveal that neural networks can leverage "data noise", previously deemed harmful, to learn implicit features that improve the classification accuracy for long-tailed data. Experimental validation on both synthetic and real-world datasets supports our theoretical results.
Paperid:1765
Authors:Xiaorui Su · Shvat Messica · Yepeng Huang · Ruth Johnson · Lukas Fesser · Shanghua Gao · Faryad Sahneh · Marinka Zitnik
Abstract:
Foundation models trained on patient electronic health records (EHRs) require tokenizing medical data into sequences of discrete vocabulary items. Existing tokenizers treat medical codes from EHRs as isolated textual tokens. However, each medical code is defined by its textual description, its position in ontological hierarchies, and its relationships to other codes, such as disease co-occurrences and drug-treatment associations. Medical vocabularies contain more than 600,000 codes with critical information for clinical reasoning. We introduce MedTok, a multimodal medical code tokenizer that uses the text descriptions and relational context of codes. MedTok processes text using a language model encoder and encodes the relational structure with a graph encoder. It then quantizes both modalities into a unified token space, preserving modality-specific and cross-modality information. We integrate MedTok into five EHR models and evaluate it on operational and clinical tasks across in-patient and out-patient datasets, including outcome prediction, diagnosis classification, drug recommendation, and risk stratification. Swapping standard EHR tokenizers with MedTok improves AUPRC across all EHR models, by 4.10% on MIMIC-III, 4.78% on MIMIC-IV, and 11.30% on EHRShot, with the largest gains in drug recommendation. Beyond EHR modeling, we demonstrate the use of MedTok with medical QA systems. Our results demonstrate the potential of MedTok as a unified tokenizer for medical codes, improving tokenization for medical foundation models.
Paperid:1766
Authors:Xinyu Peng · Ziyang Zheng · Yaoming Wang · Han Li · Nuowen Kan · Wenrui Dai · Chenglin Li · Junni Zou · Hongkai Xiong
Abstract:
We propose Noise Conditional Variational Score Distillation (NCVSD), a novel method for distilling a pretrained diffusion model into a generative denoiser. We achieve this by revealing that the unconditional score function implicitly characterizes the score function of denoising posterior distributions. By integrating this insight into the Variational Score Distillation (VSD) framework, we enable scalable learning of generative denoisers capable of approximating samples from the denoising posterior distribution across a wide range of noise levels. The proposed generative denoisers exhibit desirable properties that allow fast generation while preserving the benefits of iterative refinement: (1) fast one-step generation through sampling from pure Gaussian noise at high noise levels; (2) improved sample quality by scaling the test-time compute with multi-step sampling; and (3) zero-shot probabilistic inference for flexible and controllable sampling. We evaluate NCVSD through extensive experiments, including class-conditional image generation and inverse problem solving. By scaling the test-time compute, our method outperforms teacher diffusion models and is on par with consistency models of larger sizes. Additionally, with significantly fewer NFEs than diffusion-based methods, we achieve record-breaking LPIPS on inverse problems.
Paperid:1767
Authors:Theodoros Kouzelis · Ioannis Kakogeorgiou · Spyros Gidaris · Nikos Komodakis
Abstract:
Latent generative models have emerged as a leading approach for high-quality image synthesis. These models rely on an autoencoder to compress images into a latent space, followed by a generative model to learn the latent distribution. We identify that existing autoencoders lack equivariance to semantic-preserving transformations like scaling and rotation, resulting in complex latent spaces that hinder generative performance. To address this, we propose EQ-VAE, a simple regularization approach that enforces equivariance in the latent space, reducing its complexity without degrading reconstruction quality. By fine-tuning pre-trained autoencoders with EQ-VAE, we enhance the performance of several state-of-the-art generative models, including DiT, SiT, REPA and MaskGIT, achieving a ×7 speedup on DiT-XL/2 with only five epochs of SD-VAE fine-tuning. EQ-VAE is compatible with both continuous and discrete autoencoders, thus offering a versatile enhancement for a wide range of latent generative models.
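The core idea can be sketched with a toy equivariance penalty (the encoder, the transformation, and where the loss is applied are illustrative stand-ins, not the paper's actual components):

```python
import numpy as np

def rot90(z):
    # Rotate a (C, H, W) array by 90 degrees in the spatial plane.
    return np.rot90(z, k=1, axes=(1, 2)).copy()

def equivariance_penalty(encode, x, transform):
    # Penalize the gap between encoding a transformed input and transforming
    # the encoding; it is zero when `encode` commutes with `transform` on x.
    gap = encode(transform(x)) - transform(encode(x))
    return float(np.mean(gap ** 2))

x = np.random.default_rng(0).standard_normal((3, 8, 8))
identity = lambda z: z                                       # trivially equivariant
row_scale = lambda z: z * np.arange(1, 9).reshape(1, 8, 1)   # breaks equivariance
p_id = equivariance_penalty(identity, x, rot90)
p_rs = equivariance_penalty(row_scale, x, rot90)
```

In the actual method the penalty involves decoding transformed latents rather than comparing latents directly, but the commutation principle is the same.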
Paperid:1768
Authors:Xiukun Wei · Xueru Zhang
Abstract:
Recent advances in generative models have made it increasingly difficult to distinguish real data from model-generated synthetic data. Using synthetic data for successive training of future model generations creates “self-consuming loops,” which may lead to model collapse or training instability. Furthermore, synthetic data is often subject to human feedback and curated by users based on their preferences. Ferbach et al. (2024) recently showed that when data is curated according to user preferences, the self-consuming retraining loop drives the model to converge toward a distribution that optimizes those preferences. However, in practice, data curation is often noisy or adversarially manipulated. For example, competing platforms may recruit malicious users to adversarially curate data and disrupt rival models. In this paper, we study how generative models evolve under self-consuming retraining loops with noisy and adversarially curated data. We theoretically analyze the impact of such noisy data curation on generative models and identify conditions for the robustness and stability of the retraining process. Building on this analysis, we design attack algorithms for competitive adversarial scenarios, where a platform with a limited budget employs malicious users to misalign a rival’s model from actual user preferences. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed algorithms.
Paperid:1769
Authors:Hang Gao · Huang Wenxuan · Fengge Wu · Zhao Junsuo · Changwen Zheng · Huaping Liu
Abstract:
The use of large language models (LLMs) as feature enhancers to optimize node representations, which are then used as inputs for graph neural networks (GNNs), has shown significant potential in graph representation learning. However, the fundamental properties of this approach remain underexplored. To address this, we conduct a more in-depth analysis based on the interchange intervention method. First, we construct a synthetic graph dataset with controllable causal relationships, enabling precise manipulation of semantic relationships and causal modeling to provide data for analysis. Using this dataset, we conduct interchange interventions to examine the deeper properties of LLM enhancers and GNNs, uncovering their underlying logic and internal mechanisms. Building on the analytical results, we design a plug-and-play optimization module to improve the information transfer between LLM enhancers and GNNs. Experiments across multiple datasets and models validate the proposed module.
Paperid:1770
Authors:Zhihui Xie · Jie chen · Liyu Chen · Weichao Mao · Jingjing Xu · Lingpeng Kong
Abstract:
Teaching large language models (LLMs) to critique and refine their outputs is crucial for building systems that can iteratively improve, yet it is fundamentally limited by the ability to provide *accurate judgments* and *actionable suggestions*. In this work, we study LLM critics for code generation and propose $\texttt{CTRL}$, a framework for $\texttt{C}$ritic $\texttt{T}$raining via $\texttt{R}$einforcement $\texttt{L}$earning, which trains a critic model to generate feedback that maximizes correction performance for a fixed generator model without human supervision. Our results demonstrate that critics trained with $\texttt{CTRL}$ significantly enhance pass rates and mitigate compounding errors across both base and stronger generator models. Furthermore, we show that these critic models act as accurate generative reward models and enable test-time scaling through iterative critique-revision, achieving up to 106.1\% relative improvements across challenging code generation benchmarks.
Paperid:1771
Authors:Nian Liu · Xiaoxin He · Thomas Laurent · Francesco Di Giovanni · Michael Bronstein · Xavier Bresson
Abstract:
Spectral graph convolution, an important tool of data filtering on graphs, relies on two essential decisions: selecting spectral bases for signal transformation and parameterizing the kernel for frequency analysis. While recent techniques mainly focus on the standard Fourier transform and vector-valued spectral functions, they fall short in the flexibility to model signal distributions over large spatial ranges and in the capacity of the spectral function. In this paper, we present a novel wavelet-based graph convolution network, namely WaveGC, which integrates multi-resolution spectral bases and a matrix-valued filter kernel. Theoretically, we establish that WaveGC can effectively capture and decouple short-range and long-range information, providing superior filtering flexibility and surpassing existing graph wavelet neural networks. To instantiate WaveGC, we introduce a novel technique for learning general graph wavelets by separately combining odd and even terms of Chebyshev polynomials. This approach strictly satisfies the wavelet admissibility criteria. Our numerical experiments showcase consistent improvements in both short-range and long-range tasks, underscoring the effectiveness of the proposed model in handling different scenarios. Our code is available at https://anonymous.4open.science/r/WaveGC.
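The wavelet construction rests on Chebyshev polynomials of a rescaled graph Laplacian; a minimal sketch of the recurrence and the odd/even split follows (the learnable combination of terms and the admissibility machinery are omitted):

```python
import numpy as np

def chebyshev_basis(L_resc, K):
    # T_0 .. T_{K-1} of a rescaled Laplacian (eigenvalues in [-1, 1]) via the
    # three-term recurrence T_k = 2 L T_{k-1} - T_{k-2}.
    n = L_resc.shape[0]
    T = [np.eye(n), L_resc.copy()]
    while len(T) < K:
        T.append(2.0 * L_resc @ T[-1] - T[-2])
    return T[:K]

# toy graph: a path on 4 nodes, symmetric-normalized Laplacian
A = np.diag(np.ones(3), 1); A = A + A.T
d = A.sum(1)
L = np.eye(4) - A / np.sqrt(np.outer(d, d))
T = chebyshev_basis(L - np.eye(4), K=5)   # rescale: eigenvalues of L lie in [0, 2]
even_terms = T[0::2]                      # combined separately from the odd terms
odd_terms = T[1::2]
```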
Paperid:1772
Authors:Yucen Wang · Rui Yu · Shenghua Wan · Le Gan · De-Chuan Zhan
Abstract:
Foundation Models (FMs) and World Models (WMs) offer complementary strengths in task generalization at different levels. In this work, we propose FOUNDER, a framework that integrates the generalizable knowledge embedded in FMs with the dynamic modeling capabilities of WMs to enable open-ended decision-making in embodied environments in a reward-free manner. We learn a mapping function that grounds FM representations in the WM state space, effectively inferring the agent's physical states in the world simulator from external observations. This mapping enables the learning of a goal-conditioned policy through imagination during behavior learning, with the mapped task serving as the goal state. Our method leverages the predicted temporal distance to the goal state as an informative reward signal. FOUNDER demonstrates superior performance on various multi-task offline visual control benchmarks, excelling in capturing the deep-level semantics of tasks specified by text or videos, particularly in scenarios involving complex observations or domain gaps where prior methods struggle. The consistency of our learned reward function with the ground-truth reward is also empirically validated.
Paperid:1773
Authors:Benjamin Fuhrer · Chen Tessler · Gal Dalal
Abstract:
We present Gradient Boosting Reinforcement Learning (GBRL), a framework that adapts the strengths of Gradient Boosting Trees (GBT) to reinforcement learning (RL) tasks. While neural networks (NNs) have become the de facto choice for RL, they face significant challenges with structured and categorical features and tend to generalize poorly to out-of-distribution samples. These are challenges for which GBTs have traditionally excelled in supervised learning. However, GBT's application in RL has been limited. The design of traditional GBT libraries is optimized for static datasets with fixed labels, making them incompatible with RL's dynamic nature, where both state distributions and reward signals evolve during training. GBRL overcomes this limitation by continuously interleaving tree construction with environment interaction. Through extensive experiments, we demonstrate that GBRL outperforms NNs in domains with structured observations and categorical features, while maintaining competitive performance on standard continuous control benchmarks. Like its supervised learning counterpart, GBRL demonstrates superior robustness to out-of-distribution samples and better handles irregular state-action relationships.
Paperid:1774
Authors:Armando Bellante · Martin Plávala · Alessandro Luongo
Abstract:
This paper proposes a family of permutation-invariant graph embeddings, generalizing the Skew Spectrum of graphs introduced by Kondor & Borgwardt (2008). Grounded in group theory and harmonic analysis, our method introduces a new class of graph invariants that are isomorphism-invariant and capable of embedding richer graph structures, including attributed graphs, multilayer graphs, and hypergraphs, which the Skew Spectrum could not handle. Our generalization further defines a family of functions that enables a trade-off between computational complexity and expressivity. By applying generalization-preserving heuristics to this family, we improve the Skew Spectrum's expressivity at the same computational cost. We formally prove the invariance of our generalization, demonstrate its improved expressiveness through experiments, and discuss its efficient computation in the heuristic setting.
Paperid:1775
Authors:XUCHEN · Yuxuan Yue · Dawei Yang · Zukang Xu · Xing Hu · Jiangyong Yu · Zhixuan Chen · Sifan Zhou · Zhihang Yuan
Abstract:
RWKV is a modern RNN architecture with comparable performance to Transformer, but still faces challenges when deployed to resource-constrained devices. Post Training Quantization (PTQ), an essential technique to reduce model size and inference latency, has been widely used in Transformer models. However, it suffers significant degradation of performance when applied to RWKV. This paper investigates and identifies two key constraints inherent in the properties of RWKV: (1) Non-linear operators hinder the parameter-fusion of both smooth- and rotation-based quantization, introducing extra computation overhead. (2) The larger amount of uniformly distributed weights poses challenges for cluster-based quantization, leading to reduced accuracy. To this end, we propose RWKVQuant, a PTQ framework tailored for RWKV models, consisting of two novel techniques: (1) a coarse-to-fine proxy capable of adaptively selecting different quantization approaches by assessing the uniformity and identifying outliers in the weights, and (2) a codebook optimization algorithm that enhances the performance of cluster-based quantization methods for element-wise multiplication in RWKV. Experiments show that RWKVQuant can quantize RWKV-6-14B into about 3-bit with less than 1\% accuracy loss and 2.14$\times$ speed up.
Paperid:1776
Authors:Alessandro Pierro · Steven Abreu · Jonathan Timcheck · Philipp Stratmann · Andreas Wild · Sumit Shrestha
Abstract:
Linear recurrent neural networks enable powerful long-range sequence modeling with constant memory usage and time-per-token during inference. These architectures hold promise for streaming applications at the edge, but deployment in resource-constrained environments requires hardware-aware optimizations to minimize latency and energy consumption. Unstructured sparsity offers a compelling solution, enabling substantial reductions in compute and memory requirements when accelerated by compatible hardware platforms. In this paper, we conduct a scaling study to investigate the Pareto front of performance and efficiency across inference compute budgets. We find that highly sparse linear RNNs *consistently* achieve better efficiency-performance trade-offs than dense baselines, with $2\times$ less compute and $36$\% less memory at iso-accuracy. Our models achieve state-of-the-art results on a real-time streaming task for audio denoising. By quantizing our sparse models to fixed-point arithmetic and deploying them on the Intel Loihi 2 neuromorphic chip for real-time processing, we translate model compression into tangible gains of $42\times$ lower latency and $149\times$ lower energy consumption compared to a dense model on an edge GPU. Our findings showcase the transformative potential of unstructured sparsity, paving the way for highly efficient recurrent neural networks in real-world, resource-constrained environments.
Paperid:1777
Authors:Seongheon Park · Xuefeng Du · Min-Hsuan Yeh · Haobo Wang · Sharon Li
Abstract:
Hallucinations in LLMs pose a significant concern to their safe deployment in real-world applications. Recent approaches have leveraged the latent space of LLMs for hallucination detection, but their embeddings, optimized for linguistic coherence rather than factual accuracy, often fail to clearly separate truthful and hallucinated content. To this end, we propose the Truthfulness Separator Vector (TSV), a lightweight and flexible steering vector that reshapes the LLM’s representation space during inference to enhance the separation between truthful and hallucinated outputs, without altering model parameters. Our two-stage framework first trains TSV on a small set of labeled exemplars to form compact and well-separated clusters. It then augments the exemplar set with unlabeled LLM generations, employing an optimal transport-based algorithm for pseudo-labeling combined with a confidence-based filtering process. Extensive experiments demonstrate that TSV achieves state-of-the-art performance with minimal labeled data, exhibiting strong generalization across datasets and providing a practical solution for real-world LLM applications.
Paperid:1778
Authors:Roie Reshef · Kfir Levy
Abstract:
This paper addresses the challenge of achieving Differential Privacy (DP) in Federated Learning (FL) under the partial-participation setting, where each machine participates in only some of the training rounds. While earlier work achieved optimal performance and efficiency in full-participation scenarios, these methods could not extend effectively to cases with partial participation. Our approach addresses this gap by introducing a novel noise-cancellation mechanism that ensures privacy without compromising convergence rates or computational efficiency. We analyze our method within the Stochastic Convex Optimization (SCO) framework and demonstrate that it achieves optimal performance for both homogeneous and heterogeneous data distributions. This work broadens the applicability of DP in FL, providing a practical and efficient solution for privacy-preserving learning in distributed systems with partial participation.
Paperid:1779
Authors:Yi-Fan Zhang · Tao Yu · Haochen Tian · Chaoyou Fu · Peiyan Li · Jianshu Zeng · Wulin Xie · Yang Shi · Huanyu Zhang · Junkang Wu · xue wang · Yibo Hu · Bin Wen · Tingting Gao · Zhang Zhang · Fan Yang · Di ZHANG · Liang Wang · Rong Jin
Abstract:
Existing efforts to align multimodal large language models (MLLMs) with human preferences have only achieved progress in narrow areas, such as hallucination reduction, but remain limited in practical applicability and generalizability. To this end, we introduce MM-RLHF, a dataset containing 120k fine-grained, human-annotated preference comparison pairs. This dataset represents a substantial advancement over existing resources, offering superior size, diversity, annotation granularity, and quality. Leveraging this dataset, we propose several key innovations to improve both the quality of reward models and the efficiency of alignment algorithms. Notably, we introduce the Critique-Based Reward Model, which generates critiques of model outputs before assigning scores, offering enhanced interpretability and more informative feedback compared to traditional scalar reward mechanisms. Additionally, we propose Dynamic Reward Scaling, a method that adjusts the loss weight of each sample according to the reward signal, thereby optimizing the use of high-quality comparison pairs. Our approach is rigorously evaluated across 10 distinct dimensions, encompassing 27 benchmarks, with results demonstrating significant and consistent improvements in model performance (Figure 1).
Paperid:1780
Authors:Hiroshi Sawada · Kazuo Aoyama · Yuya Hikima
Abstract:
This paper proposes a novel concept of natural perturbations for black-box training of neural networks by zeroth-order optimization. When a neural network is implemented directly in hardware, training its parameters by backpropagation ends up with an inaccurate result due to the lack of detailed internal information. We instead employ zeroth-order optimization, where the sampling of parameter perturbations is of great importance. The sampling strategy we propose maximizes the entropy of perturbations with a regularization that the probability distribution conditioned by the neural network does not change drastically, inheriting the concept of the natural gradient. Experimental results show the superiority of our proposal on diverse datasets, tasks, and architectures.
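For context, the generic two-point zeroth-order gradient estimator with Gaussian perturbations that such methods build on can be sketched as follows (the paper's entropy-maximizing, natural-gradient-inspired perturbation sampling is not reproduced here):

```python
import numpy as np

def zo_gradient(f, theta, sigma=0.01, n_samples=1000, rng=None):
    # Two-point zeroth-order estimate: probe f along random Gaussian
    # directions u and average the finite-difference slopes times u.
    rng = rng or np.random.default_rng()
    g = np.zeros_like(theta)
    for _ in range(n_samples):
        u = rng.standard_normal(theta.shape)
        g += (f(theta + sigma * u) - f(theta - sigma * u)) / (2.0 * sigma) * u
    return g / n_samples

f = lambda w: float(np.sum(w ** 2))       # true gradient at theta is 2 * theta
theta = np.array([1.0, -2.0, 0.5])
g = zo_gradient(f, theta, n_samples=2000, rng=np.random.default_rng(0))
```

Only function evaluations of f are needed, which is what makes the approach applicable to hardware-implemented networks whose internals cannot be differentiated.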
Paperid:1781
Authors:Wendong Zheng · Junyang Chen · Husheng Guo · Wenjian Wang
Abstract:
Recently, the potential of lightweight models for resource-constrained scenarios has garnered significant attention, particularly in safety-critical tasks such as bio-electrical signal classification and B-ultrasound-assisted diagnosis. These tasks are frequently affected by environmental noise due to patient movement artifacts and inherent device noise, which pose significant challenges for lightweight models (e.g., deep binary neural networks (DBNNs)) to perform robust inference. A pertinent question arises: can a well-trained DBNN effectively resist environmental noise during inference? In this study, we find that the DBNN's robustness vulnerability comes from the binary weights and scaling factors. Drawing upon theoretical insights, we propose L1-infinity norm constraints for the binary weights and scaling factors, which yield a tighter upper bound compared to existing state-of-the-art (SOTA) methods. Finally, visualization studies show that our approach introduces minimal noise perturbations at the periphery of the feature maps. Our approach outperforms the SOTA methods, as validated by several experiments conducted on bio-electrical and image classification datasets. We hope our findings can raise awareness among researchers about the environmental-noise robustness of DBNNs.
Paperid:1782
Authors:Liyuan Liang · Xinyi Chen · Gan Luo · Kun Yuan
Abstract:
A key challenge in decentralized optimization is determining the optimal convergence rate and designing algorithms to achieve it. While this problem has been extensively addressed for doubly-stochastic and column-stochastic mixing matrices, the row-stochastic scenario remains unexplored. This paper bridges this gap by introducing effective metrics to capture the influence of row-stochastic mixing matrices and establishing the first convergence lower bound for decentralized learning over row-stochastic networks. However, existing algorithms fail to attain this lower bound due to two key issues: deviation of the descent direction from the globally averaged gradient and instability introduced by the PULL-DIAG protocol. To address the descent deviation, we propose a novel analysis framework demonstrating that PULL-DIAG-GT achieves linear speedup, the first such result for row-stochastic decentralized optimization. Moreover, by incorporating a multi-step gossip (MG) protocol, we resolve the instability issue and attain the lower bound, achieving optimal complexity for decentralized optimization over row-stochastic networks.
Paperid:1783
Authors:Peter Blohm · Patrick Indri · Thomas Gärtner · SAGAR MALHOTRA
Abstract:
We propose and investigate probabilistic guarantees for the adversarial robustness of classification algorithms. While traditional formal verification approaches for robustness are intractable and sampling-based approaches do not provide formal guarantees, our approach is able to efficiently certify a probabilistic relaxation of robustness. The key idea is to sample an $\epsilon$-net and invoke a local robustness oracle on the sample. Remarkably, the size of the sample needed to achieve probably approximately global robustness guarantees is independent of the input dimensionality, the number of classes, and the learning algorithm itself. Our approach can, therefore, be applied even to large neural networks that are beyond the scope of traditional formal verification. Experiments empirically confirm that it characterizes robustness better than state-of-the-art sampling-based approaches and scales better than formal methods.
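The dimension-free flavor of such sample bounds can be seen from the standard one-sided PAC calculation (a sketch under textbook assumptions; the paper's exact constants and guarantee may differ):

```python
import math

def sample_size(eps, delta):
    # If a local-robustness oracle accepts all of n i.i.d. samples with
    # n >= ln(1/delta) / ln(1/(1 - eps)), then with probability >= 1 - delta
    # the probability mass of non-robust inputs is at most eps. Note the
    # bound involves neither input dimension, number of classes, nor model.
    return math.ceil(math.log(1.0 / delta) / -math.log(1.0 - eps))

n = sample_size(eps=0.01, delta=1e-3)   # a few hundred samples suffice
```

Tightening eps or delta only grows the bound logarithmically in 1/delta and roughly linearly in 1/eps.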
Paperid:1784
Authors:Juwei Yue · Haikuo Li · Jiawei Sheng · Xiaodong Li · Taoyu Su · Tingwen Liu · Li Guo
Abstract:
Graph neural networks (GNNs) leverage message passing mechanisms to learn the topological features of graph data. Traditional GNNs learn node features in a spatial domain unrelated to the topology, which can hardly guarantee that topological features are captured. In this paper, we formulate message passing as a system of hyperbolic partial differential equations (hyperbolic PDEs), constituting a dynamical system that explicitly maps node representations into a particular solution space. This solution space is spanned by a set of eigenvectors describing the topological structure of graphs. Within this system, at any moment in time, a node's features can be decomposed into a superposition of this basis of eigenvectors. This not only enhances the interpretability of message passing but also enables the explicit extraction of fundamental characteristics of the topological structure. Furthermore, by solving this system of hyperbolic PDEs, we establish a connection with spectral graph neural networks (spectral GNNs), serving as a message passing enhancement paradigm for spectral GNNs. We further introduce polynomials to approximate arbitrary filter functions. Extensive experiments demonstrate that the paradigm of hyperbolic PDEs not only exhibits strong flexibility but also significantly enhances the performance of various spectral GNNs across diverse graph tasks.
Paperid:1785
Authors:Xiyue Zhu · Dou Kwark · Ruike Zhu · Kaiwen Hong · Yiqi Tao · Shirui Luo · Yudu Li · Zhi-Pei Liang · Volodymyr Kindratenko
Abstract:
In volume-to-volume translations in medical images, existing models often struggle to capture the inherent volumetric distribution using 3D voxel-space representations, due to high computational requirements and large dataset demands. We present Score-Fusion, a novel volumetric translation model that effectively learns 3D representations by ensembling perpendicularly trained 2D diffusion models in score function space. By carefully initializing our model to start with an average of 2D models as in TPDM, we reduce 3D training to a fine-tuning process and thereby mitigate both computational and data demands. Furthermore, we explicitly design the 3D model's hierarchical layers to learn ensembles of 2D features, further enhancing efficiency and performance. Moreover, Score-Fusion naturally extends to multi-modality settings, by fusing diffusion models conditioned on different inputs for flexible, accurate integration. We demonstrate that 3D representation is essential for better performance in downstream recognition tasks, such as tumor segmentation, where most segmentation models are based on 3D representation. Extensive experiments demonstrate that Score-Fusion achieves superior accuracy and volumetric fidelity in 3D medical image super-resolution and modality translation. Beyond these improvements, our work also provides broader insight into learning-based approaches for score function fusion.
Paperid:1786
Authors:Huang Huang · Fangchen Liu · Letian Fu · Tingfan Wu · Mustafa Mukadam · Jitendra Malik · Ken Goldberg · Pieter Abbeel
Abstract:
Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions. Existing approaches require fine-tuning pre-trained vision-language models (VLMs) as visual and language features are independently fed into downstream policies, degrading the pre-trained semantic alignments. We propose OTTER, a novel VLA architecture that leverages these existing alignments through explicit, text-aware visual feature extraction. Instead of processing all visual features, OTTER selectively extracts and passes only task-relevant visual features that are semantically aligned with the language instruction to the policy transformer. This allows OTTER to keep the pre-trained vision-language encoders frozen. Thereby, OTTER preserves and utilizes the rich semantic understanding learned from large-scale pre-training, enabling strong zero-shot generalization capabilities. In simulation and real-world experiments, OTTER significantly outperforms existing VLA models, demonstrating strong zero-shot generalization to novel objects and environments.
Paperid:1787
Authors:Moshe Eliasof · Alessio Gravina · Andrea Ceni · Claudio Gallicchio · Davide Bacciu · Carola-Bibiane Schönlieb
Abstract:
Graph State Space Models (SSMs) have recently been introduced to enhance Graph Neural Networks (GNNs) in modeling long-range interactions. Despite their success, existing methods either compromise on permutation equivariance or limit their focus to pairwise interactions rather than sequences. Building on the connection between Autoregressive Moving Average (ARMA) and SSM, in this paper, we introduce GRAMA, a Graph Adaptive method based on a learnable ARMA framework that addresses these limitations. By transforming from static to sequential graph data, GRAMA leverages the strengths of the ARMA framework, while preserving permutation equivariance. Moreover, GRAMA incorporates a selective attention mechanism for dynamic learning of ARMA coefficients, enabling efficient and flexible long-range information propagation. We also establish theoretical connections between GRAMA and Selective SSMs, providing insights into its ability to capture long-range dependencies. Extensive experiments on 22 synthetic and real-world datasets demonstrate that GRAMA consistently outperforms backbone models and performs competitively with state-of-the-art methods.
Paperid:1788
Authors:Zhengzheng Lou · Ke Zhang · Yucong Wu · Shizhe Hu
Abstract:
In an era of increasingly diverse information sources, multi-modal clustering (MMC) has become a key technology for processing multi-modal data. It can apply and integrate the feature information and potential relationships of different modalities. Although there is a wealth of research on MMC, due to the complexity of datasets, a major challenge remains in how to deeply explore the complex latent information and interdependencies between modalities. To address this issue, this paper proposes a method called super deep contrastive information bottleneck for multi-modal clustering (SDCIB), which aims to explore and utilize all types of latent information to the fullest extent. Specifically, the proposed SDCIB explicitly introduces the rich information contained in the encoder's hidden layers into the loss function for the first time, thoroughly mining both modal features and the hidden relationships between modalities. Moreover, by performing dual optimization that simultaneously considers consistency information from both the feature distribution and clustering assignment perspectives, the proposed SDCIB significantly improves clustering accuracy and robustness. We conduct experiments on 4 multi-modal datasets, with the results demonstrating the superiority and clever design of the proposed SDCIB.
Paperid:1789
Authors:Nikita Gushchin · David Li · Daniil Selikhanovych · Evgeny Burnaev · Dmitry Baranchuk · Aleksandr Korotin
Abstract:
Learning diffusion bridge models is easy; making them fast and practical is an art. Diffusion bridge models (DBMs) are a promising extension of diffusion models for applications in image-to-image translation. However, like many modern diffusion and flow models, DBMs suffer from the problem of slow inference. To address it, we propose a novel distillation technique based on the inverse bridge matching formulation and derive the tractable objective to solve it in practice. Unlike previously developed DBM distillation techniques, the proposed method can distill both conditional and unconditional types of DBMs, distill models into a one-step generator, and use only the corrupted images for training. We evaluate our approach for both conditional and unconditional types of bridge matching on a wide set of setups, including super-resolution, JPEG restoration, sketch-to-image, and other tasks, and show that our distillation technique allows us to accelerate the inference of DBMs from 4x to 100x and even provide better generation quality than the teacher model used, depending on the particular setup.
Paperid:1790
Authors:Lily Zhang · Rajesh Ranganath
Abstract:
There has been an explosion of research on the topic of learning from preference data, following the success of RLHF in high-profile language model training efforts. However, preference learning is far from standardized, making it difficult to understand where to focus efforts with respect to current practice or research investment. This work presents a framework to understand preference learning from the ground up, starting from the sampling distribution of pairwise preference comparison data. First, we show that the only evaluation of a generative model that respects both preferences and prevalences in the preference data sampling distribution is win rate. This result suggests that everything should be understood through win rate. Thus, we characterize the space of preference learning into win rate optimization (WRO) and non-WRO objectives, highlighting the theoretical benefits of examples in the former and limitations of examples in the latter. We additionally conduct an empirical analysis of WRO and non-WRO objectives which highlights the practical importance of ease and success of optimization. Our analysis provides insights for existing practice as well as concrete guidance for future research---namely, future efforts should focus on either developing objectives more closely aligned with WRO or improving the optimization of WRO objectives.
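The win-rate evaluation that this abstract centers on can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the judge function here (preferring the longer completion) is a hypothetical stand-in for a learned preference model or human annotator.

```python
def win_rate(model_a_outputs, model_b_outputs, prefer):
    """Estimate the win rate of model A over model B on paired outputs.

    prefer(y_a, y_b) returns 1 if y_a is preferred, 0 otherwise.
    The estimate averages pairwise preference judgments, so it respects
    both the preferences and the prevalence of prompts in the sample.
    """
    wins = sum(prefer(a, b) for a, b in zip(model_a_outputs, model_b_outputs))
    return wins / len(model_a_outputs)

# Toy judge: prefer the longer completion (a stand-in for a real judge).
prefer_longer = lambda a, b: 1 if len(a) > len(b) else 0

a_out = ["a detailed answer", "short", "an elaborate reply"]
b_out = ["ok", "a much longer response here", "hm"]
print(win_rate(a_out, b_out, prefer_longer))  # 2 wins out of 3
```

A WRO objective would then adjust model A's sampling distribution to increase this quantity directly, whereas non-WRO objectives optimize surrogates that need not move it.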
Paperid:1791
Authors:Marten Lienen · Abdullah Saydemir · Stephan Günnemann
Abstract:
State space models are emerging as a dominant model class for sequence problems with many relying on the HiPPO framework to initialize their dynamics. However, HiPPO fundamentally assumes data to be noise-free; an assumption often violated in practice. We extend the HiPPO theory with measurement noise and derive an uncertainty-aware initialization for state space model dynamics. In our analysis, we interpret HiPPO as a linear stochastic control problem where the data enters as a noise-free control signal. We then reformulate the problem so that the data become noisy outputs of a latent system and arrive at an alternative dynamics initialization that infers the posterior of this latent system from the data without increasing runtime. Our experiments show that our initialization improves the resistance of state-space models to noise both at training and inference time.
Paperid:1792
Authors:Charlie Tan · Joey Bose · Chen Lin · Leon Klein · Michael Bronstein · Alexander Tong
Abstract:
Scalable sampling of molecular states in thermodynamic equilibrium is a long-standing challenge in statistical physics. Boltzmann generators tackle this problem by pairing normalizing flows (NF) with importance sampling to obtain statistically independent samples under the target distribution. In this paper, we propose Sequential Boltzmann Generators (SBG), in which the importance sampling of Boltzmann generators is replaced with Sequential Monte Carlo (SMC). Notably, this enables the transport of proposal samples towards the target distribution, beyond the reweighting of standard Boltzmann generators. This transport process facilitates inference-time scaling through finer time discretizations. SBG necessitates efficient computation of sample likelihoods, leading us to employ an efficient general-purpose Transformer-based normalizing flow operating directly on all-atom Cartesian coordinates. SBG is more computationally efficient as no explicit equivariance constraint is encoded in the NF, but is instead softly introduced via data augmentations. SBG enjoys guaranteed sample improvement by applying SMC along the transport path, leading to higher sampling efficiency. SBG achieves state-of-the-art performance w.r.t. all metrics on molecular systems, demonstrating the first equilibrium sampling in Cartesian coordinates of tri-, tetra-, and hexapeptides that were so far intractable for prior Boltzmann generators.
Paperid:1793
Authors:Simone Drago · Marco Mussi · Alberto Maria Metelli
Abstract:
In the standard Reinforcement Learning (RL) paradigm, the action space is assumed to be fixed and immutable throughout the learning process. However, in many real-world scenarios, not all actions are available at every decision stage. The available action set may depend on the current environment state, domain-specific constraints, or other (potentially stochastic) factors outside the agent's control. To address these realistic scenarios, we introduce a novel paradigm called *Sleeping Reinforcement Learning*, where the available action set varies during the interaction with the environment. We start with the simpler scenario in which the available action sets are revealed at the beginning of each episode. We show that a modification of UCBVI achieves regret of order $\widetilde{\mathcal{O}}(H\sqrt{SAT})$, where $H$ is the horizon, $S$ and $A$ are the cardinalities of the state and action spaces, respectively, and $T$ is the learning horizon. Next, we address the more challenging and realistic scenario in which the available actions are disclosed only at each decision stage. By leveraging a novel construction, we establish a minimax lower bound of order $\Omega(\sqrt{T 2^{A/2}})$ when the availability of actions is governed by a Markovian process, establishing a statistical barrier of the problem. Focusing on the statistically tractable case where action availability depends only on the current state and stage, we propose a new optimistic algorithm that achieves regret guarantees of order $\widetilde{\mathcal{O}}(HA\sqrt{ST})$.
Paperid:1794
Authors:Peiyuan Zhang · Yongqi Chen · Runlong Su · Hangliang Ding · Ion Stoica · Zhengzhong Liu · Hao Zhang
Abstract:
Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost -- when generating just a 5-second 720P video, attention alone takes 800 out of 950 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained video diffusion models predominantly concentrate within localized 3D windows. By sliding and attending over local spatial-temporal regions, STA eliminates redundancy from full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile-by-tile with a novel hardware-aware sliding window design, preserving expressiveness while being \emph{hardware-efficient}. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention implementation, achieving 58.79\% MFU -- 7.17× faster than prior art. On the leading video DiT model, Hunyuan, it accelerates attention by 1.6–10x over FlashAttention-3, yielding a 1.36–3.53× end-to-end speedup with no or minimal quality loss.
Paperid:1795
Authors:Xinyan Liang · Shijie Wang · Yuhua Qian · Qian Guo · Liang Du · Bingbing Jiang · Tingjin Luo · Feijiang Li
Abstract:
Multi-view classification (MVC) based on the Dempster-Shafer theory has gained significant recognition for its reliability in safety-critical applications. However, existing methods predominantly focus on providing confidence levels for decision outcomes without explaining the reasoning behind these decisions. Moreover, the reliance on first-order statistical magnitudes of belief masses often inadequately captures the intrinsic uncertainty within the evidence. To address these limitations, we propose a novel framework termed Trusted Multi-view Classification Constrained with Expert Knowledge (TMCEK). TMCEK integrates expert knowledge to enhance feature-level interpretability and introduces a distribution-aware subjective opinion mechanism to derive more reliable and realistic confidence estimates. The theoretical superiority of the proposed uncertainty measure over conventional approaches is rigorously established. Extensive experiments conducted on three multi-view datasets for sleep stage classification demonstrate that TMCEK achieves state-of-the-art performance while offering interpretability at both the feature and decision levels. These results position TMCEK as a robust and interpretable solution for MVC in safety-critical domains. (The code will be made publicly available.)
Paperid:1796
Authors:Xupeng Zhu · Fan Wang · Robin Walters · Jane Shi
Abstract:
Diffusion Policies are effective at learning closed-loop manipulation policies from human demonstrations but generalize poorly to novel arrangements of objects in 3D space, hurting real-world performance. To address this issue, we propose Spherical Diffusion Policy (SDP), an SE(3) equivariant diffusion policy that adapts trajectories according to 3D transformations of the scene. Such equivariance is achieved by embedding the states, actions, and the denoising process in spherical Fourier space. Additionally, we employ novel spherical FiLM layers to condition the action denoising process equivariantly on the scene embeddings. Lastly, we propose a spherical denoising temporal U-net that achieves spatiotemporal equivariance with computational efficiency. In the end, SDP is end-to-end SE(3) equivariant, allowing robust generalization across transformed 3D scenes. SDP demonstrates a large performance improvement over strong baselines in 20 simulation tasks and 5 physical robot tasks including single-arm and bi-manual embodiments.
Paperid:1797
Authors:Haihan Zhang · Weicheng Lin · Yuanshi Liu · Cong Fang
Abstract:
This paper considers a canonical problem in kernel regression: how well does a model perform when trained by the popular online first-order algorithms, compared to offline ones such as ridge and ridgeless regression? We analyze the foundational single-pass Stochastic Gradient Descent (SGD) in kernel regression under various source conditions where the optimal predictor may not even belong to the RKHS, i.e. the model is misspecified. Specifically, we focus on the inner product kernel over the sphere and characterize the exact orders of the excess risk curves under different scales of sample sizes $n$ concerning the input dimension $d$. Surprisingly, we show that SGD achieves minimax optimal rates up to constants among all the scales, without suffering the saturation, a prevalent phenomenon observed in (ridge) regression, except when the model is highly misspecified and the learning is in a final stage where $n \gg d^\gamma$ for any constant $\gamma > 0$. The main reason SGD overcomes the curse of saturation is the exponentially decaying step size schedule, a common practice in deep neural network training. As a byproduct, we provide the first provable advantage of this scheme over the iterate averaging method in the common setting.
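The single-pass kernel SGD with a stagewise exponentially decaying step size that this abstract discusses can be sketched as follows. This is an illustrative toy, not the paper's analysis setup: the stage count, base step size, and the linear inner-product kernel are assumptions chosen for a runnable example.

```python
import numpy as np

def kernel_sgd(X, y, kernel, eta0=0.5, stages=5):
    """Single-pass SGD for kernel regression, with the step size
    halved at each of `stages` equal-length phases (an exponentially
    decaying schedule). Returns dual coefficients alpha, so that the
    fitted predictor is f(x) = sum_j alpha[j] * kernel(x_j, x)."""
    n = len(y)
    alpha = np.zeros(n)
    stage_len = max(1, n // stages)
    for t in range(n):
        eta = eta0 * 0.5 ** (t // stage_len)
        pred = alpha[:t] @ kernel(X[:t], X[t]) if t else 0.0
        alpha[t] = eta * (y[t] - pred)  # stochastic gradient step on sample t
    return alpha

# Inner-product kernel, with inputs normalized to the sphere.
dot_kernel = lambda A, x: A @ x

rng = np.random.default_rng(0)
d, n = 5, 400
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # points on the sphere
w_star = rng.normal(size=d)
y = X @ w_star + 0.1 * rng.normal(size=n)       # noisy linear target

alpha = kernel_sgd(X, y, dot_kernel)
w_hat = alpha @ X                               # recover the linear predictor
print(np.linalg.norm(w_hat - w_star))           # small residual error
```

The large early step sizes make fast initial progress, while the geometric decay shrinks the noise floor in the final stages; this schedule is what the paper credits for avoiding saturation.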
Paperid:1798
Authors:Henry Moss · Sebastian Ober · Tom Diethe
Abstract:
Bayesian optimisation in the latent space of a VAE is a powerful framework for optimisation tasks over complex structured domains, such as the space of valid molecules. However, existing approaches tightly couple the surrogate and generative models, which can lead to suboptimal performance when the latent space is not tailored to specific tasks and has in turn led to the proposal of increasingly sophisticated algorithms. In this work, we explore a new direction, instead proposing a decoupled approach that trains a generative model and a GP surrogate separately, then combines them via a simple yet principled Bayesian update rule. This separation allows each component to focus on its strengths: structure generation from the VAE and predictive modelling by the GP. We show that our decoupled approach improves our ability to identify high-potential candidates in molecular optimisation problems under constrained evaluation budgets.
Paperid:1799
Authors:MYUNG-SIK CHO · Jong Eui Park · Jeonghye Kim · Youngchul Sung
Abstract:
Multi-task reinforcement learning (RL) encounters significant challenges due to varying task complexities and reward distributions from the environment. To address these issues, in this paper, we propose Adaptive Reward Scaling (ARS), a novel framework that dynamically adjusts reward magnitudes and leverages a periodic network reset mechanism. ARS introduces a history-based reward scaling strategy that ensures balanced reward distributions across tasks, enabling stable and efficient training. The reset mechanism complements this approach by mitigating overfitting and ensuring robust convergence. Empirical evaluations on the Meta-World benchmark demonstrate that ARS significantly outperforms baseline methods, achieving superior performance on challenging tasks while maintaining overall learning efficiency. These results validate ARS's effectiveness in tackling diverse multi-task RL problems, paving the way for scalable solutions in complex real-world applications.
Paperid:1800
Authors:Yu-Neng Chuang · Helen Zhou · Prathusha Sarma · Parikshit Gopalan · John Boccio · Sara Bolouki · Xia Hu
Abstract:
Large language models (LLMs) have demonstrated impressive performance on several tasks and are increasingly deployed in real-world applications. However, especially in high-stakes settings, it becomes vital to know when the output of an LLM may be unreliable. Depending on whether an answer is trustworthy, a system can then choose to route the question to another expert, or otherwise fall back on a safe default behavior. In this work, we study the extent to which LLMs can reliably indicate confidence in their answers, and how this notion of confidence can translate into downstream accuracy gains. We propose Self-REF, a lightweight training strategy to teach LLMs to express confidence in whether their answers are correct in a reliable manner. Self-REF introduces confidence tokens into the LLM, from which a confidence score can be extracted. Compared to conventional approaches such as verbalizing confidence and examining token probabilities, we demonstrate empirically that confidence tokens show significant improvements in downstream routing and rejection learning tasks.
Paperid:1801
Authors:Abhinav Agrawal · Justin Domke
Abstract:
Predictive posterior densities (PPDs) are essential in approximate inference for quantifying predictive uncertainty and comparing inference methods. Typically, PPDs are estimated by simple Monte Carlo (MC) averages. In this paper, we expose a critical under-recognized issue: the signal-to-noise ratio (SNR) of the simple MC estimator can sometimes be extremely low, leading to unreliable estimates. Our main contribution is a theoretical analysis demonstrating that even with exact inference, SNR can decay rapidly with an increase in (a) the mismatch between training and test data, (b) the dimensionality of the latent space, or (c) the size of test data relative to training data. Through several examples, we empirically verify these claims and show that these factors indeed lead to poor SNR and unreliable PPD estimates (sometimes, estimates are off by hundreds of nats even with a million samples). While not the primary focus, we also explore an adaptive importance sampling approach as an illustrative way to mitigate the problem, where we learn the proposal distribution by maximizing a variational proxy to the SNR. Taken together, our findings highlight an important challenge and provide essential insights for reliable estimation.
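The SNR collapse under train/test mismatch described above can be reproduced in a toy conjugate-Gaussian model. This is a hedged illustration, not the paper's experiment: the posterior, likelihood, sample sizes, and the two test points are all assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(1)

def ppd_mc(y_star, n_samples=1000):
    """Simple MC estimate of the predictive density p(y*) under a toy
    model: posterior theta ~ N(0, 1), likelihood y | theta ~ N(theta, 1).
    The estimator averages likelihood values over posterior draws."""
    theta = rng.normal(size=n_samples)
    lik = np.exp(-0.5 * (y_star - theta) ** 2) / np.sqrt(2 * np.pi)
    return lik.mean()

def empirical_snr(y_star, reps=200):
    """SNR of the MC estimator: mean over replicates divided by std."""
    ests = np.array([ppd_mc(y_star) for _ in range(reps)])
    return ests.mean() / ests.std()

snr_matched = empirical_snr(0.0)    # test point agrees with the posterior
snr_shifted = empirical_snr(5.0)    # test point far from the posterior mass
print(snr_matched, snr_shifted)     # the shifted SNR is far lower
```

For the shifted test point, nearly all posterior draws contribute negligible likelihood and the average is dominated by rare draws near y*, which is exactly the heavy-tailed behavior that makes simple MC estimates of PPDs unreliable under mismatch.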
Paperid:1802
Authors:Yuheng Lai · Leying Guan
Abstract:
Equalized odds, an important notion of algorithmic fairness, aims to ensure that sensitive variables, such as race and gender, do not unfairly influence the algorithm's prediction when conditioning on the true outcome. Despite rapid advancements, current research primarily focuses on equalized odds violations caused by a single sensitive attribute, leaving the challenge of simultaneously accounting for multiple attributes under-addressed. We bridge this gap by introducing an in-processing fairness-aware learning approach, FairICP, which integrates adversarial learning with a novel inverse conditional permutation scheme. FairICP offers a flexible and efficient scheme to promote equalized odds under fairness conditions described by complex and multi-dimensional sensitive attributes. The efficacy and adaptability of our method are demonstrated through both simulation studies and empirical analyses of real-world datasets.
Paperid:1803
Authors:Andrei Muresanu · Anvith Thudi · Michael Zhang · Nicolas Papernot
Abstract:
Modern machine learning models are expensive to train, and there is a growing concern about the challenge of retroactively removing specific training data. Achieving exact unlearning in deep learning pipelines—producing models as if certain data had never been included in training—remains an open problem. In this paper, we revisit exact unlearning in deep learning and show that for large language models (LLMs) we can efficiently exactly unlearn "fine-tuning data" (the data used to adapt a pre-trained model). This follows from two observations. First, we can use in-context learning to adapt the LLM to the fine-tuning dataset instead of SGD-based algorithms. Second, we show that accurate in-context learning can be done with quantized k-means, which allows for effectively constant time unlearning operations. Our empirical evaluation shows that this unlearning recipe has similar performance to fine-tuning alternatives, but vastly reduces the unlearning costs. Our study also highlights the need for new measures of unlearning cost when adapting the learning algorithm to have faster unlearn operations.
Paperid:1804
Authors:Junghyun Lee · Kyoungseok Jang · Kwang-Sung Jun · Milan Vojnovic · Se-Young Yun
Abstract:
We present GL-LowPopArt, a novel Catoni-style estimator for generalized linear low-rank trace regression. Building on LowPopArt (Jang et al., 2024), it employs a two-stage approach: nuclear norm regularization followed by matrix Catoni estimation. We establish state-of-the-art estimation error bounds, surpassing existing guarantees (Fan et al., 2019; Kang et al., 2022), and reveal a novel experimental design objective, $\mathrm{GL}(\pi)$. The key technical challenge is controlling bias from the nonlinear inverse link function, which we address by our two-stage approach. We prove a *local* minimax lower bound, showing that GL-LowPopArt enjoys instance-wise optimality up to the condition number of the ground-truth Hessian. Applications include generalized linear matrix completion, where GL-LowPopArt achieves a state-of-the-art Frobenius error guarantee, and **bilinear dueling bandits**, a novel setting inspired by general preference learning (Zhang et al., 2024). Our analysis of a GL-LowPopArt-based explore-then-commit algorithm reveals a new, potentially interesting problem-dependent quantity, along with an improved Borda regret bound compared to vectorization (Wu et al., 2024).
Paperid:1805
Authors:Yanbo Wang · Xiyuan Wang · Quan Gan · Minjie Wang · Qibin Yang · David Wipf · Muhan Zhang
Abstract:
We introduce Griffin, the first foundation model designed specifically for Relational Databases (RDBs). Unlike previous smaller models focused on single RDB tasks, Griffin unifies the input encoder and prediction heads to handle diverse tasks. Additionally, we enhance the architecture by incorporating a cross-attention module and a novel aggregator. Griffin utilizes pretraining on both single-table and RDB datasets, employing advanced encoders for categorical, numerical, and metadata features, along with innovative components such as cross-attention modules and enhanced message-passing neural networks (MPNNs) to capture the complexities of relational data. Evaluated on large-scale, heterogeneous, and temporal graphs extracted from RDBs across various domains (spanning over 150 million nodes), Griffin demonstrates superior or comparable performance to individually trained models, excels in low-data scenarios, and shows strong transferability with similarity and diversity in pretraining across new datasets and tasks, highlighting its potential as a universally applicable foundation model for RDBs.
Paperid:1806
Authors:Nurbek Tastan · Samuel Horváth · Karthik Nandakumar
Abstract:
Collaborative learning enables multiple participants to learn a single global model by exchanging focused updates instead of sharing data. One of the core challenges in collaborative learning is ensuring that participants are rewarded fairly for their contributions, which entails two key subproblems: contribution assessment and reward allocation. This work focuses on fair reward allocation, where the participants are incentivized through model rewards - differentiated final models whose performance is commensurate with the contribution. In this work, we leverage the concept of slimmable neural networks to collaboratively learn a shared global model whose performance degrades gracefully with a reduction in model width. We also propose a post-training fair allocation algorithm that determines the model width for each participant based on their contributions. We theoretically study the convergence of our proposed approach and empirically validate it using extensive experiments on different datasets and architectures. We also extend our approach to enable training-time model reward allocation.
Paperid:1807
Authors:Sumin Cho · Dongwon Kim · Kwangsu Kim
Abstract:
Domain Generalization (DG) aims to train models that generalize to unseen target domains but often overfit to domain-specific features, known as spurious correlations. Gradient-based DG methods typically guide gradients in a dominant direction but may inadvertently reinforce undesired correlations. Recent work employed dropout to regularize parameters with over-predictive directions, but did not explicitly adjust gradient alignment or ensure balanced parameter updates. In this paper, we propose GENIE (Generalization-ENhancing Iterative Equalizer), a novel optimizer that leverages the One-Step Generalization Ratio (OSGR) to quantify each parameter's contribution to loss reduction and assess gradient alignment. By dynamically equalizing OSGR through a preconditioning factor, GENIE prevents a small subset of parameters from dominating optimization, promoting domain-invariant feature learning. Theoretically, it balances convergence contribution and gradient alignment among parameters, achieving a higher OSGR while maintaining SGD’s convergence rate. Empirically, it outperforms existing optimizers and enhances performance when integrated with various DG and single DG methods.
Paperid:1808
Authors:Yifan Sun · Han Wang · Dongbai Li · Gang Wang · Huan Zhang
Abstract:
Benchmark Data Contamination (BDC)—the inclusion of benchmark testing samples in the training set—has raised increasing concerns in Large Language Model (LLM) evaluation, leading to falsely inflated performance estimates and undermining evaluation reliability. To address this, researchers have proposed various mitigation strategies to update existing benchmarks, including modifying original questions or generating new ones based on them. However, a rigorous examination of the effectiveness of these mitigation strategies remains lacking. In this paper, we design a systematic and controlled pipeline along with two novel metrics, fidelity and contamination resistance, to provide a fine-grained and comprehensive assessment of existing BDC mitigation strategies. Previous assessment methods, such as accuracy drop and accuracy matching, focus solely on aggregate accuracy, often leading to incomplete or misleading conclusions. Our metrics address this limitation by emphasizing question-level evaluation result matching. Extensive experiments with 10 LLMs, 5 benchmarks, 20 BDC mitigation strategies, and 2 contamination scenarios reveal that no existing strategy significantly improves resistance over the vanilla case (i.e., no benchmark update) across all benchmarks, and none effectively balances fidelity and contamination resistance. These findings underscore the urgent need for designing more effective BDC mitigation strategies.
Paperid:1809
Authors:Yuxiao Qu · Matthew Yang · Amrith Setlur · Lewis Tunstall · Edward Beeching · Ruslan Salakhutdinov · Aviral Kumar
Abstract:
Training models to efficiently use test-time compute is crucial for improving the reasoning performance of LLMs. While current methods mostly do so via fine-tuning on search traces or running RL against the 0/1 outcome reward, do these approaches efficiently utilize test-time compute? Would these approaches continue to scale as the budget grows? In this paper, we try to answer these questions. We formalize the problem of optimizing test-time compute as a meta reinforcement learning (RL) problem, which provides a principled perspective on spending test-time compute from the lens of exploration and exploitation. It also motivates the use of cumulative regret to measure the efficacy of test-time compute by viewing a long output stream as consisting of several episodes from the model. While current state-of-the-art models do not optimize regret, we show that regret can be minimized by running final 0/1 reward RL regularized by a dense reward bonus, given by the "information gain" from each subsequent block in the output stream. We prescribe an approach for quantifying information gain, which measures the utility of an intermediate segment of tokens towards improving accuracy of the final answer. We instantiate this idea to develop MRT, a new class of fine-tuning methods for optimizing test-time compute. Fine-tuning with MRT leads to substantial improvements in both performance and token efficiency on the AIME dataset.
Paperid:1810
Authors:Shubham Gupta · Thibaut Durand · Graham Taylor · Lilian Bialokozowicz
Abstract:
We present a novel prompt design for Large Language Models (LLMs) tailored to Asynchronous Time Series. Unlike regular time series, which assume values at evenly spaced time points, asynchronous time series consist of events occurring at irregular intervals, each described in natural language. Our approach effectively utilizes the rich natural language of event descriptions, allowing LLMs to benefit from their broad world knowledge for reasoning across different domains and tasks. This allows us to extend the scope of asynchronous time series analysis beyond forecasting to include tasks like anomaly detection and data imputation. We further introduce Stochastic Soft Prompting, a novel prompt-tuning mechanism that significantly improves model performance, outperforming existing fine-tuning methods such as QLoRA. Through extensive experiments on real-world datasets, we demonstrate that our approach achieves state-of-the-art performance across different tasks and datasets.
Paperid:1811
Authors:Tingyi Cai · Yunliang Jiang · Ming Li · Lu Bai · Changqin Huang · Yi Wang
Abstract:
With the growing adoption of Hypergraph Neural Networks (HNNs) to model higher-order relationships in complex data, concerns about their security and robustness have become increasingly important. However, current security research often overlooks the unique structural characteristics of hypergraph models when developing adversarial attack and defense strategies. To address this gap, we demonstrate that hypergraphs are particularly vulnerable to node injection attacks, which align closely with real-world applications. Through empirical analysis, we develop a novel unnoticeable attack approach by monitoring changes in homophily and leveraging this self-regulating property to enhance stealth. Building on these insights, we introduce HyperNear, i.e., unnoticeable Node InjEction Attacks on HypeRgraph neural networks, the first node injection attack framework specifically tailored for HNNs. HyperNear integrates homophily-preserving strategies to optimize both stealth and attack effectiveness. Extensive experiments show that HyperNear achieves excellent performance and generalization, marking the first comprehensive study of adversarial attacks on hypergraphs in a black-box setting. Our code is available at https://anonymous.4open.science/r/HyperNear-C0CA.
Paperid:1812
Authors:Yichuan Deng · Xiaoyu Li · Zhao Song · OMRI WEINSTEIN
Abstract:
A recent work by [Larsen, SODA 2023] introduced a faster combinatorial alternative to Bansal's SDP algorithm for finding a coloring $x \in \{-1, 1\}^n$ that approximately minimizes the discrepancy $\mathrm{disc}(A, x) := \| A x \|_{\infty}$ of a real-valued $m \times n$ matrix $A$. Larsen's algorithm runs in $\widetilde{O}(mn^2)$ time compared to Bansal's $\widetilde{O}(mn^{4.5})$-time algorithm, with a slightly weaker logarithmic approximation ratio in terms of the hereditary discrepancy of $A$ [Bansal, FOCS 2010]. We present a combinatorial $\widetilde{O}(\mathrm{nnz}(A) + n^3)$-time algorithm with the same approximation guarantee as Larsen's, optimal for tall matrices where $m = \mathrm{poly}(n)$. Using a more intricate analysis and fast matrix multiplication, we further achieve a runtime of $\widetilde{O}(\mathrm{nnz}(A) + n^{2.53})$, breaking the cubic barrier for square matrices and surpassing the limitations of linear-programming approaches [Eldan and Singh, RS\&A 2018]. Our algorithm relies on two key ideas: (i) a new sketching technique for finding a projection matrix with a short $\ell_2$-basis using implicit leverage-score sampling, and (ii) a data structure for efficiently implementing the iterative Edge-Walk partial-coloring algorithm [Lovett and Meka, SICOMP 2015], and using an alternative analysis to enable ``lazy'' batch updates with low-rank corrections. Our results nearly close the computational gap between real-valued and binary matrices, for which input-sparsity time coloring was recently obtained by [Jain, Sah and Sawhney, SODA 2023].
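The discrepancy objective is simple to state directly from its definition. As a sanity check (not the paper's algorithm, which scales to large matrices), a brute-force search over all $2^n$ colorings illustrates it:

```python
from itertools import product

def disc(A, x):
    """Discrepancy disc(A, x) = ||A x||_inf of a coloring x for matrix A."""
    return max(abs(sum(a * xi for a, xi in zip(row, x))) for row in A)

def min_disc(A):
    """Exhaustive search over all 2^n colorings x in {-1, +1}^n (tiny n only)."""
    n = len(A[0])
    return min(disc(A, x) for x in product((-1, 1), repeat=n))

# For the 3x3 all-ones matrix, every row sum is x1 + x2 + x3, which is odd
# for any x in {-1, +1}^3, so the minimum discrepancy is 1.
A = [[1, 1, 1],
     [1, 1, 1],
     [1, 1, 1]]
print(min_disc(A))  # 1
```

Exhaustive search is exponential in $n$; the point of the paper is achieving near-input-sparsity time while matching Larsen's approximation guarantee.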
Paperid:1813
Authors:Yifeng Wang · Xueying Zhan · Siyu Huang
Abstract:
As deep learning continues to evolve, the need for data efficiency becomes increasingly important. Since labeling large datasets is both time-consuming and expensive, active learning (AL) provides a promising solution to this challenge by iteratively selecting the most informative subsets of examples to train deep neural networks, thereby reducing the labeling cost. However, the effectiveness of different AL algorithms can vary significantly across data scenarios, and determining which AL algorithm best fits a given task remains a challenging problem. This work presents the first differentiable AL strategy search method, named AutoAL, which is designed on top of existing AL sampling strategies. AutoAL consists of two neural nets, named SearchNet and FitNet, which are optimized concurrently under a differentiable bi-level optimization framework. For any given task, SearchNet and FitNet are iteratively co-optimized using the labeled data, learning how well a set of candidate AL algorithms perform on that task. With the optimal AL strategies identified, SearchNet selects a small subset from the unlabeled pool for querying their annotations, enabling efficient training of the task model. Experimental results demonstrate that AutoAL consistently achieves superior accuracy compared to all candidate AL algorithms and other selective AL approaches, showcasing its potential for adapting and integrating multiple existing AL methods across diverse tasks and domains.
Paperid:1814
Authors:Mahavir Dabas · Si Chen · Charles Fleming · Ming Jin · Ruoxi Jia
Abstract:
Safety alignment is crucial for Large Language Models (LLMs) to resist malicious instructions but often results in over-refusals, where benign prompts are unnecessarily rejected, impairing user experience and model utility. To address this, we introduce ACTOR (Activation-Based Training for Over-Refusal Reduction), a robust and compute- and data-efficient training framework that minimizes over-refusals by utilizing internal activation patterns from diverse queries. ACTOR precisely identifies and adjusts the activation components that trigger refusals, providing stronger control over the refusal mechanism. By fine-tuning only a single model layer, ACTOR effectively reduces over-refusals across multiple benchmarks while maintaining the model's ability to handle harmful queries and preserving overall utility.
Paperid:1815
Authors:Songtao Lu
Abstract:
Extensive research has showcased that a wide range of machine learning problems can be formulated as bilevel optimization, where two levels of learning processes intertwine through distinct sets of optimization variables. However, prevailing approaches often impose stringent assumptions, such as strong convexity of the lower level loss function or uniqueness of the optimal solution, to enable algorithmic development and convergence analysis. These assumptions, though, tend to be overly restrictive in real-world scenarios. In this work, we explore a recently popularized Moreau envelope based reformulation of bilevel optimization problems, accommodating nonconvex objective functions at both levels. We introduce a smoothed primal-dual first-order stochastic algorithm capable of effectively finding Karush-Kuhn-Tucker (KKT) points for this class of nonconvex bilevel optimization problems. A key feature of our algorithm is its ability to dynamically weigh the lower level problems, enhancing its performance particularly in stochastic learning scenarios. Numerical experiments underscore the superiority of our proposed algorithm over existing penalty-based methods in terms of both convergence rate and test accuracy.
Paperid:1816
Authors:Ting Jiang · Hancheng Ye · Yixiao Wang · Zishan Shao · Jingwei Sun · Jingyang Zhang · Jianyi Zhang · Zekai Chen · Yiran Chen · Hai Li
Abstract:
Diffusion models have achieved remarkable success in generative tasks but suffer from high computational costs due to their iterative sampling process and quadratic complexity in attention mechanisms. In this paper, we propose \textbf{Stability-guided Adaptive Diffusion Acceleration (SADA)}, a novel paradigm that unifies step-wise and token-wise sparsity decisions using a shared criterion based on the denoised latent $x_0$. By aligning with modern numerical solvers that rely heavily on $x_0$, SADA offers more stable pruning decisions and preserves important visual details throughout the denoising trajectory. Moreover, SADA includes dedicated mechanisms to approximate both skipped steps and pruned tokens, substantially narrowing the distribution gap caused by aggressive sparsity. Extensive experiments on Stable Diffusion 2 and Stable Diffusion XL demonstrate that SADA significantly accelerates inference without compromising image quality. Notably, SADA achieves an LPIPS of 0.118 on SDXL, outperforming popular training-free methods (e.g., DeepCache) while reducing inference time by up to 1.52×.
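A hypothetical sketch of the flavor of stability criterion described above (the abstract does not specify SADA's exact rule; the tolerance `tol` and the relative-change test are illustrative): skip a denoising step when the current $x_0$ estimate has barely moved.

```python
def should_skip(x0_prev, x0_curr, tol=1e-2):
    """Skip a step when the relative change in the denoised estimate x0 is
    below tol; x0_prev and x0_curr are flattened latent vectors."""
    num = sum((a - b) ** 2 for a, b in zip(x0_prev, x0_curr)) ** 0.5
    den = sum(b ** 2 for b in x0_curr) ** 0.5
    return den > 0 and num / den < tol
```

In a real sampler this decision would be applied per step (and, analogously, per token for token-wise pruning), with the approximation mechanisms the abstract mentions filling in the skipped computation.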
Paperid:1817
Authors:Yuan Li · Jun Hu · Zemin Liu · Bryan Hooi · Jia Chen · Bingsheng He
Abstract:
Graph Neural Networks (GNNs) face significant computational challenges when handling large-scale graphs. To address this, Graph Condensation (GC) methods aim to compress large graphs into smaller, synthetic ones that are more manageable for GNN training. Recently, trajectory matching methods have shown state-of-the-art (SOTA) performance for GC, aligning the model's training behavior on a condensed graph with that on the original graph by guiding the trajectory of model parameters. However, these approaches require repetitive GNN retraining during condensation, making them computationally expensive. To address the efficiency issue, we completely bypass trajectory matching and propose a novel two-stage framework. The first stage, a precomputation stage, performs one-time message passing to extract structural and semantic information from the original graph. The second stage, a diversity-aware adaptation stage, performs class-wise alignment while maximizing the diversity of synthetic features. Remarkably, even with just the precomputation stage, which takes only seconds, our method either matches or surpasses five out of six baseline results. Extensive experiments show that our approach achieves comparable or better performance while being 96× to 2,455× faster than SOTA methods, making it more practical for large-scale GNN applications.
Paperid:1818
Authors:Chen Huang · Skyler Seto · Hadi Pouransari · Mehrdad Farajtabar · Raviteja Vemulapalli · Fartash Faghri · Oncel Tuzel · Barry-John Theobald · Joshua M Susskind
Abstract:
Vision foundation models pretrained on massive data encode rich representations of real-world concepts, which can be adapted to downstream tasks by fine-tuning. However, fine-tuning foundation models on one task often leads to the issue of concept forgetting on other tasks. Recent methods of robust fine-tuning aim to mitigate forgetting of prior knowledge without affecting the fine-tuning performance. Knowledge is often preserved by matching the original and fine-tuned model weights or feature pairs. However, such point-wise matching can be too strong, without explicit awareness of the feature neighborhood structures that encode rich knowledge as well. We propose a novel regularization method, Proxy-FDA, that explicitly preserves the structural knowledge in feature space. Proxy-FDA performs Feature Distribution Alignment (using nearest neighbor graphs) between the pre-trained and fine-tuned feature spaces, and the alignment is further improved by informative proxies that are generated dynamically to increase data diversity. Experiments show that Proxy-FDA significantly reduces concept forgetting during fine-tuning, and we find a strong correlation between forgetting and a distributional distance metric (in comparison to L2 distance). We further demonstrate Proxy-FDA's benefits in various fine-tuning settings (end-to-end, few-shot and continual tuning) and across different tasks like image classification, captioning and VQA.
Paperid:1819
Authors:Jiwoo Hong · Noah Lee · Eunki Kim · Guijin Son · Woojin Chung · Aman Gupta · Shao Tang · James Thorne
Abstract:
The Bradley-Terry (BT) model is widely used in reward modeling for reinforcement learning with human feedback (RLHF). Despite its effectiveness, reward models (RMs) trained with the BT model loss as one-way classifiers are prone to over-optimization, losing generalizability to unseen inputs. In this paper, we study the cause of over-optimization and its downstream effects on the RLHF procedure, highlighting the importance of robustness in RMs. First, we show that the excessive dispersion of hidden state norms is the main source of over-optimization. Correspondingly, we propose batch-wise sum-to-zero regularization (BSR), which enforces the reward sum for each batch to be zero-centered, constraining rewards with abnormally large magnitudes. We assess the impact of BSR on RM robustness through four over-optimization scenarios, where BSR consistently manifests better robustness on unseen inputs. Then, we compare the plain BT model and BSR on RLHF training and empirically show that robust RMs better align the policy to the gold preference model. Finally, we apply BSR to high-quality data and models, surpassing state-of-the-art RMs at the 8B scale by more than 5\% on complex preference prediction tasks. In RLOO training with the 8B RMs, BSR reduces generation length by 40\% on AlpacaEval 2.0 while adding a 7\% increase in win rate, further highlighting that robustness in RMs induces robustness in RLHF training.
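The batch-wise sum-to-zero idea can be sketched in a few lines. The function names and the weight `lam` below are illustrative assumptions, not the paper's implementation:

```python
import math

def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry loss for one preference pair: -log sigmoid(r_c - r_r)."""
    return math.log(1.0 + math.exp(-(r_chosen - r_rejected)))

def bsr_loss(chosen, rejected, lam=0.01):
    """BT loss plus a penalty that pushes the batch's total reward toward zero,
    discouraging rewards with abnormally large magnitudes."""
    base = sum(bt_loss(c, r) for c, r in zip(chosen, rejected)) / len(chosen)
    batch_sum = sum(chosen) + sum(rejected)
    return base + lam * batch_sum ** 2
```

With `lam=0` this reduces to the plain BT objective; the squared-sum term is what zero-centers rewards batch by batch.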
Paperid:1820
Authors:Yu-Jie Zhang · Peng Zhao · Masashi Sugiyama
Abstract:
Non-stationary online learning has drawn much attention in recent years. Despite considerable progress, dynamic regret minimization has primarily focused on convex functions, leaving ``curved'' functions (e.g., squared or logistic loss) underexplored. In this work, we address this gap by showing that the regret can be substantially improved by leveraging the concept of *mixability*, a property that generalizes exp-concavity to effectively capture loss curvature. Let $d$ denote the dimensionality and $P_T$ the path length of comparators that reflects the environmental non-stationarity. We demonstrate that a simple exponential-weights method with fixed-share updates achieves an $\mathcal{O}(d T^{1/3} P_T^{2/3} \log(T/P_T))$ dynamic regret for mixable losses, improving upon the best-known $\mathcal{O}(d^{10/3} T^{1/3} P_T^{2/3} \log T)$ result (Baby & Wang, 2021) in both $d$ and $P_T$. More importantly, this improvement arises from a simple yet powerful analytical framework that exploits the mixability, avoiding the complicated Karush–Kuhn–Tucker-based analysis of the previous work. Additionally, we illustrate the versatility of our framework by extending it to the gradient-norm bound, broadening its applicability to diverse learning scenarios.
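The exponential-weights-with-fixed-share building block the abstract refers to is the classic Herbster-Warmuth scheme. A minimal sketch of one round (not the paper's full dynamic-regret algorithm; `eta` and `alpha` are illustrative):

```python
import math

def fixed_share_update(weights, losses, eta=0.5, alpha=0.1):
    """One round: exponential reweighting by losses, then mix in a uniform
    share so no expert's weight ever collapses to zero."""
    w = [wi * math.exp(-eta * li) for wi, li in zip(weights, losses)]
    total = sum(w)
    w = [wi / total for wi in w]                         # exponential weights
    n = len(w)
    return [(1 - alpha) * wi + alpha / n for wi in w]    # fixed-share mixing
```

The uniform mixing step is what lets the method track a drifting comparator, which is the source of the path-length dependence in the regret bound.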
Paperid:1821
Authors:Yuki Takezawa · Xiaowen Jiang · Anton Rodomanov · Sebastian Stich
Abstract:
Reducing communication complexity is critical for efficient decentralized optimization. The proximal decentralized optimization (PDO) framework is particularly appealing, as methods within this framework can exploit functional similarity among nodes to reduce communication rounds. Specifically, when local functions at different nodes are similar, these methods achieve faster convergence with fewer communication steps. However, existing PDO methods often require highly accurate solutions to subproblems associated with the proximal operator, resulting in significant computational overhead. In this work, we propose the Stabilized Proximal Decentralized Optimization (SPDO) method, which achieves state-of-the-art communication and computational complexities within the PDO framework. Additionally, we refine the analysis of existing PDO methods by relaxing subproblem accuracy requirements and leveraging average functional similarity. Experimental results demonstrate that SPDO significantly outperforms existing methods.
Paperid:1822
Authors:David Dai · Peilin Chen · Malinda Lu · Daniel A. Li · Haowen Wei · Hejie Cui · Paul Pu Liang
Abstract:
Recent advances in clinical AI have enabled remarkable progress across many clinical domains. However, existing benchmarks and models are primarily limited to a small set of modalities and tasks, which hinders the development of largescale multimodal methods that can make holistic assessments of patient health and well-being. To bridge this gap, we introduce Clinical Large-scale Integrative Multimodal Benchmark (CLIMB), a comprehensive clinical benchmark unifying diverse clinical data across imaging, language, temporal, and graph modalities. CLIMB comprises 4.51 million patient samples totaling 19.01 terabytes distributed across 2D imaging, 3D video, time series, graphs, and multimodal data. Through extensive empirical evaluation, we demonstrate that multitask pretraining significantly improves performance on understudied domains, achieving up to 29% improvement in ultrasound and 23% in ECG analysis over single-task learning. Pretraining on CLIMB also effectively improves models' generalization capability to new tasks, and strong unimodal encoder performance translates well to multimodal performance when paired with task-appropriate fusion strategies. Our findings provide a foundation for new architecture designs and pretraining strategies to advance clinical AI research. Code is released at https://anonymous.4open.science/r/climb_submission-5D0E/.
Paperid:1823
Authors:Xinyu Luo · Cedar Site Bai · Bolian Li · Petros Drineas · Ruqi Zhang · Brian Bullins
Abstract:
While popular optimization methods such as SGD, AdamW, and Lion depend on steepest descent updates in either $\ell_2$ or $\ell_\infty$ norms, there remains a critical gap in handling the non-Euclidean structure observed in modern deep network training. In this work, we address this need by introducing a new accelerated $\ell_p$ steepest descent algorithm, called Stacey, which uses interpolated primal-dual iterate sequences to effectively navigate non-Euclidean smooth optimization tasks. In addition to providing novel theoretical guarantees for the foundations of our algorithm, we empirically compare our approach against these popular methods on tasks including image classification and large language model (LLM) pretraining, demonstrating both faster convergence and higher final accuracy. We further evaluate different values of $p$ across various models and datasets, underscoring the importance and efficiency of non-Euclidean approaches over standard Euclidean methods.
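To make the family concrete (this is the plain, non-accelerated $\ell_p$ steepest-descent direction, not Stacey itself): minimizing $\langle g, d\rangle$ over unit-$\ell_p$-norm directions $d$ gives the gradient direction at $p=2$ and approaches sign descent as $p \to \infty$.

```python
def lp_steepest_direction(grad, p):
    """Unit-l_p-norm direction d minimizing <grad, d>; assumes grad != 0.

    Uses the dual exponent q with 1/p + 1/q = 1: d_i is proportional to
    -sign(g_i) * |g_i|^(q-1), normalized so that ||d||_p = 1.
    """
    q = p / (p - 1.0)
    scale = sum(abs(g) ** q for g in grad) ** (1.0 / q)   # ||grad||_q
    sign = lambda g: (g > 0) - (g < 0)
    return [-sign(g) * (abs(g) / scale) ** (q - 1) for g in grad]
```

At `p=2` this recovers the normalized negative gradient; for large `p` every coordinate's magnitude approaches 1, matching sign-based updates like Lion's.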
Paperid:1824
Authors:Cailong Hua · Sivaraman Rajaganapathy · Rebecca Slick · Joseph Vavra · Joseph Muretta · James Ervasti · Murti Salapaka
Abstract:
Deciphering protein folding and unfolding pathways under tension is essential for deepening our understanding of fundamental biological mechanisms. Such insights hold the promise of developing treatments for a range of debilitating and fatal conditions, including muscular disorders like Duchenne Muscular Dystrophy and neurodegenerative diseases such as Parkinson's disease. Single molecule force spectroscopy (SMFS) is a powerful technique for investigating the forces involved in protein domain folding and unfolding. However, SMFS trials often involve multiple protein molecules, necessitating filtering to isolate measurements from single-molecule trials. Currently, manual visual inspection is the primary method for classifying single-molecule data, a process that is time-consuming and requires significant expertise. In this work, we introduce a novel deep learning model employing a dual-branch fusion strategy: one branch integrates the physics of protein molecules, and the other operates independently of physical constraints. This model automates the isolation of single-molecule measurements, significantly enhancing data processing efficiency. To train and validate our approach, we developed a physics-based Monte Carlo engine to simulate force spectroscopy datasets, including trials involving single molecules, multiple molecules, and no molecules. Our model achieves state-of-the-art performance, outperforming five baseline methods on both simulated and experimental datasets. It attains nearly 100\% accuracy across all simulated datasets and an average accuracy of $79.6 \pm 5.2$\% on experimental datasets, using only $\sim$30 training samples, surpassing baseline methods by 11.4\%. Notably, even without expert annotations on experimental data, the model achieves an average accuracy of $72.0 \pm 5.9$\% when pre-trained on corresponding simulated datasets.
With our deep learning approach, the time required to extract meaningful statistics from single-molecule SMFS trials is reduced from a day to under an hour. This work results in SMFS experimental datasets from four important protein molecules crucial to many biological pathways. To support further research, we have made our datasets publicly available and provided a Python-based toolbox (https://anonymous.4open.science/r/AFM_ML-2B8C).
Paperid:1825
Authors:Jinyao Guo · Chengpeng Wang · Xiangzhe Xu · Zian Su · Xiangyu Zhang
Abstract:
Code auditing is a code review process with the goal of finding bugs. Large Language Models (LLMs) have shown substantial potential in this task, offering the ability to analyze programs without compilation and enabling customized bug detection following specified prompts. However, applying LLMs to repository-level code auditing presents notable challenges. The inherent context limits and hallucinations of LLMs can lead to low-quality bug reports. Meanwhile, the large size of software repositories introduces substantial time and token costs, hindering efficiency and scalability in real-world scenarios. This work introduces an autonomous LLM-agent, RepoAudit, designed to enable precise and efficient repository-level code auditing. Equipped with agent memory, RepoAudit explores the code repository on demand, analyzing data-flow facts along different feasible program paths in individual functions. It also introduces a validator that checks the data-flow facts for hallucination mitigation and examines the satisfiability of path conditions of potential buggy paths, which enables RepoAudit to discard false positives. It is shown that RepoAudit, powered by Claude 3.5 Sonnet, successfully finds 38 true bugs in 15 real-world systems, consuming 0.44 hours and $2.54 per project on average, outperforming existing techniques, including the industrial tools Meta Infer and Amazon CodeGuru.
Paperid:1826
Authors:Jingyuan Wang · Zhimei Ren · Ruohan Zhan · Zhengyuan Zhou
Abstract:
Distributionally robust policy learning aims to find a policy that performs well under the worst-case distributional shift, and yet most existing methods for robust policy learning consider the worst-case *joint* distribution of the covariate and the outcome. The joint-modeling strategy can be unnecessarily conservative when we have more information on the source of distributional shifts. This paper studies a more nuanced problem --- robust policy learning under the *concept drift*, when only the conditional relationship between the outcome and the covariate changes. To this end, we first provide a doubly-robust estimator for evaluating the worst-case average reward of a given policy under a set of perturbed conditional distributions. We show that the policy value estimator enjoys asymptotic normality even if the nuisance parameters are estimated with a slower-than-root-$n$ rate. We then propose a learning algorithm that outputs the policy maximizing the estimated policy value within a given policy class $\Pi$, and show that the sub-optimality gap of the proposed algorithm is of the order $\kappa(\Pi)n^{-1/2}$, where $\kappa(\Pi)$ is the entropy integral of $\Pi$ under the Hamming distance and $n$ is the sample size. A matching lower bound is provided to show the optimality of the rate. The proposed methods are implemented and evaluated in numerical studies, demonstrating substantial improvement compared with existing benchmarks.
Paperid:1827
Authors:Kazuki Egashira · Robin Staab · Mark Vero · Jingxuan He · Martin Vechev
Abstract:
With the increasing size of frontier LLMs, post-training quantization has become the standard for memory-efficient deployment. Recent work has shown that basic rounding-based quantization schemes pose security risks, as they can be exploited to inject malicious behaviors into quantized models that remain hidden in full precision. However, existing attacks cannot be applied to more complex quantization methods, such as the GGUF family used in the popular ollama and llama.cpp frameworks. In this work, we address this gap by introducing the first attack on GGUF. Our key insight is that the quantization error -- the difference between the full-precision weights and their (de-)quantized version -- provides sufficient flexibility to construct malicious quantized models that appear benign in full precision. Leveraging this, we develop an attack that trains the target malicious LLM while constraining its weights based on quantization errors. We demonstrate the effectiveness of our attack on three popular LLMs across nine GGUF quantization data types in three diverse attack scenarios: insecure code generation ($\Delta$=$88.7\%$), targeted content injection ($\Delta$=$85.0\%$), and benign instruction refusal ($\Delta$=$30.1\%$). Our attack highlights that (1) the most widely used post-training quantization method is susceptible to adversarial interference, and (2) the complexity of quantization schemes alone is insufficient as a defense.
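The key insight can be illustrated with a toy uniform round-to-nearest quantizer (not GGUF's actual scheme, which uses block-wise scales and offsets): every full-precision weight inside a quantization bin maps to the same quantized value, and that slack is what the attack optimizes within.

```python
def quantize(w, scale=0.1):
    """Toy uniform round-to-nearest quantizer: map w to its bin index."""
    return round(w / scale)

# Two different full-precision weights that land in the same bin are
# indistinguishable after quantization, so an attacker is free to pick
# any point in the bin when shaping the full-precision model.
print(quantize(0.42) == quantize(0.38))  # True: both map to bin 4
```

An attack in this spirit trains the full-precision weights subject to a per-weight bin constraint, so the quantized model's behavior is controlled while the full-precision model looks benign.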
Paperid:1828
Authors:Ruhan Wang · Zhiyong Wang · Chengkai Huang · Rui Wang · Tong Yu · Lina Yao · John C. S. Lui · Dongruo Zhou
Abstract:
For question-answering (QA) tasks, in-context learning (ICL) enables language models (LMs) to generate responses without modifying their parameters by leveraging examples provided in the input. However, the effectiveness of ICL heavily depends on the availability of high-quality examples, which are often scarce due to data privacy constraints, annotation costs, and distribution disparities. A natural solution is to utilize examples stored on client devices, but existing approaches either require transmitting model parameters—incurring significant communication overhead—or fail to fully exploit local datasets, limiting their effectiveness. To address these challenges, we propose Federated In-Context Learning (Fed-ICL), a general framework that enhances ICL through an iterative, collaborative process. Fed-ICL progressively refines responses by leveraging multi-round interactions between clients and a central server, improving answer quality without the need to transmit model parameters. We establish theoretical guarantees for the convergence of Fed-ICL and conduct extensive experiments on standard QA benchmarks, demonstrating that our proposed approach achieves strong performance while maintaining low communication costs.
Paperid:1829
Authors:Qianxiong Xu · Lanyun Zhu · Xuanyi Liu · Guosheng Lin · Cheng Long · Ziyue Li · Rui Zhao
Abstract:
Few-Shot Segmentation (FSS) aims to learn class-agnostic segmentation on few classes to segment arbitrary classes, but at the risk of overfitting. To address this, some methods use the well-learned knowledge of foundation models (e.g., SAM) to simplify the learning process. Recently, SAM 2 has extended SAM by supporting video segmentation, whose class-agnostic matching ability is useful to FSS. A simple idea is to encode support foreground (FG) features as memory, with which query FG features are matched and fused. Unfortunately, the FG objects in different frames of SAM 2's video data are always the same identity, while those in FSS are different identities, i.e., the matching step is incompatible. Therefore, we design Pseudo Prompt Generator to encode pseudo query memory, matching with query features in a compatible way. However, the memories can never be as accurate as the real ones, i.e., they are likely to contain incomplete query FG, but some unexpected query background (BG) features, leading to wrong segmentation. Hence, we further design Iterative Memory Refinement to fuse more query FG features into the memory, and devise a Support-Calibrated Memory Attention to suppress the unexpected query BG features in memory. Extensive experiments have been conducted on PASCAL-5$^i$ and COCO-20$^i$ to validate the effectiveness of our design, e.g., the 1-shot mIoU can be 4.2\% better than the best baseline. The code will be released upon paper acceptance.
Paperid:1830
Authors:Lucy Farnik · Tim Lawson · Conor Houghton · Laurence Aitchison
Abstract:
Sparse autoencoders (SAEs) have been successfully used to discover sparse and human-interpretable representations of the latent activations of language models (LLMs). However, we would ultimately like to understand the computations performed by LLMs and not just their representations. The extent to which SAEs can help us understand computations is unclear because they are not designed to “sparsify” computations in any sense, only latent activations. To solve this, we propose Jacobian sparse autoencoders (JSAEs), which yield not only sparsity in the input and output activations of a given model component but also sparsity in the computation (formally, the Jacobian) connecting them. With a naïve implementation, the Jacobians in LLMs would be computationally intractable due to their size. Our key technical contribution is thus finding an efficient way of computing Jacobians in this setup. We find that JSAEs extract a relatively large degree of computational sparsity while preserving downstream LLM performance approximately as well as traditional SAEs. We also show that JSAEs achieve a greater degree of computational sparsity on pre-trained LLMs than on the equivalent randomized LLM. This shows that the sparsity of the computational graph appears to be a property that LLMs learn through training, and suggests that JSAEs might be more suitable for understanding learned transformer computations than standard SAEs.
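A minimal illustration of the object being sparsified (not the authors' implementation, which computes Jacobians efficiently at LLM scale): the Jacobian of a tiny one-layer ReLU map, with sparsity measured via its L1 norm.

```python
def jacobian_l1(W, x):
    """L1 norm of the Jacobian of relu(W x) with respect to x.

    For this one-layer map the Jacobian at x is diag(1[W x > 0]) @ W, so
    only rows of W whose pre-activation is positive contribute.
    """
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return sum(abs(w) for row, p in zip(W, pre) if p > 0 for w in row)

W = [[1.0, -2.0],
     [0.5, 0.5]]
print(jacobian_l1(W, [1.0, 1.0]))  # only the second row is active: 0.5 + 0.5
```

Penalizing a term like this during training pushes the input-output computation, not just the activations, toward sparsity, which is the distinction JSAEs draw against standard SAEs.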
Paperid:1831
Authors:Clément Pierquin · Aurélien Bellet · Marc Tommasi · Matthieu Boussard
Abstract:
Synthetic data inherits the differential privacy guarantees of the model used to generate it. Additionally, synthetic data may benefit from privacy amplification when the generative model is kept hidden. Some empirical studies suggest this phenomenon, yet a rigorous theoretical understanding is lacking. In this paper, we investigate this question using the simple and well-understood framework of linear regression. Our findings show that the randomness given as input to the generative model does help to obfuscate private information. Conversely, we establish negative results when the adversary can control the seed of the generative model, identifying strategies to maximize privacy leakage from a single query to the model. We believe our results on linear regression can serve as a foundation for deriving more general bounds in the future.
Paperid:1832
Authors:Ye Wang · Sipeng Zheng · Bin Cao · Qianshan Wei · Weishuai Zeng · Qin Jin · Zongqing Lu
Abstract:
Inspired by the recent success of LLMs, the field of human motion understanding has increasingly shifted toward developing large motion models. Despite some progress, current efforts remain far from achieving truly generalist models, primarily due to the lack of massive high-quality data. To address this gap, we present MotionLib, the first million-level dataset for motion generation, which is at least 15$\times$ larger than existing counterparts and enriched with hierarchical text descriptions. Using MotionLib, we train a large motion model named Puppet, demonstrating robust performance across a wide range of human activities, including unseen ones. Through systematic investigation, for the first time, we highlight the importance of scaling both data and model size for advancing motion generation, along with key insights to achieve this goal. To better integrate the motion modality, we propose Motionbook, an innovative motion encoding approach including: (1) a compact yet lossless feature to represent motions; (2) a novel 2D lookup-free motion tokenizer which preserves fine-grained motion details while expanding codebook capacity, significantly enhancing the representational power of motion tokens. We believe this work lays the groundwork for developing more versatile and powerful motion generation models in the future. Code and database will be released at \url{https://anonymous.4open.science/r/puppet}.
Paperid:1833
Authors:Jifei Luo · Wenzheng Wu · Hantao Yao · Lu Yu · Changsheng Xu
Abstract:
Diffusion-based re-ranking methods are effective in modeling the data manifolds through similarity propagation in affinity graphs. However, positive signals tend to diminish over several steps away from the source, reducing discriminative power beyond local regions. To address this issue, we introduce the Locality Preserving Markovian Transition (LPMT) framework, which employs a long-term thermodynamic transition process with multiple states for accurate manifold distance measurement. The proposed LPMT first integrates diffusion processes across separate graphs using Bidirectional Collaborative Diffusion (BCD) to establish strong similarity relationships. Afterwards, Locality State Embedding (LSE) encodes each instance into a distribution for enhanced local consistency. These distributions are interconnected via the Thermodynamic Markovian Transition (TMT) process, enabling efficient global retrieval while maintaining local effectiveness. Evaluations across various tasks verify the effectiveness of the LPMT for instance retrieval.
Paperid:1834
Authors:Alessandro Manenti · Daniele Zambon · Cesare Alippi
Abstract:
Graph neural networks use relational information as an inductive bias to enhance prediction performance. Often, task-relevant relations are unknown, and graph structure learning approaches have been proposed to learn them from data. Given their latent nature, no graph observations are available to provide a direct training signal to the learnable relations. Therefore, graph topologies are typically learned on the prediction task alongside the other graph neural network parameters. In this paper, we demonstrate that minimizing point-prediction losses does not guarantee proper learning of the latent relational information and its associated uncertainty. Conversely, we prove that suitable loss functions on the stochastic model outputs simultaneously grant solving two tasks: (i) learning the unknown distribution of the latent graph and (ii) achieving optimal predictions of the model output. Finally, we propose a sampling-based method that solves this joint learning task. Empirical results validate our theoretical claims and demonstrate the effectiveness of the proposed approach.
Paperid:1835
Authors:Ruiyi Fang · Bingheng Li · Jingyu Zhao · Ruizhi Pu · QIUHAO Zeng · Gezheng Xu · Charles X. Ling · Boyu Wang
Abstract:
Graph Domain Adaptation (GDA) transfers knowledge from labeled source graphs to unlabeled target graphs, addressing the challenge of label scarcity. In this paper, we highlight the significance of graph homophily, a pivotal factor for graph domain alignment, which, however, has long been overlooked in existing approaches. Specifically, our analysis first reveals that homophily discrepancies exist in benchmarks. Moreover, we also show that homophily discrepancies degrade GDA performance from both empirical and theoretical aspects, which further underscores the importance of homophily alignment in GDA. Inspired by this finding, we propose a novel homophily alignment algorithm that employs mixed filters to smooth graph signals, thereby effectively capturing and mitigating homophily discrepancies between graphs. Experimental results on a variety of benchmarks verify the effectiveness of our method.
Paperid:1836
Authors:Mingkang Zhu · Xi Chen · Zhongdao Wang · Bei Yu · Hengshuang Zhao · Jiaya Jia
Abstract:
Recent advancements in reinforcement learning from human feedback have shown that utilizing fine-grained token-level reward models can substantially enhance the performance of Proximal Policy Optimization (PPO) in aligning large language models. However, it is challenging to leverage such learned token-level reward as guidance for Direct Preference Optimization (DPO), since DPO is formulated as a sentence-level bandit problem. To address this challenge, this work decomposes the sentence-level PPO into a sequence of token-level PPO problems, and then frames token-level PPO with token-level reward guidance, from which the closed-form optimal token-level policy and the corresponding token-level reward can be derived. Using the obtained reward and Bradley-Terry model, this work develops a loss function with token-level reward guidance for DPO, and proposes a practical reward guidance based on the induced DPO reward. This formulation enables different tokens to exhibit varying degrees of deviation from reference policy based on their respective rewards, and facilitates the discovery of satisfactory policies upon loss convergence, a property that is rarely observed in conventional preference optimization methods. Experiment results demonstrate that our method achieves substantial performance improvements over DPO, with win rate gains of up to 7.5 points on MT-Bench, 6.2 points on AlpacaEval 2, and 4.3 points on Arena-Hard.
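For context, the standard sentence-level DPO objective that this work refines can be sketched in a few lines; the log-probability values and beta below are made-up numbers, and this is the vanilla DPO loss, not the paper's token-level variant.

```python
import math

# Sketch of the standard sentence-level DPO loss (Bradley-Terry form)
# that the paper extends with token-level reward guidance.
# All numeric log-probabilities here are illustrative assumptions.

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))"""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy prefers the chosen response more than the reference does -> low loss.
low = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-7.0, ref_logp_l=-7.0)
# Policy prefers the rejected response -> high loss.
high = dpo_loss(logp_w=-9.0, logp_l=-5.0, ref_logp_w=-7.0, ref_logp_l=-7.0)
```

Because the margin is computed over whole-sequence log-probabilities, every token deviates from the reference policy uniformly; the paper's token-level decomposition is what allows per-token deviations.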
Paperid:1837
Authors:Yeqing Qiu · Ye XUE · Akang Wang · Yiheng Wang · Qingjiang Shi · Zhiquan Luo
Abstract:
The Max-$k$-Cut problem is a fundamental combinatorial optimization challenge that generalizes the classic $\mathcal{NP}$-complete Max-Cut problem. While relaxation techniques are commonly employed to tackle Max-$k$-Cut, they often lack guarantees of equivalence between the solutions of the original problem and its relaxation. To address this issue, we introduce the Relax-Optimize-and-Sample (ROS) framework. In particular, we begin by relaxing the discrete constraints to the continuous probability simplex form. Next, we pre-train and fine-tune a graph neural network model to efficiently optimize the relaxed problem. Subsequently, we propose a sampling-based construction algorithm to map the continuous solution back to a high-quality Max-$k$-Cut solution. By integrating geometric landscape analysis with statistical theory, we establish the consistency of function values between the continuous solution and its mapped counterpart. Extensive experimental results on random regular graphs and the Gset benchmark demonstrate that the proposed ROS framework effectively scales to large instances with up to $20,000$ nodes in just a few seconds, outperforming state-of-the-art algorithms. Furthermore, ROS exhibits strong generalization capabilities across both in-distribution and out-of-distribution instances, underscoring its effectiveness for large-scale optimization tasks.
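The "Sample" step above, mapping a probability-simplex relaxation back to a discrete labeling, can be sketched with simple independent rounding; the optimization stage is omitted, and the toy graph, probabilities, and sample count are all illustrative assumptions rather than the paper's construction algorithm.

```python
import random

# Sketch of sampling-based rounding from a probability-simplex
# relaxation of Max-k-Cut. The graph and per-node distributions
# are toy assumptions; a real pipeline would obtain the
# probabilities from the optimized relaxation.

def cut_value(edges, assign):
    """Number of edges whose endpoints receive different labels."""
    return sum(1 for u, v in edges if assign[u] != assign[v])

def sample_best_cut(edges, probs, k, n_samples=200, seed=0):
    """Draw labelings from per-node distributions; keep the best cut."""
    rng = random.Random(seed)
    best, best_val = None, -1
    for _ in range(n_samples):
        assign = [rng.choices(range(k), weights=p)[0] for p in probs]
        val = cut_value(edges, assign)
        if val > best_val:
            best, best_val = assign, val
    return best, best_val

# Triangle graph, k = 3: the relaxed solution concentrates each
# node on a different label, so the optimal cut (all 3 edges) is found.
edges = [(0, 1), (1, 2), (0, 2)]
probs = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
assign, val = sample_best_cut(edges, probs, k=3)
```

Repeated sampling makes the rounding robust: even though each draw is random, the best of many draws concentrates near the discrete optimum when the relaxation is informative.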
Paperid:1838
Authors:Zhang Jiasheng · Delvin Zhang · Shuang Liang · Zhengpin Li · ZHITAO YING · Jie Shao
Abstract:
Protein language models often struggle to capture biological functions due to their lack of factual knowledge (e.g., gene descriptions). Existing solutions leverage protein knowledge graphs (PKGs) as auxiliary pretraining objectives, but lack explicit integration of task-oriented knowledge, making them suffer from limited knowledge exploitation and catastrophic forgetting. The root cause is that they fail to align PKGs with task-specific data, forcing their knowledge modeling to adapt to the knowledge-isolated nature of downstream tasks. In this paper, we propose Knowledge-aware retrieval augmented protein language model (Kara), achieving the first task-oriented and explicit integration of PKGs and protein language models. With a knowledge retriever learning to predict linkages between PKG and task proteins, Kara unifies the knowledge integration of the pre-training and fine-tuning stages with a structure-based regularization, mitigating catastrophic forgetting. To ensure task-oriented integration, Kara uses contextualized virtual tokens to extract graph context as task-specific knowledge for new proteins. Experiments show that Kara outperforms existing knowledge-enhanced models in 6 representative tasks, achieving on average 5.1% improvements.
Paperid:1839
Authors:Sushant Agarwal · Amit Jayant Deshpande · Rajmohan Rajaraman · Ravi Sundaram
Abstract:
Previous work in fair machine learning has characterised the Fair Bayes Optimal Classifier (BOC) on a given distribution for both deterministic and randomized classifiers. We study the robustness of the Fair BOC to adversarial noise in the data distribution. A result of Kearns & Li (1988) implies that the accuracy of the deterministic BOC without any fairness constraints is robust (Lipschitz) to malicious noise in the data distribution. We demonstrate that their robustness guarantee breaks down when we add fairness constraints. Hence, we consider the randomized Fair BOC, and our central result is that its accuracy is robust to malicious noise in the data distribution. Our robustness result applies to various fairness constraints: Demographic Parity, Equal Opportunity, and Predictive Equality. Beyond robustness, we demonstrate that randomization leads to better accuracy and efficiency. We show that the randomized Fair BOC is nearly deterministic, and gives randomized predictions on at most one data point, thus enjoying numerous benefits of randomness while using very little of it.
Paperid:1840
Authors:Suho Shin · Chenghao Yang · Haifeng Xu · MohammadTaghi Hajiaghayi
Abstract:
We introduce the tokenized linear bandit (TLB) and multi-armed bandit (TMAB), variants of linear and stochastic multi-armed bandit problems inspired by LLM decoding and alignment. In these problems, at each round $t \in [T]$, a user submits a query (context), and the decision maker (DM) sequentially selects a token irrevocably from a token set. Once the sequence is complete, the DM observes a random utility from the user, whose expectation is represented by a sequence function mapping the chosen token sequence to a nonnegative real value that depends on the query. In both problems, we first show that learning is impossible without any structure on the sequence function. We introduce a natural assumption, diminishing distance with more commons (DDMC), and propose algorithms with regret $\tilde{O}(L\sqrt{T})$ and $\tilde{O}(L\sqrt{T^{2/3}})$ for TLB and TMAB, respectively. As a side product, we obtain the optimality of greedy decoding for LLMs under DDMC, which justifies the unreasonable effectiveness of greedy decoding in several tasks. This also has an immediate application to decoding-time LLM alignment, when the misaligned utility can be represented as the sum of the frozen LLM's utility and a linearly realizable latent function. Though our primary focus in this paper is the theory of TLB and TMAB, we also validate our DDMC assumption using the Llama-3-8B-Instruct model on the TruthfulQA and HH-RLHF datasets.
Paperid:1841
Authors:Andreas Kontogiannis · Konstantinos Papathanasiou · Yi Shen · Giorgos Stamou · Michael Zavlanos · George Vouros
Abstract:
Learning to cooperate in distributed partially observable environments with no communication abilities poses significant challenges for multi-agent deep reinforcement learning (MARL). This paper addresses key concerns in this domain, focusing on inferring state representations from individual agent observations and leveraging these representations to enhance agents' exploration and collaborative task execution policies. To this end, we propose a novel state modelling framework for cooperative MARL, where agents infer meaningful belief representations of the non-observable state, with respect to optimizing their own policies, while filtering redundant and less informative joint state information. Building upon this framework, we propose the MARL SMPE$^2$ algorithm. In SMPE$^2$, agents enhance their own policy's discriminative abilities under partial observability, explicitly by incorporating their beliefs into the policy network, and implicitly by adopting adversarial exploration policies that encourage agents to discover novel, high-value states while improving the discriminative abilities of others. Experimentally, we show that SMPE$^2$ outperforms a plethora of state-of-the-art MARL algorithms in complex fully cooperative tasks from the MPE, LBF, and RWARE benchmarks.
Paperid:1842
Authors:Haoyuan Cai · Zhenghao Peng · Bolei Zhou
Abstract:
Interactive Imitation Learning (IIL) allows agents to acquire desired behaviors through human interventions, but current methods impose high cognitive demands on human supervisors. We propose the Adaptive Intervention Mechanism (AIM), a novel robot-gated IIL algorithm that learns an adaptive criterion for requesting human demonstrations. AIM utilizes a proxy Q-function to mimic the human's intervention rule and adjusts intervention requests based on the alignment between agent and human actions. Our expert-in-the-loop experiments reveal that AIM significantly reduces expert monitoring efforts in both continuous and discrete control tasks. Furthermore, AIM effectively targets safety-critical states for expert assistance, facilitating the acquisition of expert intervention strategies with reduced expert data and environmental interactions.
Paperid:1843
Authors:Antoine Gonon · Nicolas Brisebarre · Elisa Riccietti · Rémi Gribonval
Abstract:
Lipschitz bounds on neural network parameterizations are important to establish generalization, quantization or pruning guarantees, as they control the robustness of the network with respect to parameter changes. Yet, there are few Lipschitz bounds with respect to parameters in the literature, and existing ones only apply to simple feedforward architectures, while also failing to capture the intrinsic rescaling symmetries of ReLU networks. This paper proves a new Lipschitz bound in terms of the so-called path-metrics of the parameters. Since this bound is intrinsically invariant with respect to the rescaling symmetries of the networks, it sharpens previously known Lipschitz bounds. It is also, to the best of our knowledge, the first bound of its kind that is broadly applicable to modern networks such as ResNets, VGGs, U-nets, and many more.
Paperid:1844
Authors:Xianghua Zeng · Hang Su · Zhengyi Wang · Zhiyuan LIN
Abstract:
Offline multi-agent reinforcement learning (MARL) struggles to estimate out-of-distribution states and actions due to the absence of real-time environmental feedback. While diffusion models show promise in addressing these challenges, their application primarily focuses on independently diffusing the historical trajectories of individual agents, neglecting crucial multi-agent coordination dynamics and reducing policy robustness in dynamic environments. In this paper, we propose MCGD, a novel Multi-agent Coordination framework based on Graph Diffusion models to improve the effectiveness and robustness of collaborative policies. Specifically, we begin by constructing a sparse coordination graph that includes continuous node attributes and discrete edge attributes to effectively identify the underlying dynamics of multi-agent interactions. Next, we derive transition probabilities between edge categories and present adaptive categorical diffusion to capture the structural diversity of multi-agent coordination. Leveraging this coordination structure, we define neighbor-dependent forward noise and develop anisotropic diffusion to enhance the action diversity of each agent. Extensive experiments across various multi-agent environments demonstrate that MCGD significantly outperforms existing state-of-the-art baselines in coordination performance and policy robustness in dynamic environments.
Paperid:1845
Authors:Jing Ma · Hanlin Li · Xiang
Abstract:
Test-Time Adaptation (TTA) enables deep neural networks to adapt to arbitrary distributions during inference. Existing TTA algorithms generally tend to select benign samples that help achieve robust online prediction and stable self-training. Although malicious samples that would undermine the model's optimization should be filtered out, this also leads to a waste of test data. To alleviate this issue, we focus on how to make full use of the malicious test samples for TTA by transforming them into benign ones, and propose a plug-and-play method, PTTA. The core of our solution lies in the purification strategy, which retrieves benign samples having opposite effects on the objective function to perform Mixup with malicious samples, based on a saliency indicator for encoding benign and malicious data. This strategy results in effective utilization of the information in malicious samples and an improvement in the model's online test accuracy. In this way, we can directly apply the purification loss to existing TTA algorithms without the need to carefully adjust the sample selection threshold. Extensive experiments on four types of TTA tasks, spanning classification, segmentation, and adversarial defense, demonstrate the effectiveness of our method.
Paperid:1846
Authors:Parshin Shojaee · Ngoc Hieu Nguyen · Kazem Meidani · Amir Barati Farimani · Khoa Doan · Chandan Reddy
Abstract:
Scientific equation discovery is a fundamental task in the history of scientific progress, enabling the derivation of laws governing natural phenomena. Recently, Large Language Models (LLMs) have gained interest for this task due to their potential to leverage embedded scientific knowledge for hypothesis generation. However, evaluating the true discovery capabilities of these methods remains challenging, as existing benchmarks often rely on common equations that are susceptible to memorization by LLMs, leading to inflated performance metrics that do not reflect actual discovery. In this paper, we introduce LLM-SRBench, a comprehensive benchmark with 239 challenging problems across four scientific domains specifically designed to evaluate LLM-based scientific equation discovery methods while preventing trivial memorization. Our benchmark comprises two main categories: LSR-Transform, which transforms common physical models into less common mathematical representations to test reasoning beyond memorization, and LSR-Synth, which introduces synthetic, discovery-driven problems requiring data-driven reasoning. Through extensive evaluation of several state-of-the-art methods on LLM-SRBench, using both open and closed LLMs, we find that the best-performing system so far achieves only 31.5% symbolic accuracy. These findings highlight the challenges of scientific equation discovery, positioning LLM-SRBench as a valuable resource for future research.
Paperid:1847
Authors:Jiecheng Lu · Shihao Yang
Abstract:
Autoregressive attention-based time series forecasting (TSF) has drawn increasing interest, with mechanisms like linear attention often outperforming vanilla attention. However, deeper Transformer architectures frequently misalign with autoregressive objectives, obscuring the underlying VAR structure embedded within linear attention and hindering their ability to capture the data generative processes in TSF. In this work, we first show that a single linear attention layer can be interpreted as a dynamic vector autoregressive (VAR) structure. We then explain that existing multi-layer Transformers have structural mismatches with the autoregressive forecasting objective, which impair interpretability and generalization ability. To address this, we show that by rearranging the MLP, attention, and input-output flow, multi-layer linear attention can also be aligned as a VAR model. Then, we propose Structural Aligned Mixture of VAR (SAMoVAR), a linear Transformer variant that integrates interpretable dynamic VAR weights for multivariate TSF. By aligning the Transformer architecture with autoregressive objectives, SAMoVAR delivers improved performance, interpretability, and computational efficiency, compared to SOTA TSF models.
Paperid:1848
Authors:Thanh Tran · Viet Hoang Tran · Thanh Chu · Trang Pham · Laurent Ghaoui · Tam Le · Tan Nguyen
Abstract:
Tree-Sliced methods have recently emerged as an alternative to the traditional Sliced Wasserstein (SW) distance, replacing one-dimensional lines with tree-based metric spaces and incorporating a splitting mechanism for projecting measures. This approach enhances the ability to capture the topological structures of integration domains in Sliced Optimal Transport while maintaining low computational costs. Building on this foundation, we propose a novel nonlinear projectional framework for the Tree-Sliced Wasserstein (TSW) distance, substituting the linear projections in earlier versions with general projections, while ensuring the injectivity of the associated Radon Transform and preserving the well-definedness of the resulting metric. By designing appropriate projections, we construct efficient metrics for measures on both Euclidean spaces and spheres. Finally, we validate our proposed metric through extensive numerical experiments for Euclidean and spherical datasets. Applications include gradient flows, self-supervised learning, and generative models, where our methods demonstrate significant improvements over recent SW and TSW variants.
Paperid:1849
Authors:Sirui Lin · Zijun Gao · Jose Blanchet · Peter Glynn
Abstract:
Causal estimands can vary significantly depending on the relationship between outcomes in treatment and control groups, leading to wide partial identification (PI) intervals that impede decision making. Incorporating covariates can substantially tighten these bounds, but requires determining the range of PI over probability models consistent with the joint distributions of observed covariates and outcomes in treatment and control groups. This problem is known to be equivalent to a conditional optimal transport (COT) optimization task, which is more challenging than standard optimal transport (OT) due to the additional conditioning constraints. In this work, we study a tight relaxation of COT that effectively reduces it to standard OT, leveraging its well-established computational and theoretical foundations. Our relaxation incorporates covariate information and ensures narrower PI intervals for any value of the penalty parameter, while becoming asymptotically exact as the penalty increases to infinity. This approach preserves the benefits of covariate adjustment in PI and results in a data-driven estimator for the PI set that is easy to implement using existing OT packages. We analyze the convergence rate of our estimator and demonstrate the effectiveness of our approach through extensive simulations, highlighting its practical use and superior performance compared to existing methods.
Paperid:1850
Authors:Shu-wen Yang · Byeonggeun Kim · Kuan Po Huang · Qingming Tang · Huy Phan · Bo-Ru Lu · Harshavardhan Sundar · Shalini Ghosh · Hung-yi Lee · Chieh-Chi Kao · Chao Wang
Abstract:
Autoregressive next-token prediction with the Transformer decoder has become a de facto standard in large language models (LLMs), achieving remarkable success in Natural Language Processing (NLP) at scale. Extending this paradigm to audio poses unique challenges due to its inherently continuous nature. We research audio generation with a causal language model (LM) without discrete tokens. We leverage token-wise diffusion to model the continuous distribution of the next continuous-valued token. Our approach delivers significant improvements over the previous discrete solution, AudioGen, achieving 20% and 40% relative gains on AudioCaps in Frechet Audio Distance (FAD) and Kullback-Leibler (KL) divergence, respectively. Additionally, we propose a novel masked next-token prediction task that incorporates masked prediction into the causal LM framework. On AudioCaps, the innovation yields 41% and 33% relative FAD improvements over AudioGen Base (285M) and AudioGen Large (1B) models, respectively, and is on par with the state-of-the-art (SOTA) diffusion models. Furthermore, we achieve these results with significantly fewer parameters: 193M for our Base and 462M for our Large models.
Paperid:1851
Authors:Yuanchao Xu · Kaidi Shao · Nikos Logothetis · Zhongwei Shen
Abstract:
Analyzing long-term behaviors in high-dimensional nonlinear dynamical systems remains challenging. The Koopman operator framework provides a powerful global linearization approach, though existing methods for approximating its spectral components often suffer from theoretical limitations and rely on predefined dictionaries. While Residual Dynamic Mode Decomposition (ResDMD) introduced the spectral residual to assess the accuracy of Koopman operator approximation, it only filters precomputed spectra, which prevents it from fully discovering the Koopman operator's complete spectral information (a limitation sometimes referred to as the 'spectral inclusion' problem). We introduce ResKoopNet (Residual-based Koopman-learning Network), a novel method that addresses this limitation by explicitly minimizing the spectral residual to compute Koopman eigenpairs, which can identify a more precise and complete spectrum of the Koopman operator. This approach provides theoretical guarantees while maintaining computational adaptability through a neural network implementation. Experiments on physical and biological systems demonstrate ResKoopNet's superior accuracy in spectral approximation compared to existing methods, particularly for high-dimensional systems with continuous spectra, making it an effective tool for analyzing complex dynamical systems.
Paperid:1852
Authors:Ravi Ghadia · Avinash Kumar · Gaurav Jain · Prashant J. Nair · Poulami Das
Abstract:
Autoregressive Transformers rely on key-value (KV) caching to accelerate inference. However, the linear growth of the KV cache with context length leads to excessive memory consumption and bandwidth constraints. This bottleneck is particularly problematic in real-time applications, such as chatbots and interactive assistants, where low latency and memory efficiency are critical. Existing methods truncate distant tokens or compress states in a lossy manner, sacrificing accuracy by discarding vital context or introducing bias. We propose ${MorphKV}$, an inference-time technique that maintains a constant-sized KV cache while preserving model accuracy. ${MorphKV}$ balances long-range dependencies and local coherence. It eliminates early-token bias while retaining high-fidelity context by dynamically prioritizing tokens through correlation-aware selection. Unlike heuristic retention (e.g., sliding windows) or lossy compression, ${MorphKV}$ iteratively refines the KV cache via lightweight updates guided by attention patterns of recent tokens. This is crucial for tasks like content creation and code generation. Our studies using Llama3.1-8B-Instruct show $\textbf{86.5}$% memory savings and $\textbf{13.5}$% higher accuracy than prior methods, enabling real-world efficient deployment.
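A constant-sized, score-guided KV cache of the general kind described above can be sketched as follows; this is a generic attention-guided retention policy, not MorphKV itself, and the window size, token scores, and eviction rule are all illustrative assumptions.

```python
# Toy sketch of a constant-size KV cache with score-based eviction
# (a generic attention-guided retention policy, not the MorphKV
# algorithm; window size, scores, and the eviction rule are
# illustrative assumptions).

def update_cache(cache, token, score, recent_window=2, max_size=4):
    """Keep the newest tokens plus the highest-scoring older ones."""
    cache.append((token, score))
    if len(cache) <= max_size:
        return cache
    recent = cache[-recent_window:]          # always retained
    older = cache[:-recent_window]
    # Evict the lowest-scoring token outside the recency window.
    older.remove(min(older, key=lambda ts: ts[1]))
    return older + recent

cache = []
stream = [("a", 0.9), ("b", 0.1), ("c", 0.8),
          ("d", 0.2), ("e", 0.5), ("f", 0.3)]
for tok, s in stream:
    cache = update_cache(cache, tok, s)
# Cache stays at 4 entries: high-scoring "a" and "c" survive,
# low-scoring "b" and "d" are evicted.
```

Unlike a pure sliding window, this keeps memory constant while still retaining distant tokens that the scores mark as important, which is the trade-off the abstract contrasts against heuristic retention.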
Paperid:1853
Authors:Viet Hoang Tran · Trang Pham · Tho Tran Huu · Minh-Khoi Nguyen-Nhat · Thanh Chu · Tam Le · Tan Nguyen
Abstract:
Many variants of Optimal Transport (OT) have been developed to address its heavy computation. Among them, notably, Sliced Wasserstein (SW) is widely used in applications, projecting the OT problem onto one-dimensional lines and leveraging the closed-form expression of the univariate OT to reduce the computational burden. However, projecting measures onto low-dimensional spaces can lead to a loss of topological information. To mitigate this issue, in this work, we propose to replace one-dimensional lines with a more intricate structure, called \emph{tree systems}. This structure is metrizable by a tree metric, which yields a closed-form expression for OT problems on tree systems. We provide an extensive theoretical analysis to formally define tree systems with their topological properties, introduce the concept of splitting maps, which operate as the projection mechanism onto these structures, and finally propose a novel variant of Radon transform for tree systems and verify its injectivity. This framework leads to an efficient metric between measures, termed Tree-Sliced Wasserstein distance on Systems of Lines (TSW-SL). By conducting a variety of experiments on gradient flows, image style transfer, and generative models, we illustrate that our proposed approach performs favorably compared to SW and its variants.
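The vanilla SW baseline that tree-sliced variants build on can be sketched directly: project both point sets onto random lines and apply the closed-form one-dimensional Wasserstein distance (sorted matching). The toy 2-D point sets and the number of projections below are illustrative assumptions.

```python
import math
import random

# Minimal sketch of the vanilla Sliced Wasserstein distance between
# two equal-size empirical measures in 2-D. Tree-sliced variants
# replace the random lines below with tree systems.

def sliced_wasserstein(X, Y, n_proj=100, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        d = (math.cos(theta), math.sin(theta))
        # 1-D projections of each measure onto direction d.
        px = sorted(x[0] * d[0] + x[1] * d[1] for x in X)
        py = sorted(y[0] * d[0] + y[1] * d[1] for y in Y)
        # Closed-form W1 between equal-size empirical 1-D measures:
        # match sorted samples and average the absolute differences.
        total += sum(abs(a - b) for a, b in zip(px, py)) / len(px)
    return total / n_proj

X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
shifted = [(x + 3.0, y) for x, y in X]
# Distance of a measure to itself is 0; to a shifted copy it is positive.
```

The sort inside each projection is exactly the closed-form univariate OT the abstract refers to; it is what keeps SW cheap relative to solving a full transport problem.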
Paperid:1854
Authors:Kenneth Li · Yida Chen · Fernanda Viégas · Martin Wattenberg
Abstract:
In large language model (LLM) pre-training, data quality is believed to determine model quality. In this paper, we challenge the notion of "quality" in the context of post-training. Specifically, we explore the possibility that pre-training on more toxic data can lead to better control in post-training, ultimately decreasing a model's output toxicity. First, we use a toy experiment to study how data composition affects the geometry of features in the representation space. Next, through controlled experiments with Olmo-1B models trained on varying ratios of clean and toxic data, we find that the concept of toxicity enjoys a less entangled linear representation as the proportion of toxic data increases. Furthermore, we show that although toxic data increases the generational toxicity of the base model, it also makes the toxicity easier to remove. Evaluations on Toxigen and Real Toxicity Prompts demonstrate that models trained on toxic data achieve a better trade-off between reducing generational toxicity and preserving general capabilities when detoxifying techniques such as inference-time intervention (ITI) are applied. Our findings suggest that, with post-training taken into account, bad data may lead to good models.
Paperid:1855
Authors:Manan Tayal · Aditya Singh · Shishir Nadubettu Yadukumar · Somil Bansal
Abstract:
As autonomous systems become more ubiquitous in daily life, ensuring high performance with guaranteed safety is crucial. However, safety and performance could be competing objectives, which makes their co-optimization difficult. Learning-based methods, such as Constrained Reinforcement Learning (CRL), achieve strong performance but lack formal safety guarantees due to safety being enforced as soft constraints, limiting their use in safety-critical settings. Conversely, formal methods such as Hamilton-Jacobi (HJ) Reachability Analysis and Control Barrier Functions (CBFs) provide rigorous safety assurances but often neglect performance, resulting in overly conservative controllers. To bridge this gap, we formulate the co-optimization of safety and performance as a state-constrained optimal control problem, where performance objectives are encoded via a cost function and safety requirements are imposed as state constraints. We demonstrate that the resultant value function satisfies a Hamilton-Jacobi-Bellman (HJB) equation, which we approximate efficiently using a novel physics-informed machine learning framework. In addition, we introduce a conformal prediction-based verification strategy to quantify the learning errors, recovering a high-confidence safety value function, along with a probabilistic error bound on performance degradation. Through several case studies, we demonstrate the efficacy of the proposed framework in enabling scalable learning of safe and performant controllers for complex, high-dimensional autonomous systems.
Paperid:1856
Authors:Xu Liu · Juncheng Liu · Gerald Woo · Taha Aksu · Yuxuan Liang · Roger Zimmermann · Chenghao Liu · Junnan Li · Silvio Savarese · Caiming Xiong · Doyen Sahoo
Abstract:
Achieving effective unified pretraining on large time series corpora remains an open challenge in developing time series foundation models. Existing foundation models, such as Moirai, introduce multiple projection layers for time series of different frequencies to account for high data heterogeneity. We identify major drawbacks to this human-imposed frequency-level model specialization. First, frequency is not a reliable indicator for grouping pretraining data. For example, time series with different frequencies can display similar patterns, while those with the same frequency may exhibit varied patterns. Furthermore, time series can display varied distributions even within a short window. Frequency-level specialization overlooks the diversity at this granularity. To address these issues, this paper introduces Moirai-MoE, excluding human-defined data groupings while delegating the modeling of diverse time series patterns to the sparse mixture of experts (MoE) within Transformers. With this design, Moirai-MoE eliminates reliance on human-defined heuristics and enables automatic token-level specialization. Extensive evaluations on 39 datasets demonstrate the superiority of Moirai-MoE over state-of-the-art baselines. This study also conducts comprehensive model analyses to explore the inner workings of time series MoE foundation models and provides insights for future research.
Paperid:1857
Authors:Simone Drago · Marco Mussi · Alberto Maria Metelli
Abstract:
The success of sequential decision-making approaches, such as *reinforcement learning* (RL), is closely tied to the availability of a reward feedback. However, designing a reward function that encodes the desired objective is a challenging task. In this work, we address a more realistic scenario: sequential decision-making with preference feedback provided, for instance, by a human expert. We aim to build a theoretical basis linking *preferences*, *utilities* (i.e., non-Markovian rewards), and Markovian *rewards*, and we study the connections between them. First, we model preference feedback using a partial (pre)order over trajectories, enabling the presence of incomparabilities that are common when preferences are provided by humans but are surprisingly overlooked in existing works. Second, to provide a theoretical justification for a common practice, we investigate how a preference relation can be approximated by a multi-objective utility. We introduce a notion of preference-utility compatibility and analyze the computational complexity of this transformation, showing that constructing the minimum-dimensional utility is $\textsf{NP}$-hard. Third, we propose a novel concept of preference-based policy dominance that does not rely on utilities or rewards and analyze the computational complexity of assessing it. Fourth, we develop a computationally efficient algorithm to approximate a utility using (Markovian) rewards and quantify the error in terms of the suboptimality of the optimal policy induced by the approximating reward. This work aims to lay the foundation for a principled approach to sequential decision-making from preference feedback, with promising potential applications in RL from human feedback.
Paperid:1858
Authors:R Majumdar · Seyed Mahmoud Salamati · Sadegh Soudjani
Abstract:
Learning to control an unknown dynamical system with respect to high-level temporal specifications is an important problem in control theory. We present the first regret-free online algorithm for learning a controller for linear temporal logic (LTL) specifications for systems with unknown dynamics. We assume that the underlying (unknown) dynamics is modeled by a finite-state, finite-action Markov decision process (MDP). Our core technical result is a regret-free learning algorithm for infinite-horizon reach-avoid problems on MDPs. For general LTL specifications, we show that the synthesis problem can be reduced to a reach-avoid problem once the graph structure is known. Additionally, we provide an algorithm for learning the graph structure, assuming knowledge of a minimum transition probability, which operates independently of the main regret-free algorithm. Our LTL controller synthesis algorithm provides sharp bounds on how close we are to achieving optimal behavior after a finite number of learning episodes. In contrast, previous algorithms for LTL synthesis only provide asymptotic guarantees, which offer no insight into the transient performance during the learning phase.
Paperid:1859
Authors:Rashid Mushkani · Perampalli Shravan Nayak · Hugo Berard · Allison Cohen · Shin Koseki · Hadrien Bertrand
Abstract:
We introduce the Local Intersectional Visual Spaces (LIVS) dataset, a benchmark for multi-criteria alignment of text-to-image (T2I) models in inclusive urban planning. Developed through a two-year participatory process with 30 community organizations, LIVS encodes diverse spatial preferences across 634 initial concepts, consolidated into six core criteria—Accessibility, Safety, Comfort, Invitingness, Inclusivity, and Diversity—through 37,710 pairwise comparisons. Using Direct Preference Optimization (DPO) to fine-tune Stable Diffusion XL, we observed a measurable increase in alignment with community preferences, though a significant proportion of neutral ratings highlights the complexity of modeling intersectional needs. Additionally, as annotation volume increases, accuracy shifts further toward the DPO-tuned model, suggesting that larger-scale preference data enhances fine-tuning effectiveness. LIVS underscores the necessity of integrating context-specific, stakeholder-driven criteria into generative modeling and provides a resource for evaluating AI alignment methodologies across diverse socio-spatial contexts.
Paperid:1860
Authors:JINHAO LIANG · Jacob Christopher · Sven Koenig · Ferdinando Fioretto
Abstract:
Recent advances in diffusion models hold significant potential in robotics, enabling the generation of diverse and smooth trajectories directly from raw representations of the environment. Despite this promise, applying diffusion models to motion planning remains challenging due to their difficulty in enforcing critical constraints, such as collision avoidance and kinematic feasibility. These limitations become even more pronounced in Multi-Robot Motion Planning (MRMP), where multiple robots must coordinate in shared spaces. To address this challenge, this work proposes Simultaneous MRMP Diffusion (SMD), a novel approach integrating constrained optimization into the diffusion sampling process to produce collision-free, kinematically feasible trajectories. Additionally, the paper introduces a comprehensive MRMP benchmark to evaluate trajectory planning algorithms across scenarios with varying robot densities, obstacle complexities, and motion constraints. Experimental results show SMD consistently outperforms classical and learning-based motion planners, achieving higher success rates and efficiency in complex multi-robot environments.
Paperid:1861
Authors:Han Wang · Yang Xu · Wenbin Lu · Rui Song
Abstract:
Off-Policy Evaluation (OPE) aims to estimate the value of a target policy using offline data collected from potentially different policies. In real-world applications, however, logged data often suffers from missingness. While OPE has been extensively studied in the literature, a theoretical understanding of how missing data affects OPE results remains unclear. In this paper, we investigate OPE in the presence of monotone missingness and theoretically demonstrate that the value estimates remain unbiased under ignorable missingness but can be biased under nonignorable (informative) missingness. To retain the consistency of value estimation, we propose an inverse probability weighting value estimator and conduct statistical inference to quantify the uncertainty of the estimates. Through a series of numerical experiments, we empirically demonstrate that our proposed estimator yields more reliable value inference under missing data.
Paperid:1862
Authors:Seth Karten · Andy Nguyen · Chi Jin
Abstract:
We introduce \texttt{Pok\'eChamp}, a Large Language Model (LLM) powered game-theoretic agent for two-player competitive Pok\'emon battles that demonstrates the potential for online self-improvement in AI systems. Inspired by the concept of AI takeoff, \texttt{Pok\'eChamp} uses an LLM prior and high-Elo human data to model minimax search without additional training, simulating a form of self-play that could lead to rapid capability gains. \texttt{Pok\'eChamp} uses a depth-limited minimax search online where the LLM replaces three key components: 1) action sampling from the LLM guided by prompts (including a one-step lookahead feature), 2) opponent modeling via the historical likelihood of actions from our dataset to model the effect of LLM-predicted opponent actions, and 3) state value calculation for the LLM to reflect on each intrinsic state. \texttt{Pok\'eChamp} outperforms all existing LLM-based (76\%) and rule-based bots (84\%) by an enormous margin, including winning consistently (64\%) against prior human-parity work run with a frontier model, GPT-4o, while ours uses an open-source 8 billion parameter Llama 3.1 model. \texttt{Pok\'eChamp} achieves expert performance in the top 10\% of players on the online ladder against competitive human players at an Elo of 1500. Finally, we collect the largest Pok\'emon battling dataset, including 1 million+ games with 150k+ high Elo games, prepare a series of battling benchmarks based on real player data and puzzles to analyze specific battling abilities, and provide crucial updates to the local game engine, providing a foundation for future research on AI self-play and capability amplification in complex, partially observable environments. Our code is available \href{https://sites.google.com/view/pokechamp-llm}{online}.
Paperid:1863
Authors:Lukas Fluri · Leon Lang · Alessandro Abate · Patrick Forré · David Krueger · Joar Skalse
Abstract:
In reinforcement learning, specifying reward functions that capture the intended task can be very challenging. Reward learning aims to address this issue by learning the reward function. However, a learned reward model may have a low error on the data distribution, and yet subsequently produce a policy with large regret. We say that such a reward model has an error-regret mismatch. The main source of an error-regret mismatch is the distributional shift that commonly occurs during policy optimization. In this paper, we mathematically show that a sufficiently low expected test error of the reward model guarantees low worst-case regret, but that for any fixed expected test error, there exist realistic data distributions that allow for error-regret mismatch to occur. We then show that similar problems persist even when using policy regularization techniques, commonly employed in methods such as RLHF. We hope our results stimulate the theoretical and empirical study of improved methods to learn reward models, and better ways to measure their quality reliably.
Paperid:1864
Authors:Qunzhong WANG · Xiangguo Sun · Hong Cheng
Abstract:
In recent years, graph prompting has emerged as a promising research direction, enabling the learning of additional tokens or subgraphs appended to original graphs without requiring retraining of pre-trained graph models across various applications. This novel paradigm, shifting from the traditional "pre-training and fine-tuning" to "pre-training and prompting," has shown significant empirical success in simulating graph data operations, with applications ranging from recommendation systems to biological networks and graph transfer. However, despite its potential, the theoretical underpinnings of graph prompting remain underexplored, raising critical questions about its fundamental effectiveness. The lack of rigorous theoretical proof of why and how much it works is more like a "dark cloud" over the graph prompting area for deeper research. To fill this gap, this paper introduces a theoretical framework that rigorously analyzes graph prompting from a data operation perspective. Our contributions are threefold: First, we provide a formal guarantee theorem, demonstrating graph prompts' capacity to approximate graph transformation operators, effectively linking upstream and downstream tasks. Second, we derive upper bounds on the error of these data operations for a single graph and extend this discussion to batches of graphs, which are common in graph model training. Third, we analyze the distribution of data operation errors, extending our theoretical findings from linear graph models (e.g., GCN) to non-linear graph models (e.g., GAT). Extensive experiments support our theoretical results and confirm the practical implications of these guarantees.
Paperid:1865
Authors:Atsushi Nitanda · Anzelle Lee · Damian Kai · Mizuki Sakaguchi · Taiji Suzuki
Abstract:
Mean-field Langevin dynamics (MFLD) is an optimization method derived by taking the mean-field limit of noisy gradient descent for two-layer neural networks in the mean-field regime. Recently, the propagation of chaos (PoC) for MFLD has gained attention as it provides a quantitative characterization of the optimization complexity in terms of the number of particles and iterations. A remarkable progress by Chen et al. (2022) showed that the approximation error due to finite particles remains uniform in time and diminishes as the number of particles increases. In this paper, by refining the defective log-Sobolev inequality---a key result from that earlier work---under the neural network training setting, we establish an improved PoC result for MFLD, which removes the exponential dependence on the regularization coefficient from the particle approximation term of the optimization complexity. As an application, we propose a PoC-based model ensemble strategy with theoretical guarantees.
Paperid:1866
Authors:Chen Wei · Chi Zhang · Jiachen Zou · Haotian Deng · Dietmar Heinke · Quanying Liu
Abstract:
Human decision-making in cognitive tasks and daily life exhibits considerable variability, shaped by factors such as task difficulty, individual preferences, and personal experiences. Understanding this variability across individuals is essential for uncovering the perceptual and decision-making mechanisms that humans rely on when faced with uncertainty and ambiguity. We present a computational framework that combines perceptual boundary sampling in ANNs and human behavioral experiments to systematically investigate this phenomenon. Our perceptual boundary sampling algorithm generates stimuli along ANN decision boundaries that intrinsically induce significant perceptual variability. The efficacy of these stimuli is empirically validated through large-scale behavioral experiments involving 246 participants across 116,715 trials, culminating in the varMNIST dataset containing 19,943 systematically annotated images. Through personalized model alignment and adversarial generation, we establish a reliable method for simultaneously predicting and manipulating the divergent perceptual decisions of pairs of participants. This work bridges the gap between computational models and human individual difference research, providing new tools for personalized perception analysis.
Paperid:1867
Authors:Xize Liang · Lin Yang · Jie Wang · Yiyang Lu · Runyu Wu · Hanzhu Chen · Jianye Hao
Abstract:
The multi-domain fine-tuning of large language models (LLMs) confronts a notorious trade-off among abilities across domains. Existing studies attribute this trade-off to the conflicts between samples rooted in inherent semantics. Recent approaches attempt to mitigate these conflicts through empirical investigation or heuristic strategies. However, without a fundamental understanding of interactions between samples, they yield only marginal improvements, while incurring substantial trial-and-error costs. To address this challenge, we move beyond empirical studies by modeling interactions between samples as their influence on each other's loss, estimated using gradients. Intriguingly, we find that these interactions evolve throughout training rather than being purely determined by inherent semantics. Building on this insight, we propose EVolving Interaction-guided Curriculum (EVIC), which iteratively selects samples that positively influence the overall dataset for training. By dynamically adapting the training curriculum to prioritize samples that contribute the most to the model training, EVIC effectively mitigates conflicts and improves the sample efficiency. Extensive experiments on a mixed dataset covering coding, math, and general tasks with several model architectures show that EVIC significantly outperforms all baselines across diverse capabilities.
Paperid:1868
Authors:Lucas Rosenblatt · Yuliia Lut · Ethan Turok · Marco Medina · Rachel Cummings
Abstract:
Imbalanced learning occurs in classification settings where the distribution of class labels is highly skewed in the training data, such as when predicting rare diseases or in fraud detection. This class imbalance presents a significant algorithmic challenge, which can be further exacerbated when privacy-preserving techniques such as differential privacy are applied to protect sensitive training data. Our work formalizes these challenges and provides a number of algorithmic solutions. We consider DP variants of pre-processing methods that privately augment the original dataset to reduce the class imbalance, alongside DP variants of in-processing techniques, which adjust the learning algorithm to account for the imbalance. For each method, we either adapt an existing imbalanced learning technique to the private setting or demonstrate its incompatibility with differential privacy. Finally, we empirically evaluate these privacy-preserving imbalanced learning methods under various data and distributional settings. We find that private synthetic data methods perform well as a data pre-processing step, while class-weighted ERMs are an alternative in higher-dimensional settings where private synthetic data suffers from the curse of dimensionality.
Paperid:1869
Authors:Tuan Truong · Chau Nguyen · Huy Nguyen · Minh Le · Trung Le · Nhat Ho
Abstract:
Low-rank Adaptation (LoRA) has emerged as a powerful and efficient method for fine-tuning large-scale foundation models. Despite its popularity, the theoretical understanding of LoRA has remained underexplored. In this paper, we present a theoretical analysis of LoRA by examining its connection to the Mixture of Experts models. Under this framework, we show that a simple technique, reparameterizing LoRA matrices, can notably accelerate the low-rank matrix estimation process. In particular, we prove that reparameterization can reduce the data needed to achieve a desired estimation error from an exponential to a polynomial scale. Motivated by this insight, we propose Reparameterized Low-Rank Adaptation (RepLoRA), incorporating a lightweight MLP to reparameterize the LoRA matrices. Extensive experiments across multiple domains demonstrate that RepLoRA consistently outperforms vanilla LoRA. With limited data, RepLoRA surpasses LoRA by a substantial margin of up to 40.0% and achieves LoRA's performance using only 30.0% of the training data, highlighting the theoretical and empirical robustness of our PEFT method.
Paperid:1870
Authors:Woohyeon Park · Woojin Kim · Jaeik Kim · Jaeyoung Do
Abstract:
Despite significant advancements in Large Vision-Language Models (LVLMs), the performance of existing LVLMs remains hindered by object hallucination, a critical challenge to achieving accurate visual understanding. To address this issue, we propose SECOND: Selective and Contrastive Decoding, a novel approach that enables LVLMs to effectively leverage multi-scale visual information in an object-centric manner, closely aligning with human visual perception. SECOND progressively selects and integrates multi-scale visual information, facilitating a more precise interpretation of images. By iteratively contrasting this visual information, SECOND significantly reduces perceptual hallucinations and outperforms existing methods across a wide range of benchmarks. Our theoretical analysis and experiments highlight the largely unexplored potential of multi-scale application in LVLMs, showing that prioritizing and contrasting across scales outperforms existing methods.
Paperid:1871
Authors:Thibaut Germain · Chrysoula Kosma · Laurent Oudre
Abstract:
Automatically extracting robust representations from large and complex time series data is becoming imperative for several real-world applications. Unfortunately, the potential of common neural network architectures in capturing invariant properties of time series remains relatively underexplored. For instance, convolutional layers often fail to capture underlying patterns in time series inputs that encompass strong deformations, such as trends. Indeed, invariances to some deformations may be critical for solving complex time series tasks, such as classification, while guaranteeing good generalization performance. To address these challenges, we mathematically formulate and technically design efficient and hard-coded invariant convolutions for specific group actions applicable to the case of time series. We construct these convolutions by considering specific sets of deformations commonly observed in time series, including scaling, offset shift, and trend. We further combine the proposed invariant convolutions with standard convolutions in single embedding layers, and we showcase the layer capacity to capture complex invariant time series properties in several scenarios.
Paperid:1872
Authors:Yuntao Wang · Yuxuan Li · Qingyuan Yang · Hu Ding
Abstract:
Wasserstein Barycenter (WB) is a fundamental geometric optimization problem in machine learning, whose objective is to find a representative probability measure that minimizes the sum of Wasserstein distances to given distributions. WB has a number of applications in various areas. However, WB may lead to unfair outcomes towards underrepresented groups in some applications (e.g., a "minority" distribution may be far away from the obtained WB under Wasserstein distance). To address this issue, we propose an alternative objective called "Wasserstein Ball Center (WBC)". Specifically, WBC is a distribution that encompasses all input distributions within the minimum Wasserstein distance, which can be formulated as a "min-max" optimization problem. We show that the WBC problem with fixed support is equivalent to solving a large-scale linear programming (LP) instance, which is quite different from the previously studied LP model for WB. By incorporating some novel observations on the induced normal equation, we propose an efficient algorithm that accelerates the interior point method by $O(\min(N^2m, Nm^2, m^4))$ times ("$N$" is the number of distributions and "$m$" is the support size). Finally, we conduct a set of experiments on both synthetic and real-world datasets, demonstrating the computational efficiency of our algorithm, and showing its ability to provide more fairness for input distributions.
Paperid:1873
Authors:Chenhao Sun · Yuhao Mao · Mark Müller · Martin Vechev
Abstract:
Randomized smoothing is a popular approach for providing certified robustness guarantees against adversarial attacks, and has become an active area of research. Over the past years, the average certified radius (ACR) has emerged as the most important metric for comparing methods and tracking progress in the field. However, in this work, for the first time we show that ACR is a poor metric for evaluating robustness guarantees provided by randomized smoothing. We theoretically prove not only that a trivial classifier can have arbitrarily large ACR, but also that ACR is much more sensitive to improvements on easy samples than on hard ones. Empirically, we confirm that existing training strategies, though improving ACR with different approaches, reduce the model's robustness on hard samples consistently. To strengthen our conclusion, we propose strategies, including explicitly discarding hard samples, reweighting the dataset with approximate certified radius, and extreme optimization for easy samples, to achieve state-of-the-art ACR, without training for robustness on the full data distribution. Overall, our results suggest that ACR has introduced a strong undesired bias to the field, and its application should be discontinued when evaluating randomized smoothing.
Paperid:1874
Authors:Yao Zhu · Zhenyuan LI · yangyang wei · Shouling Ji
Abstract:
Provenance-based intrusion detection systems (PIDSes) have emerged as a vital defense against complex and covert cyberattacks. Provenance graphs, notable for their scale, semantic depth, and dense connectivity, pose significant challenges for storage, representation, and analysis. These factors limit the effectiveness of machine learning models like Graph Neural Networks (GNNs) in processing and learning from these graphs. This paper introduces a learning-based method to efficiently embed and analyze large-scale provenance graphs. Our approach integrates dynamic graph processing with adaptive encoding, enabling compact embeddings that tackle out-of-vocabulary (OOV) elements and normality shifts in dynamic environments. We incorporate this baseline into a tag-propagation framework for real-time detection, demonstrating our method's accuracy and adaptability in anomaly path mining, and advancing the state-of-the-art in provenance graph handling and anomaly detection.
Paperid:1875
Authors:Xiaolong Xu · Yibo Zhou · Haolong Xiang · Xiaoyong Li · Xuyun Zhang · Lianyong Qi · Wanchun Dou
Abstract:
Document-level relation extraction (RE) aims to extract comprehensive correlations between entities and relations from documents. Most existing works conduct transfer learning on pre-trained language models (PLMs), which allows for richer contextual representation to improve the performance. However, such PLMs-based methods struggle to incorporate structural knowledge, such as entity-entity interactions. Moreover, current works struggle to infer the implicit relations between entities across different sentences, which results in poor prediction. To deal with the above issues, we propose a novel and effective framework, named DocKS-RAG, which introduces extra structural knowledge and semantic information to further enhance the performance of document-level RE. Specifically, we construct a Document-level Knowledge Graph from the observable documentation data to better capture the structural information between entities and relations. Then, a Sentence-level Semantic Retrieval-Augmented Generation mechanism is designed to consider the similarity in different sentences by retrieving the relevant contextual semantic information. Furthermore, we present a hybrid-prompt tuning method on large language models (LLMs) for specific document-level RE tasks. Finally, extensive experiments conducted on two benchmark datasets demonstrate that our proposed framework enhances all the metrics compared with state-of-the-art methods.
Paperid:1876
Authors:Chenghao Fan · zhenyi lu · Sichen Liu · Xiaoye Qu · Wei Wei · Chengfeng Gu · Yu Cheng
Abstract:
While Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning for Large Language Models (LLMs), its performance often falls short of Full Fine-Tuning (Full FT). Current methods optimize LoRA by initializing with static singular-value decomposition (SVD) subsets, leading to suboptimal leveraging of pre-trained knowledge. {Another path for improving LoRA is incorporating a Mixture-of-Experts (MoE) architecture.}{However, weight misalignment and complex gradient dynamics make it challenging to adopt SVD prior to the LoRA MoE architecture.} To mitigate these issues, we propose \underline{G}reat L\underline{o}R\underline{A} Mixture-of-Exper\underline{t} (GOAT), a framework that (1) adaptively integrates relevant priors using an SVD-structured MoE, and (2) aligns optimization with full fine-tuned MoE by deriving a theoretical scaling factor. We demonstrate that proper scaling, without modifying the architecture or training algorithms, boosts LoRA MoE's efficiency and performance. Experiments across 25 datasets, including natural language understanding, commonsense reasoning, image classification, and natural language generation, demonstrate GOAT's state-of-the-art performance, closing the gap with Full FT. \footnote{Our code is available at: \url{https://anonymous.4open.science/r/goat-827B}.}
Paperid:1877
Authors:Sidhanth Holalkere · David S Bindel · Silvia Sellán · Alexander Terenin
Abstract:
Poisson Surface Reconstruction is a widely used algorithm for reconstructing a surface from an oriented point cloud. To facilitate applications where only partial surface information is available, or scanning is performed sequentially, a recent line of work proposes to incorporate uncertainty into the reconstructed surface via Gaussian process models. The resulting algorithms first perform Gaussian process interpolation, then solve a set of volumetric partial differential equations globally in space, resulting in a computationally expensive two-stage procedure. In this work, we apply recently-developed techniques from geometric Gaussian processes to combine interpolation and surface reconstruction into a single stage, requiring only one linear solve per sample. The resulting reconstructed surface samples can be queried locally in space, without the use of problem-dependent volumetric meshes or grids. These capabilities enable one to (a) perform probabilistic collision detection locally around the region of interest, (b) perform ray casting without evaluating points not on the ray's trajectory, and (c) perform next-view planning on a per-slice basis. They also improve reconstruction quality, by not requiring one to approximate kernel matrix inverses with diagonal matrices as part of intermediate computations. Results show that our approach provides a cleaner, more-principled, and more-flexible stochastic surface reconstruction pipeline.
Paperid:1878
Authors:Vincent Plassier · Alexander Fishkov · Victor Dheur · Mohsen Guizani · Souhaib Ben Taieb · Maxim Panov · Eric Moulines
Abstract:
We present a new method for generating confidence sets within the split conformal prediction framework. Our method performs a trainable transformation of any given conformity score to improve conditional coverage while ensuring exact marginal coverage. The transformation is based on an estimate of the conditional quantile of conformity scores. The resulting method is particularly beneficial for constructing adaptive confidence sets in multi-output problems where standard conformal quantile regression approaches have limited applicability. We develop a theoretical bound that captures the influence of the accuracy of the quantile estimate on the approximate conditional validity, unlike classical bounds for conformal prediction methods that only offer marginal coverage. We experimentally show that our method is highly adaptive to the local data structure and outperforms existing methods in terms of conditional coverage, improving the reliability of statistical inference in various applications.
Paperid:1879
Authors:Zongmeng Zhang · Wengang Zhou · Jie Zhao · Houqiang Li
Abstract:
Despite the impressive capabilities of multimodal large language models (MLLMs) in vision-language tasks, they are prone to hallucinations in real-world scenarios. This paper investigates the hallucination phenomenon in MLLMs from the perspective of modality conflict. Unlike existing works focusing on the conflicts between model responses and inputs, we study the inherent conflicts in inputs from different modalities that place MLLMs in a dilemma and directly lead to hallucinations. We formally define the modality conflict and construct a dataset named Multimodal Modality Conflict (MMMC) to simulate this phenomenon in vision-language tasks. Three methods based on prompt engineering, supervised fine-tuning, and reinforcement learning are proposed to alleviate the hallucination caused by modality conflict. Extensive experiments are conducted on the MMMC dataset to analyze the merits and demerits of these methods. Our results show that the reinforcement learning method achieves the best performance in mitigating the hallucination under modality conflict, while the supervised fine-tuning method shows promising and stable performance. Our work sheds light on the unnoticed modality conflict that leads to hallucinations and provides more insights into the robustness of MLLMs.
Paperid:1880
Authors:Peter Sorrenson · Daniel Behrend-Uriarte · Christoph Schnörr · Ullrich Koethe
Abstract:
Density-based distances (DBDs) provide a principled approach to metric learning by defining distances in terms of the underlying data distribution. By employing a Riemannian metric that increases in regions of low probability density, shortest paths naturally follow the data manifold. Fermat distances, a specific type of DBD, have attractive properties, but existing estimators based on nearest neighbor graphs suffer from poor convergence due to inaccurate density estimates. Moreover, graph-based methods scale poorly to high dimensions, as the proposed geodesics are often insufficiently smooth. We address these challenges in two key ways. First, we learn densities using normalizing flows. Second, we refine geodesics through relaxation, guided by a learned score model. Additionally, we introduce a dimension-adapted Fermat distance that scales intuitively to high dimensions and improves numerical stability. Our work paves the way for the practical use of density-based distances, especially in high-dimensional spaces.
Paperid:1881
Authors:Xun Xian · Ganghua Wang · Xuan Bi · Rui Zhang · Jayanth Srinivasa · Ashish Kundu · Charles Fleming · Mingyi Hong · Jie Ding
Abstract:
Retrieval-Augmented Generation (RAG) has been empirically shown to enhance the performance of large language models (LLMs) in knowledge-intensive domains such as healthcare, finance, and legal contexts. Given a query, RAG retrieves relevant documents from a corpus and integrates them into the LLMs' generation process. In this study, we investigate the adversarial robustness of RAG, focusing specifically on examining the retrieval system. First, across 225 different setup combinations of corpus, retriever, query, and targeted information, we show that retrieval systems are vulnerable to universal poisoning attacks in medical Q&A. In such attacks, adversaries generate poisoned documents containing a broad spectrum of targeted information, such as personally identifiable information. When these poisoned documents are inserted into a corpus, they can be accurately retrieved by any user, as long as attacker-specified queries are used. To understand this vulnerability, we discovered that the deviation from the query's embedding to that of the poisoned document tends to follow a pattern in which the high similarity between the poisoned document and the query is retained, thereby enabling precise retrieval. Based on these findings, we develop a new detection-based defense to ensure the safe use of RAG. Through extensive experiments spanning various Q&A domains, we observed that our proposed method consistently achieves excellent detection rates in nearly all cases.
Paperid:1882
Authors:Yifan Wang · Hourun Li · Ling Yue · Zhiping Xiao · Jia Yang · Changling Zhou · Wei Ju · Ming Zhang · Xiao Luo
Abstract:
Graph neural networks (GNNs) have demonstrated superior performance in graph fairness learning, which aims to ensure that predictions are unbiased with respect to sensitive attributes. However, existing approaches usually assume that training and test data share the same distribution, which rarely holds in the real world. To address this challenge, we propose a novel approach named Dual Unbiased Expansion with Group-acquired Alignment (DANCE) for graph fairness learning under distribution shifts. The core idea of our DANCE is to generate challenging yet unbiased virtual graph data in both graph and hidden spaces, simulating distribution shifts from a data-centric view. Specifically, we introduce the unbiased Mixup in the hidden space, prioritizing minor groups to address the potential imbalance of sensitive attributes. Simultaneously, we conduct fairness-aware adversarial learning in the graph space to focus on challenging samples and improve model robustness. To further bridge the domain gap, we propose a group-acquired alignment objective that prioritizes negative pair groups with identical sensitive labels. Additionally, a representation disentanglement objective is adopted to decorrelate sensitive attributes and target representations for enhanced fairness. Extensive experiments on various benchmark datasets validate the effectiveness of the proposed DANCE compared to existing methods.
Paperid:1883
Authors:Maynara de Souza · Cleber Zanchettin
Abstract:
Data scarcity and quality issues remain significant barriers to developing robust predictive models in medical research. Traditional reliance on real-world data often leads to biased models with poor generalizability across diverse patient populations. Synthetic data generation has emerged as a promising solution, yet challenges related to these samples' representativeness and effective utilization persist. This paper introduces Profile2Gen, a novel data-centric framework designed to guide the generation and refinement of synthetic data, focusing on addressing hard-to-learn samples in regression tasks. We conducted approximately 18,000 experiments to validate its effectiveness across six medical datasets, utilizing seven state-of-the-art generative models. Results demonstrate that refined synthetic samples can reduce predictive errors and enhance model reliability. Additionally, we generalize the DataIQ framework to support regression tasks, enabling its application in broader contexts. Statistical analyses confirm that our approach achieves equal or superior performance compared to models trained exclusively on real data.
Paperid:1884
Authors:Yucheng Xie · Fu Feng · Ruixiao Shi · Jing Wang · Xin Geng · Yong Rui
Abstract:
Pre-trained models have become the preferred backbone due to the increasing complexity of model parameters. However, traditional pre-trained models often face deployment challenges due to their fixed sizes, and are prone to negative transfer when discrepancies arise between training tasks and target tasks. To address this, we propose \textbf{KIND}, a novel pre-training method designed to construct decomposable models. KIND integrates knowledge by incorporating Singular Value Decomposition (SVD) as a structural constraint, with each basic component represented as a combination of a column vector, singular value, and row vector from the $U$, $\Sigma$, and $V$ matrices. These components are categorized into \textbf{learngenes} for encapsulating class-agnostic knowledge and \textbf{tailors} for capturing class-specific knowledge, with knowledge diversion facilitated by a class gate mechanism during training. Extensive experiments demonstrate that models pre-trained with KIND can be decomposed into learngenes and tailors, which can be adaptively recombined for diverse resource-constrained deployments. Moreover, for tasks with large domain shifts, transferring only learngenes with task-agnostic knowledge, when combined with randomly initialized tailors, effectively mitigates domain shifts.
Paperid:1885
Authors:Jiale Fu · Yuchu Jiang · Junkai Chen · Jiaming Fan · Xin Geng · Xu Yang
Abstract:
Ensemble methods enhance Large Language Models (LLMs) by combining multiple models but suffer from high computational costs. In this paper, we introduce Speculative Ensemble (SE), a novel framework that accelerates LLM ensembles without sacrificing performance, inspired by Speculative Decoding—where a small proposal model generates tokens sequentially, and a larger target model verifies them in parallel. Our approach builds on two key insights: (1) the verification distribution can be the ensemble distribution of both the proposal and target models, and (2) alternating each model as the proposer and verifier can further enhance efficiency. We generalize this method to ensembles with n models and theoretically prove that SE is never slower than a standard ensemble, typically achieving faster speed. Extensive experiments demonstrate speed improvements of 1.11x–2.23x over standard ensemble techniques without compromising generation quality. Our code is available at https://anonymous.4open.science/r/SpeculativeEnsemble/.
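The verification rule this framework builds on is compact enough to sketch. Below is a minimal numpy illustration of the first insight, verifying a drafted token against the ensemble distribution; the function names, toy vocabulary, and distributions are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def ensemble_dist(p_a, p_b, w=0.5):
    """Token-level ensemble of two models' next-token distributions."""
    return w * np.asarray(p_a) + (1 - w) * np.asarray(p_b)

def accept_prob(p_proposal, p_verify, token):
    """Speculative-decoding acceptance rule: keep a drafted token with
    probability min(1, p_verify[token] / p_proposal[token])."""
    return min(1.0, p_verify[token] / p_proposal[token])

# Toy 4-token vocabulary: the small model drafts from p_small, and the
# ensemble of both models acts as the verification distribution.
p_small = np.array([0.7, 0.1, 0.1, 0.1])
p_large = np.array([0.4, 0.3, 0.2, 0.1])
p_ens = ensemble_dist(p_small, p_large)
print(accept_prob(p_small, p_ens, token=0))  # min(1, 0.55 / 0.7)
```

Tokens the proposal model over-weights relative to the ensemble are sometimes rejected and resampled, so the accepted stream follows the ensemble distribution exactly.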
Paperid:1886
Authors:Hengrui Lou · Zunlei Feng · Jinsong Geng · Erteng Liu · Jie Lei · Lechao Cheng · Jie Song · Mingli Song · Yijun Bei
Abstract:
With the rise of AIGC technologies, particularly diffusion models, highly realistic fake images that can deceive human visual perception have become feasible. Consequently, various forgery detection methods have emerged. However, existing methods treat the generation process of fake images as either a black-box or an auxiliary tool, offering limited insights into its underlying mechanisms. In this paper, we propose Spatio-Temporal Distribution Fitting Deviation (STD-FD) for AIGC forgery detection, which explores the generative process in detail. By decomposing and reconstructing data within generative diffusion models, initial experiments reveal temporal distribution fitting deviations during the image reconstruction process. These deviations are captured through reconstruction noise maps for each spatial semantic unit, derived via a super-resolution algorithm. Critical discriminative patterns, termed DFactors, are identified through statistical modeling of these deviations. Extensive experiments show that STD-FD effectively captures distribution patterns in AIGC-generated data, demonstrating strong robustness and generalizability while outperforming state-of-the-art (SOTA) methods on major datasets. The source code is available at this link.
Paperid:1887
Authors:Zhengpeng Xie · Qiang Zhang · Fan Yang · Marco Hutter · Renjing Xu
Abstract:
Model-free reinforcement learning algorithms have seen remarkable progress, but key challenges remain. Trust Region Policy Optimization (TRPO) is known for ensuring monotonic policy improvement through conservative updates within a trust region, backed by strong theoretical guarantees. However, its reliance on complex second-order optimization limits its practical efficiency. Proximal Policy Optimization (PPO) addresses this by simplifying TRPO's approach using ratio clipping, improving efficiency but sacrificing some theoretical robustness. This raises a natural question: Can we combine the strengths of both methods? In this paper, we introduce Simple Policy Optimization (SPO), a novel unconstrained first-order algorithm. By slightly modifying the policy loss used in PPO, SPO can achieve the best of both worlds. Our new objective improves upon ratio clipping, offering stronger theoretical properties and better constraining the probability ratio within the trust region. Empirical results demonstrate that SPO outperforms PPO with a simple implementation, particularly for training large, complex network architectures end-to-end.
Paperid:1888
Authors:Boaz Taitler · Omer Ben-Porat
Abstract:
The rise of Generative AI (GenAI) has significantly impacted human-based forums like Stack Overflow, which are essential for generating high-quality data. This creates a negative feedback loop, hindering the development of GenAI systems, which rely on such data to provide accurate responses. In this paper, we provide a possible remedy: a novel strategy we call selective response. Selective response implies that GenAI could strategically provide inaccurate (or conservative) responses to queries involving emerging topics and novel technologies, thereby driving users to use human-based forums like Stack Overflow. We show that selective response can potentially have a compounding effect on the data generation process, increasing both GenAI's revenue and user welfare in the long term. From an algorithmic perspective, we propose an approximately optimal approach to maximize GenAI's revenue under social welfare constraints. From a regulatory perspective, we derive necessary and sufficient conditions under which selective response improves welfare.
Paperid:1889
Authors:Blake Bordelon · Cengiz Pehlevan
Abstract:
We theoretically characterize gradient descent dynamics in deep linear networks trained at large width from random initialization and on large quantities of random data. Our theory captures the ``wider is better" effect of mean-field/maximum-update parameterized networks as well as hyperparameter transfer effects, which can be contrasted with the neural-tangent parameterization where optimal learning rates shift with model width. We provide asymptotic descriptions of both non-residual and residual neural networks, the latter of which enables an infinite depth limit when branches are scaled as $1/\sqrt{\text{depth}}$. We also compare training with one-pass stochastic gradient descent to the dynamics when training data are repeated at each iteration. Lastly, we show that this model recovers the accelerated power law training dynamics for power law structured data in the rich regime observed in recent works.
Paperid:1890
Authors:Daniil Laptev · Nikita Balagansky · Yaroslav Aksenov · Daniil Gavrilov
Abstract:
We introduce a new approach to systematically map features discovered by sparse autoencoders across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.
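A data-free cosine-similarity matching of SAE features, as described at a high level here, can be sketched by comparing decoder directions alone; the array shapes and the permutation sanity check below are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np

def match_features(dec_a, dec_b):
    """Match each feature (decoder column) of SAE A to its most
    cosine-similar feature of SAE B, with no activation data required."""
    a = dec_a / np.linalg.norm(dec_a, axis=0, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=0, keepdims=True)
    sim = a.T @ b                       # (n_features_a, n_features_b)
    return sim.argmax(axis=1), sim.max(axis=1)

# Sanity check: if layer B's dictionary is a permutation of layer A's,
# matching should recover the inverse permutation exactly.
rng = np.random.default_rng(0)
dec_a = rng.normal(size=(32, 6))        # model dim 32, 6 features
perm = np.array([2, 0, 1, 5, 4, 3])
dec_b = dec_a[:, perm]
matches, sims = match_features(dec_a, dec_b)
print(matches, sims.min())
```

Chaining such matches layer by layer yields the kind of feature flow graph the abstract describes.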
Paperid:1891
Authors:Dachuan Shi · Yonggan Fu · Xiangchi Yuan · Zhongzhi Yu · Haoran You · Sixu Li · Xin Dong · Jan Kautz · Pavlo Molchanov · Yingyan (Celine) Lin
Abstract:
Recent advancements in Large Language Models (LLMs) have spurred interest in numerous applications requiring robust long-range capabilities, essential for processing extensive input contexts and continuously generating extended outputs. As sequence lengths increase, the number of Key-Value (KV) pairs in LLMs escalates, creating a significant efficiency bottleneck. In this paper, we propose a new KV cache optimization paradigm called LaCache, a training-free method for efficient and accurate generative inference of LLMs. LaCache enables LLMs to simultaneously satisfy both critical requirements for long-range LLMs: robust long-range capabilities and continuous generation without running out of memory (OOM). Specifically, LaCache integrates two key innovations: (1) a ladder-shaped KV cache pattern that stores KV pairs not only sequentially (left-to-right within each layer) but also across layers (from shallow to deep), providing an extended span for capturing long-range dependencies under a fixed storage budget, thereby boosting long-range capabilities; and (2) an iterative compaction mechanism that progressively compresses older caches, freeing up space for new tokens within a fixed cache size. This token distance-based dynamic compression enables more effective continuous generation under constrained cache budgets, facilitating optimal cache utilization. Experiments across various tasks, benchmarks, and LLM models consistently validate LaCache's effectiveness in enhancing LLMs' long-range capabilities. The code will be released upon acceptance.
Paperid:1892
Authors:Giovanni Catania · Aurélien Decelle · Cyril Furtlehner · Beatriz Seoane
Abstract:
We investigate the impact of limited data on training pairwise energy-based models for inverse problems aimed at identifying interaction networks. Utilizing the Gaussian model as a testbed, we dissect training trajectories across the eigenbasis of the coupling matrix, exploiting the independent evolution of eigenmodes and revealing that the learning timescales are tied to the spectral decomposition of the empirical covariance matrix. We see that optimal points for early stopping arise from the interplay between these timescales and the initial conditions of training. Moreover, we show that finite data corrections can be accurately modeled through asymptotic random matrix theory calculations and provide the counterpart of generalized cross-validation in the energy-based model context. Our analytical framework extends to binary-variable maximum-entropy pairwise models with minimal variations. These findings offer strategies to control overfitting in discrete-variable models through empirical shrinkage corrections, improving the management of overfitting in energy-based generative models.
Paperid:1893
Authors:Xin Huang · Eric M. Wolff · Paul Vernaza · Tung Phan-Minh · Hongge Chen · David Hayden · Mark Edmonds · Brian Pierce · Xinxin Chen · Pratik Elias Jacob · Xiaobai Chen · Chingiz Tairbekov · Pratik Agarwal · Tianshi Gao · Yuning Chai · Siddhartha Srinivasa
Abstract:
We present DriveGPT, a scalable behavior model for autonomous driving. We model driving as a sequential decision-making task, and learn a transformer model to predict future agent states as tokens in an autoregressive fashion. We scale up our model parameters and training data by multiple orders of magnitude, enabling us to explore the scaling properties in terms of dataset size, model parameters, and compute. We evaluate DriveGPT across different scales in a planning task, through both quantitative metrics and qualitative examples, including closed-loop driving in complex real-world scenarios. In a separate prediction task, DriveGPT outperforms state-of-the-art baselines and exhibits improved performance by pretraining on a large-scale dataset, further validating the benefits of data scaling.
Paperid:1894
Authors:Yuqi Luo · Chenyang Song · Xu Han · Yingfa Chen · Chaojun Xiao · Xiaojun Meng · Liqun Deng · Jiansheng Wei · Zhiyuan Liu · Maosong Sun
Abstract:
Activation sparsity denotes the existence of substantial weakly contributing neurons within feed-forward networks of large language models (LLMs), providing wide potential benefits such as computation acceleration. However, existing works lack thorough quantitative studies on this useful property, in terms of both its measurement and influential factors. In this paper, we address three underexplored research questions: (1) How can activation sparsity be measured more accurately? (2) How is activation sparsity affected by the model architecture and training process? (3) How can we build a more sparsely activated and efficient LLM? Specifically, we develop a generalizable and performance-friendly metric, named CETT-PPL-1\%, to measure activation sparsity. Based on CETT-PPL-1\%, we quantitatively study the influence of various factors and observe several important phenomena, such as the convergent power-law relationship between sparsity and training data amount, the higher competence of ReLU activation than mainstream SiLU activation, the potential sparsity merit of a small width-depth ratio, and the scale insensitivity of activation sparsity. Finally, we provide implications for building sparse and effective LLMs, and demonstrate the reliability of our findings by training a 2.4B model with a sparsity ratio of 93.52\%, showing 4.1$\times$ speedup compared with its dense version.
Paperid:1895
Authors:Gen Li · Yuchen Jiao
Abstract:
Diffusion models have emerged as a powerful framework for generative modeling, with guidance techniques playing a crucial role in enhancing sample quality. Despite their empirical success, a comprehensive theoretical understanding of the guidance effect remains limited. Existing studies focus only on special cases, where the distribution conditioned on each class is either isotropic Gaussian or supported on a one-dimensional interval with some extra conditions. How to analyze the guidance effect beyond these case studies remains an open question. Towards closing this gap, we make an attempt to analyze diffusion guidance under general data distributions. Rather than demonstrating uniform sample quality improvement, which does not hold in some distributions, we prove that guidance can improve overall sample quality, in the sense that the ratio of bad samples (measured by the classifier probability) decreases in the presence of guidance. This aligns with the motivation of introducing guidance.
Paperid:1896
Authors:Qingyuan Wu · Simon Zhan · Yuhui Wang · Yixuan Wang · Chung-Wei Lin · Chen Lv · Qi Zhu · Jürgen Schmidhuber · Chao Huang
Abstract:
Reinforcement learning (RL) with delays is challenging as sensory perceptions lag behind the actual events: the RL agent needs to estimate the real state of its environment based on past observations. State-of-the-art (SOTA) methods typically employ recursive, step-by-step forecasting of states. This can cause the accumulation of compounding errors. To tackle this problem, our novel belief estimation method named Directly Forecasting Belief Transformer (DFBT) directly forecasts states from observations without incrementally estimating intermediate states step-by-step. We theoretically demonstrate that DFBT greatly reduces compounding errors of existing recursively forecasting methods, yielding stronger performance guarantees. In experiments with D4RL offline datasets, DFBT reduces compounding errors with remarkable prediction accuracy. DFBT's capability to forecast state sequences also facilitates multi-step bootstrapping, thus greatly improving learning efficiency. On the MuJoCo benchmark, our DFBT-based method substantially outperforms SOTA baselines.
Paperid:1897
Authors:Noah Golowich · Samy Jelassi · David Brandfonbrener · Sham Kakade · Eran Malach
Abstract:
Training large language models to predict beyond their training context lengths has drawn much attention in recent years, yet the principles driving such behavior of length generalization remain underexplored. We propose a new theoretical framework to study length generalization for the next-token prediction task, as performed by decoder-only transformers. Conceptually, we show that length generalization occurs as long as each predicted token depends on a small (fixed) number of previous tokens. We formalize such tasks via a notion we call k-sparse planted correlation distributions, and show that an idealized model of transformers with generalized attention heads successfully length-generalizes on such tasks. As a bonus, our theoretical model allows us to provide justifications for techniques to modify positional embeddings which have been introduced to improve length generalization, such as position coupling. We support our theoretical results with experiments on synthetic tasks and natural language, which confirm that a key factor driving length generalization is indeed a ``sparse'' dependency structure of each token on the previous ones. Further, inspired by our theory, we introduce Predictive Position Coupling, a generalization of position coupling which trains the transformer to predict the position IDs used in a positional coupling approach. Predictive Position Coupling thereby allows us to broaden the array of tasks to which Position Coupling can successfully be applied to achieve length generalization.
Paperid:1898
Authors:Pierre Boyeau · Anastasios Angelopoulos · Tianle Li · Nir Yosef · Jitendra Malik · Michael Jordan
Abstract:
The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required for this purpose in a process called autoevaluation. We suggest efficient and statistically principled algorithms for this purpose that improve sample efficiency while remaining unbiased.
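One standard instance of such a debiased autoevaluation estimator is the prediction-powered mean: average the AI labels on the large unlabeled pool, then add a rectifier measured on the human-labeled subset. The sketch below is a generic illustration of that idea, not the paper's exact algorithm.

```python
import numpy as np

def ppi_mean(ai_unlabeled, ai_labeled, human_labeled):
    """Prediction-powered mean estimate: the AI-label average on the big
    unlabeled pool, debiased by the gap observed where human labels exist."""
    rectifier = np.mean(human_labeled) - np.mean(ai_labeled)
    return np.mean(ai_unlabeled) + rectifier

# Tiny worked example: the AI grader systematically under-scores by ~0.10.
human = np.array([1.0, 0.0, 1.0, 1.0])          # human mean 0.75
ai_on_labeled = np.array([0.9, 0.2, 0.7, 0.8])  # AI mean 0.65 on same items
ai_on_pool = np.full(1000, 0.6)                 # AI labels on unlabeled pool
print(ppi_mean(ai_on_pool, ai_on_labeled, human))  # 0.6 + 0.10 = 0.70
```

The rectifier removes the AI grader's systematic bias while the large pool keeps the variance low, which is exactly the sample-efficiency-without-bias trade the abstract describes.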
Paperid:1899
Authors:Mathilde Papillon · Guillermo Bernardez · Claudio Battiloro · Nina Miolane
Abstract:
Graph Neural Networks (GNNs) excel in learning from relational datasets, processing node and edge features in a way that preserves the symmetries of the graph domain. However, many complex systems---such as biological or social networks---involve multiway complex interactions that are more naturally represented by higher-order topological domains. The emerging field of Topological Deep Learning (TDL) aims to accommodate and leverage these higher-order structures. Combinatorial Complex Neural Networks (CCNNs), fairly general TDL models, have been shown to be more expressive and better performing than GNNs. However, differently from the GNN ecosystem, TDL lacks a principled and standardized framework for easily defining new architectures, restricting its accessibility and applicability. To address this issue, we introduce Generalized CCNNs (GCCNs), a novel simple yet powerful family of TDL models that can be used to systematically transform any (graph) neural network into its TDL counterpart. We prove that GCCNs generalize and subsume CCNNs, while extensive experiments on a diverse class of GCCNs show that these architectures consistently match or outperform CCNNs, often with less model complexity. In an effort to accelerate and democratize TDL, we introduce TopoTune, a lightweight software for defining, building, and training GCCNs with unprecedented flexibility and ease.
Paperid:1900
Authors:Petar Veličković · Christos Perivolaropoulos · Federico Barbero · Razvan Pascanu
Abstract:
A key property of reasoning systems is the ability to make sharp decisions on their input data. For contemporary AI systems, a key carrier of sharp behaviour is the softmax function, with its capability to perform differentiable query-key lookups. It is a common belief that the predictive power of networks leveraging softmax arises from "circuits" which sharply perform certain kinds of computations consistently across many diverse inputs. However, for these circuits to be robust, they would need to generalise well to arbitrary valid inputs. In this paper, we dispel this myth: even for tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time. We attribute this to a fundamental limitation of the softmax function to robustly approximate sharp functions with increasing problem size, prove this phenomenon theoretically, and propose adaptive temperature as an ad-hoc technique for improving the sharpness of softmax at inference time.
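The dispersion effect is easy to reproduce numerically: fix the logit gap between the maximum item and the rest, grow the number of items, and watch the winner's softmax weight decay; lowering the inference-time temperature (here an arbitrary fixed value standing in for the paper's adaptive scheme) re-sharpens it.

```python
import numpy as np

def softmax(z, temperature=1.0):
    z = np.asarray(z, dtype=float) / temperature
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def max_item_weight(n, gap=2.0, temperature=1.0):
    """Softmax weight on the single 'maximum' item among n items whose
    logit exceeds all the others by a fixed gap."""
    logits = np.zeros(n)
    logits[0] = gap
    return softmax(logits, temperature)[0]

for n in (8, 64, 512):
    print(n, max_item_weight(n))         # weight disperses as n grows
print(max_item_weight(512, temperature=0.25))  # temperature re-sharpens
```

With a fixed gap the winner's weight behaves like e^gap / (e^gap + n - 1), vanishing as n grows, which is the dispersion the paper formalizes.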
Paperid:1901
Authors:Hantao Lou · Changye Li · Jiaming Ji · Yaodong Yang
Abstract:
With the integration of image modality, the semantic space of multimodal large language models (MLLMs) is more complex than that of text-only models, making their interpretability more challenging and their alignment less stable, particularly susceptible to low-quality data, which can lead to inconsistencies between modalities, hallucinations, and biased outputs. As a result, developing interpretability methods for MLLMs is crucial for improving alignment quality and efficiency. In text-only LLMs, Sparse Autoencoders (SAEs) have gained attention for their ability to interpret latent representations. However, extending SAEs to multimodal settings presents new challenges due to modality fusion and the difficulty of isolating cross-modal representations. To address these challenges, we introduce SAE-V, a mechanistic interpretability framework that extends the SAE paradigm to MLLMs. By identifying and analyzing interpretable features along with their corresponding data, SAE-V enables fine-grained interpretation of both model behavior and data quality, facilitating a deeper understanding of cross-modal interactions and alignment dynamics. Moreover, by utilizing cross-modal feature weighting, SAE-V provides an intrinsic data filtering mechanism to enhance model alignment without requiring additional models. Specifically, when applied to the alignment process of MLLMs, SAE-V-based data filtering methods could achieve more than 110% performance with less than 50% of the data. Our results highlight SAE-V’s ability to enhance interpretability and alignment in MLLMs, providing insights into their internal mechanisms.
Paperid:1902
Authors:Bo-Han Lai · Pin-Han Huang · Bo-Han Kung · Shang-Tse Chen
Abstract:
Lipschitz neural networks are well-known for providing certified robustness in deep learning. In this paper, we present a novel, efficient Block Reflector Orthogonal (BRO) layer that enhances the capability of orthogonal layers for constructing more expressive Lipschitz neural architectures. In addition, by theoretically analyzing the nature of Lipschitz neural networks, we introduce a new loss function that employs an annealing mechanism to increase the margin for most data points. This enables Lipschitz models to provide better certified robustness. By employing our BRO layer and loss function, we design BRONet — a simple yet effective Lipschitz neural network that achieves state-of-the-art certified robustness. Extensive experiments and empirical analysis on CIFAR-10/100, Tiny-ImageNet, and ImageNet validate that our method outperforms existing baselines.
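For intuition on why reflector-style layers are 1-Lipschitz by construction, the classical block Householder reflector is a useful reference point; the sketch below is that textbook construction, offered as background rather than the paper's exact BRO parameterization.

```python
import numpy as np

def block_reflector(v):
    """Block Householder reflector I - 2 V (V^T V)^{-1} V^T, which is
    orthogonal (hence 1-Lipschitz) for any full-column-rank V."""
    n = v.shape[0]
    return np.eye(n) - 2.0 * v @ np.linalg.solve(v.T @ v, v.T)

rng = np.random.default_rng(1)
v = rng.normal(size=(16, 4))   # 4 reflection directions in 16 dimensions
w = block_reflector(v)
print(np.allclose(w.T @ w, np.eye(16)))  # orthogonal, so norm-preserving
```

Because W^T W = I exactly, a linear layer built this way preserves norms regardless of the parameter matrix V, which is what makes certified Lipschitz bounds compose through the network.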
Paperid:1903
Authors:Shreyas Kadekodi · Hayden McTavish · Berk Ustun
Abstract:
Many applications in learning and decision-making rely on procedures to aggregate ordinal preferences -- whether for voting, search, or alignment. In such tasks, individuals express preferences over a set of items by voting, rating, or making pairwise comparisons. We then summarize their collective preferences by constructing a ranking. Standard methods for preference aggregation are designed to arbitrate individual disagreements to rank items in a way that is faithful and fair. In this paper, we introduce a paradigm for *selective aggregation*, where we can either abstain from comparison or arbitrate dissent. Given a dataset of individual preferences, we summarize collective preferences as a *selective ranking*—a *partial order* that allows comparisons only between items on which at least $1 - \tau$% of individuals agree. We develop fast algorithms to construct selective rankings that achieve all possible trade-offs between comparability and disagreement, paired with practical guarantees to ensure safety and reliability. We conduct extensive experiments to benchmark our approach on real-world datasets for ranking and learning. Our results demonstrate how selective rankings promote transparency, robustness, and fairness by revealing disagreement and abstaining from arbitration.
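The abstention rule at the heart of selective aggregation can be sketched directly: keep a pairwise comparison only when enough voters agree on it. The input format and function below are illustrative assumptions, not the paper's algorithms (which also come with efficiency and safety guarantees).

```python
def selective_order(votes, tau):
    """Build a partial order keeping only pairs on which at least a
    (1 - tau) fraction of voters agree. votes[(a, b)] counts voters
    preferring a over b (an illustrative input format)."""
    items = {x for pair in votes for x in pair}
    order = set()
    for a in items:
        for b in items:
            if a == b:
                continue
            wins, losses = votes.get((a, b), 0), votes.get((b, a), 0)
            total = wins + losses
            if total and wins / total >= 1 - tau:
                order.add((a, b))        # a is ranked above b
    return order

# 10 voters compare three items; (x, y) -> number preferring x to y
votes = {("A", "B"): 9, ("B", "A"): 1,
         ("B", "C"): 6, ("C", "B"): 4}
print(selective_order(votes, tau=0.2))   # abstains on the contested B vs C pair
```

Sweeping tau from 0 to 1 traces exactly the comparability-versus-disagreement trade-off the abstract describes: small tau yields few but uncontroversial comparisons, large tau approaches a total order.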
Paperid:1904
Authors:Li Shen · Anke Tang · Yong Luo · Tao Sun · Han Hu · Xiaochun Cao
Abstract:
Pruning is a widely used technique for compressing large neural networks that eliminates weights that have minimal impact on the model's performance. Current pruning methods, exemplified by magnitude pruning, assign an importance score to each weight based on its magnitude and remove weights with scores below a certain threshold. Nonetheless, these methods often create a gap between the original dense and the pruned sparse model, potentially impairing performance. Especially when the sparsity ratio is high, the gap becomes more pronounced. To mitigate this issue, we introduce a method to bridge the gap left by pruning by utilizing a low-rank approximation of the difference between the dense and sparse matrices. Our method entails the iterative refinement of the sparse weight matrix augmented by a low-rank adjustment. This technique captures and retains the essential information often lost during pruning, thereby improving the performance of the pruned model. Furthermore, we offer a comprehensive theoretical analysis of our approach, emphasizing its convergence properties and establishing a solid basis for its efficacy. Experimental results on LLaMa models validate its effectiveness on large language models across various pruning techniques and sparsity levels. Our method shows significant improvements: at 50\% sparsity, it reduces perplexity by 53.9\% compared to conventional magnitude pruning on LLaMa-7B. Furthermore, to achieve a specific performance target, our approach enables an 8.6\% reduction in model parameters while maintaining a sparsity ratio of about 50\%.
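A single step of the core idea -- magnitude pruning followed by a low-rank SVD correction of the dense-sparse gap -- can be sketched in numpy. This is one non-iterative step under assumed shapes, not the paper's full iterative refinement.

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the given fraction of smallest-magnitude entries of w."""
    k = int(sparsity * w.size)
    thresh = np.sort(np.abs(w), axis=None)[k]
    return np.where(np.abs(w) >= thresh, w, 0.0)

def low_rank_correction(dense, sparse, rank):
    """Rank-r truncated-SVD approximation of the dense-sparse difference."""
    u, s, vt = np.linalg.svd(dense - sparse, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
w_sparse = magnitude_prune(w, sparsity=0.5)
corr = low_rank_correction(w, w_sparse, rank=8)

gap = np.linalg.norm(w - w_sparse)
gap_corrected = np.linalg.norm(w - (w_sparse + corr))
print(gap_corrected < gap)  # the low-rank term shrinks the pruning gap
```

The sparse matrix plus a rank-8 adjustment stays cheap to store and multiply, yet recovers part of the information magnitude pruning discards, which is the trade-off the method exploits.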
Paperid:1905
Authors:Brahma Pavse · Yudong Chen · Qiaomin Xie · Josiah Hanna
Abstract:
In reinforcement learning, offline value function learning is the procedure of using an offline dataset to estimate the expected discounted return from each state when taking actions according to a fixed target policy. The stability of this procedure, i.e., whether it converges to its fixed point, critically depends on the representations of the state-action pairs. Poorly learned representations can make value function learning unstable, or even divergent. Therefore, it is critical to stabilize value function learning by explicitly shaping the state-action representations. Recently, the class of bisimulation-based algorithms has shown promise in shaping representations for control. However, it is still unclear if this class of methods can \emph{stabilize} value function learning. In this work, we investigate this question and answer it affirmatively. We introduce a bisimulation-based algorithm called kernel representations for offline policy evaluation (\textsc{krope}). \textsc{krope} uses a kernel to shape state-action representations such that state-action pairs that have similar immediate rewards and lead to similar next state-action pairs under the target policy also have similar representations. We show that \textsc{krope}: 1) learns stable representations and 2) leads to lower value error than baselines. Our analysis provides new theoretical insight into the stability properties of bisimulation-based methods and suggests that practitioners can use these methods for stable and accurate evaluation of offline reinforcement learning agents.
Paperid:1906
Authors:Yefei He · Feng Chen · Yuanyu He · Shaoxuan He · Hong Zhou · Kaipeng Zhang · Bohan Zhuang
Abstract:
In this paper, we propose ZipAR, a training-free, plug-and-play parallel decoding framework for accelerating autoregressive (AR) visual generation. The motivation stems from the observation that images exhibit local structures, and spatially distant regions tend to have minimal interdependence. Given a partially decoded set of visual tokens, in addition to the original next-token prediction scheme in the row dimension, the tokens corresponding to spatially adjacent regions in the column dimension can be decoded in parallel. To ensure alignment with the contextual requirements of each token, we employ an adaptive local window assignment scheme with rejection sampling analogous to speculative decoding. By decoding multiple tokens in a single forward pass, the number of forward passes required to generate an image is significantly reduced, resulting in a substantial improvement in generation efficiency. Experiments demonstrate that ZipAR can reduce the number of model forward passes by up to 91% on the Emu3-Gen model without requiring any additional retraining.
Paperid:1907
Authors:Fabio Feser · Marina Evangelou
Abstract:
The sparse-group lasso performs both variable and group selection simultaneously, combining the strengths of the lasso and group lasso. It has found widespread use in genetics, a field that regularly involves the analysis of high-dimensional data, due to its sparse-group penalty, which allows it to utilize grouping information. However, the sparse-group lasso can be computationally expensive, due to the added shrinkage complexity, and its additional hyperparameter that needs tuning. This paper presents a novel feature reduction method, Dual Feature Reduction (DFR), that uses strong screening rules for the sparse-group lasso and the adaptive sparse-group lasso to reduce their input space before optimization, without affecting solution optimality. DFR applies two layers of screening through the application of dual norms and subdifferentials. Through synthetic and real data studies, it is shown that DFR drastically reduces the computational cost under many different scenarios.
Paperid:1908
Authors:Payman Behnam · Yaosheng Fu · Ritchie Zhao · Po-An Tsai · Zhiding Yu · Alexey Tumanov
Abstract:
Transformer-based Large Language Models rely critically on the KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and capacity as decoding progresses. To address this challenge, we present RocketKV, a training-free KV cache compression strategy designed specifically to reduce both memory bandwidth and capacity demand of KV cache during the decode phase. RocketKV contains two consecutive stages. In the first stage, it performs coarse-grain KV cache eviction on the input sequence tokens with SnapKV++, a method that improves upon SnapKV by introducing adaptive pooling size and full compatibility with grouped-query attention. In the second stage, it adopts a hybrid attention method to conduct fine-grain top-k sparse attention, approximating the attention scores by leveraging both head and sequence dimensional reductions. Combining these two stages, RocketKV achieves significant KV cache fetching bandwidth and storage savings while maintaining comparable accuracy to full KV cache attention. We show that RocketKV provides end-to-end speedup by up to 3× as well as peak memory reduction by up to 31% in the decode phase on an NVIDIA H100 GPU compared to full KV cache baseline, while achieving negligible accuracy loss on a variety of long-context tasks.
Paperid:1909
Authors:Yi He · Yiming Yang · Xiaoyuan Cheng · Hai Wang · Xiao Xue · Boli Chen · Yukun Hu
Abstract:
Generating long-term trajectories of dissipative chaotic systems autoregressively is a highly challenging task. The inherent positive Lyapunov exponents amplify prediction errors over time. Many chaotic systems possess a crucial property, ergodicity on their attractors, which makes long-term prediction possible. State-of-the-art methods address ergodicity by preserving statistical properties using optimal transport techniques. However, these methods face scalability challenges due to the curse of dimensionality when matching distributions. To overcome this bottleneck, we propose a scalable transformer-based framework capable of stably generating long-term high-dimensional and high-resolution chaotic dynamics while preserving ergodicity. Our method is grounded in a physical perspective, revisiting the Von Neumann mean ergodic theorem to ensure the preservation of long-term statistics in the $\mathcal{L}^2$ space. We introduce novel modifications to the attention mechanism, making the transformer architecture well-suited for learning large-scale chaotic systems. Compared to operator-based and transformer-based methods, our model achieves better performance across five metrics, from short-term prediction accuracy to long-term statistics. In addition to our methodological contributions, we introduce a new chaotic system benchmark: a machine learning dataset of 140$k$ snapshots of turbulent channel flow along with various evaluation metrics for both short- and long-term performance, which is well-suited for machine learning research on chaotic systems.
Paperid:1910
Authors:Liangliang Shi · Shenhui Zhang · Xingbo Du · Nianzu Yang · Junchi Yan
Abstract:
Global routing (GR) is a fundamental yet NP-hard task in modern chip design and various learning techniques have been utilized to solve it. However, a persistent challenge is the inherent lack of a mechanism to guarantee the routing connectivity in the network's prediction results, necessitating post-processing search or reinforcement learning to enforce the connectivity. In this paper, we fulfill a neural GR solver called DSBRouter, which leverages the Diffusion Schr\"{o}dinger Bridge (DSB) model for GR. During training, unlike previous works that learn the mapping from noise to routes, we establish a bridge between the initial pins and the routing via DSB, which learns the forward and backward mapping between them. For inference, based on the evaluation metric (e.g. low overflow), we further introduce a sampling scheme with evaluation-based guidance to enhance the routing predictions. Note that DSBRouter is an end-to-end model that does not require a post-phase algorithm to ensure connectivity. Empirical results show that DSBRouter achieves SOTA performance on the overflow reduction in ISPD98 and part of ISPD07. In some cases, DSBRouter can even generate routes with an overflow of 0.
Paperid:1911
Authors:Erchi Wang · Yu-Xiang Wang · Yuqing Zhu
Abstract:
This paper studies the problem of differentially private empirical risk minimization (DP-ERM) for binary linear classification. We obtain an efficient $(\varepsilon,\delta)$-DP algorithm with an empirical zero-one risk bound of $\tilde{O}\left(\frac{1}{\gamma^2\varepsilon n} + \frac{|S_{\mathrm{out}}|}{\gamma n}\right)$ where $n$ is the number of data points, $S_{\mathrm{out}}$ is an arbitrary subset of data one can remove and $\gamma$ is the margin of linear separation of the remaining data points (after $S_{\mathrm{out}}$ is removed). Here, $\tilde{O}(\cdot)$ hides only logarithmic terms. In the agnostic case, we improve the existing results when the number of outliers is small. Our algorithm is highly adaptive because it does not require knowing the margin parameter $\gamma$ or outlier subset $S_{\mathrm{out}}$. We also derive a utility bound for the advanced private hyperparameter tuning algorithm.
Paperid:1912
Authors:Jiazheng Li · Lu Yu · Qing Cui · Zhiqiang Zhang · JUN ZHOU · Yanfang Ye · Chuxu Zhang
Abstract:
High-quality data plays a critical role in the pretraining and fine-tuning of large language models (LLMs), even determining their performance ceiling to some degree. Consequently, numerous data selection methods have been proposed to identify subsets of data that can effectively and efficiently enhance model performance. However, most of these methods focus on general data selection and tend to overlook the specific nuances of domain-related data. In this paper, we introduce MASS, a Mathematical data Selection framework using the Skill graph for pretraining LLMs in the mathematical reasoning domain. By taking into account the unique characteristics of mathematics and reasoning, we construct a skill graph that captures the mathematical skills and their interrelations from a reference dataset. This skill graph guides us in assigning quality scores to the target dataset, enabling us to select the top-ranked subset, which is then used to pretrain LLMs. Experimental results demonstrate the efficiency and effectiveness of MASS across different model sizes (1B and 7B) and pretraining datasets (web data and synthetic data). Specifically, in terms of efficiency, models trained on subsets selected by MASS can achieve similar performance to models trained on the original datasets, with 50\% to 70\% fewer training tokens. In terms of effectiveness, when trained on the same amount of tokens, models trained on the data selected by MASS outperform those trained on the original datasets by 3.3\% to 5.9\%. These results underscore the potential of MASS to improve both the efficiency and effectiveness of pretraining LLMs.
Paperid:1913
Authors:Antoine Gonon · Léon Zheng · Pascal Carrivain · TUNG LE
Abstract:
This paper \emph{benchmarks} and \emph{improves} existing GPU matrix multiplication algorithms specialized for Kronecker-sparse matrices, whose sparsity patterns are described by Kronecker products. These matrices have recently gained popularity as replacements for dense matrices in neural networks because they preserve accuracy while using fewer parameters. We present the first energy and time benchmarks for the multiplication with such matrices, helping users identify scenarios where Kronecker-sparse matrices are more time- and energy-efficient than their dense counterparts. Our benchmark also reveals that specialized implementations spend up to 50% of their total runtime on memory rewriting operations. To address the challenge of reducing memory transfers, we introduce a new tiling strategy for matrix multiplication with Kronecker-sparse matrices, which reduces reads and writes (I/O transfers) between levels of GPU memory. We implement this tiling strategy in a new CUDA kernel that achieves a median speed-up of 1.4x, while also cutting energy consumption by 15%. We further demonstrate the broader impact of our results by applying the new kernel to accelerate transformer inference.
Paperid:1914
Authors:Alexandre Verine · Florian Le Bronnec · Kunhao Zheng · Alexandre Allauzen · yann CHEVALEYRE · benjamin negrevergne
Abstract:
Increasing diversity in language models is a challenging yet essential objective. A common approach is raising the decoding temperature. In this work, we investigate this approach through a simplistic yet common case to provide insights into why decreasing temperature can improve quality (Precision) but increasing it often fails to boost coverage (Recall). Our analysis reveals that for a model to be effectively tunable through temperature adjustments, it must be trained toward coverage. To address this, we propose rethinking loss functions in language models, leveraging the Precision-Recall framework. Our results demonstrate that this approach helps to achieve a substantially better trade-off between Precision and Recall than merely combining negative log-likelihood training with temperature scaling. These findings offer a pathway toward more versatile and robust language modeling techniques.
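The decoding-temperature knob discussed here can be illustrated with a minimal sketch (standard temperature-scaled softmax over toy logits, not the paper's proposed loss):

```python
import numpy as np

def temperature_softmax(logits, temperature):
    """Temperature-scaled softmax: T < 1 sharpens the distribution (favoring
    Precision), T > 1 flattens it toward broader coverage (aiming at Recall)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                    # shift for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

logits = [2.0, 1.0, 0.1]                      # toy next-token logits
cold = temperature_softmax(logits, 0.5)       # mass concentrates on the top token
hot = temperature_softmax(logits, 2.0)        # mass spreads across tokens
```

Sampling from `hot` only yields real coverage gains if the model has placed mass on the diverse tokens in the first place, which is the training-side issue the abstract raises.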
Paperid:1915
Authors:Henrik Schopmans · Pascal Friederich
Abstract:
Efficient sampling of unnormalized probability densities such as the Boltzmann distribution of molecular systems is a long-standing challenge. Next to conventional approaches like molecular dynamics or Markov chain Monte Carlo, variational approaches, such as training normalizing flows with the reverse Kullback-Leibler divergence, have been introduced. However, such methods are prone to mode collapse and often do not learn to sample the full configurational space. Here, we present temperature-annealed Boltzmann generators (TA-BG) to address this challenge. First, we demonstrate that training a normalizing flow with the reverse Kullback-Leibler divergence at high temperatures is possible without mode collapse. Furthermore, we introduce a reweighting-based training objective to anneal the distribution to lower target temperatures. We apply this methodology to three molecular systems of increasing complexity and, compared to the baseline, achieve better results in almost all metrics while requiring up to three times fewer target energy evaluations. For the largest system, our approach is the only method that accurately resolves the metastable states of the system.
Paperid:1916
Authors:Yilun Kuang · Noah Amsel · Sanae Lotfi · Shikai Qiu · Andres Potapczynski · Andrew Wilson
Abstract:
The core component of attention is the scoring function, which transforms the inputs into low-dimensional queries and keys and takes the dot product of each pair. While the low-dimensional projection improves efficiency, it causes information loss for certain tasks that have intrinsically high-dimensional inputs. Additionally, attention uses the same scoring function for all input pairs, without imposing a locality bias for neighboring tokens in the sequence. In this work, we address these shortcomings by proposing new scoring functions based on computationally efficient structured matrices with high ranks, including Block Tensor-Train (BTT) and Multi-Level Low Rank (MLR) matrices. On in-context regression tasks with high-dimensional inputs, our proposed scoring functions outperform standard attention for any fixed compute budget. On language modeling, a task that exhibits locality, our MLR-based attention method achieves improved scaling laws compared to both standard attention and variants of sliding window attention. Additionally, we show that both BTT and MLR fall under a broader family of efficient structured matrices capable of encoding either full-rank or locality biases, thereby addressing significant shortcomings of standard attention.
Paperid:1917
Authors:Sayan Bhattacharya · Martín Costa · Ermiya Farokhnejad · Silvio Lattanzi · Nikos Parotsidis
Abstract:
In this paper, we consider the *metric $k$-center* problem in the fully dynamic setting, where we are given a metric space $(V,d)$ evolving via a sequence of point insertions and deletions and our task is to maintain a subset $S \subseteq V$ of at most $k$ points that minimizes the objective $\max_{x \in V} \min_{y \in S}d(x, y)$. We want to design our algorithm so that we minimize its *approximation ratio*, *recourse* (the number of changes it makes to the solution $S$) and *update time* (the time it takes to handle an update). We give a simple algorithm for dynamic $k$-center that maintains a $O(1)$-approximate solution with $O(1)$ amortized recourse and $\tilde O(k)$ amortized update time, *obtaining near-optimal approximation, recourse and update time simultaneously*. We obtain our result by combining a variant of the dynamic $k$-center algorithm of Bateni et al. [SODA'23] with the dynamic sparsifier of Bhattacharya et al. [NeurIPS'23].
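For readers unfamiliar with the objective, the classic static baseline is Gonzalez's greedy farthest-point algorithm, a 2-approximation (shown here as a plain sketch on toy 2D points; the paper's contribution is the much harder fully dynamic setting, not this procedure):

```python
import math

def gonzalez_k_center(points, k):
    """Greedy 2-approximation for static k-center: repeatedly pick the
    point farthest from the current set of centers."""
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(farthest)
    # objective value: max over points of the distance to the nearest center
    radius = max(min(math.dist(p, c) for c in centers) for p in points)
    return centers, radius

pts = [(0, 0), (0, 1), (10, 0), (10, 1), (20, 0)]
centers, radius = gonzalez_k_center(pts, 3)   # three well-spread centers
```

In the dynamic setting, rerunning such a procedure from scratch after every update would blow up both recourse and update time, which is exactly what the paper's algorithm avoids.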
Paperid:1918
Authors:Junhang Li · Yu Guo · Xian · Shengfeng He
Abstract:
Images are often obstructed by various obstacles due to capture limitations, hindering the observation of objects of interest. Most existing methods address occlusions from specific elements like fences or raindrops, but are constrained by the wide range of real-world obstructions, making comprehensive data collection impractical. To overcome these challenges, we propose Instruct2See, a novel zero-shot framework capable of handling both seen and unseen obstacles. The core idea of our approach is to unify obstruction removal by treating it as a soft-hard mask restoration problem, where any obstruction can be represented using multi-modal prompts, such as visual semantics and textual instructions, processed through a cross-attention unit to enhance contextual understanding and improve mode control. Additionally, a tunable mask adapter allows for dynamic soft masking, enabling real-time adjustment of inaccurate masks. Extensive experiments on both in-distribution and out-of-distribution obstacles show that Instruct2See consistently achieves strong performance and generalization in obstruction removal, regardless of whether the obstacles were present during the training phase.
Paperid:1919
Authors:Artavazd Maranjyan · Alexander Tyurin · Peter Richtarik
Abstract:
Asynchronous Stochastic Gradient Descent (Asynchronous SGD) is a cornerstone method for parallelizing learning in distributed machine learning. However, its performance suffers under arbitrarily heterogeneous computation times across workers, leading to suboptimal time complexity and inefficiency as the number of workers scales. While several Asynchronous SGD variants have been proposed, recent findings by Tyurin & Richtárik (NeurIPS 2023) reveal that none achieve optimal time complexity, leaving a significant gap in the literature. In this paper, we propose Ringmaster ASGD, a novel Asynchronous SGD method designed to address these limitations and tame the inherent challenges of Asynchronous SGD. We establish, through rigorous theoretical analysis, that Ringmaster ASGD achieves optimal time complexity under arbitrarily heterogeneous and dynamically fluctuating worker computation times. This makes it the first Asynchronous SGD method to meet the theoretical lower bounds for time complexity in such scenarios.
Paperid:1920
Authors:Pengsheng Guo · Alex Schwing
Abstract:
We study Variational Rectified Flow Matching, a framework that enhances classic rectified flow matching by modeling multimodal velocity vector-fields. At inference time, classic rectified flow matching 'moves' samples from a source distribution to the target distribution by solving an ordinary differential equation via integration along a velocity vector-field. At training time, the velocity vector-field is learnt by linearly interpolating between coupled samples, one drawn at random from the source distribution and one from the target distribution. This leads to ''ground-truth'' velocity vector-fields that point in different directions at the same location, i.e., the velocity vector-fields are multi-modal/ambiguous. However, since training uses a standard mean-squared-error loss, the learnt velocity vector-field averages ''ground-truth'' directions and isn't multi-modal. In contrast, variational rectified flow matching learns and samples from multi-modal flow directions. We show on synthetic data, MNIST, CIFAR-10, and ImageNet that variational rectified flow matching leads to compelling results.
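The classic training target described above can be sketched in a few lines (a toy illustration of standard rectified flow matching with a dummy zero predictor, not the paper's variational variant):

```python
import numpy as np

rng = np.random.default_rng(0)

def rectified_flow_targets(x0, x1, t):
    """Linear interpolation x_t = (1 - t) * x0 + t * x1 between coupled
    samples, with the constant velocity target x1 - x0 along the segment."""
    xt = (1 - t)[:, None] * x0 + t[:, None] * x1
    v = x1 - x0
    return xt, v

x0 = np.zeros((4, 2))                   # source samples
x1 = np.ones((4, 2))                    # target samples
t = rng.random(4)                       # one interpolation time per coupled pair
xt, v = rectified_flow_targets(x0, x1, t)

# A mean-squared-error fit against these targets averages ambiguous directions;
# here a dummy all-zero prediction incurs loss 1 because every target entry is 1.
loss = np.mean((0.0 - v) ** 2)
```

When different couplings place conflicting targets at the same `xt`, the MSE fit returns their average, which is exactly the uni-modality the variational formulation is designed to avoid.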
Paperid:1921
Authors:Shuo Yang · Bardh Prenkaj · Gjergji Kasneci
Abstract:
Shortcut learning undermines model generalization to out-of-distribution data. While the literature attributes shortcuts to biases in superficial features, we show that imbalances in the semantic distribution of sample embeddings induce spurious semantic correlations, compromising model robustness. To address this issue, we propose SCISSOR (Semantic Cluster Intervention for Suppressing ShORtcut), a Siamese network-based debiasing approach that remaps the semantic space by discouraging latent clusters exploited as shortcuts. Unlike prior data-debiasing approaches, SCISSOR eliminates the need for data augmentation and rewriting. We evaluate SCISSOR on 6 models across 4 benchmarks: Chest-XRay and Not-MNIST in computer vision, and GYAFC and Yelp in NLP tasks. Compared to several baselines, SCISSOR reports +5.3 absolute points in F1 score on GYAFC, +7.3 on Yelp, +7.7 on Chest-XRay, and +1 on Not-MNIST. SCISSOR is also highly advantageous for lightweight models with ~9.5% improvement on F1 for ViT on computer vision datasets and ~11.9% for BERT on NLP. Our study redefines the landscape of model generalization by addressing overlooked semantic biases, establishing SCISSOR as a foundational framework for mitigating shortcut learning and fostering more robust, bias-resistant AI systems.
Paperid:1922
Authors:Nikita Lagrange · Herve Isambert
Abstract:
We propose a greedy search-and-score algorithm for ancestral graphs, which include directed as well as bidirected edges, originating from unobserved latent variables. The normalized likelihood score of ancestral graphs is estimated in terms of multivariate information over relevant "$ac$-connected subset" of vertices, $\boldsymbol{C}$, that are connected through collider paths confined to the ancestor set of $\boldsymbol{C}$. For computational efficiency, the proposed two-step algorithm relies on local information scores limited to the close surrounding vertices of each node (step 1) and edge (step 2). This computational strategy, although restricted to information contributions from $ac$-connected subsets containing up to two-collider paths, is shown to outperform state-of-the-art causal discovery methods on challenging benchmark datasets.
Paperid:1923
Authors:Hongwei Zhang · Ziqi Ye · Xinyuan Wang · Xin Guo · Zenglin Xu · Yuan Cheng · Zixin Hu · Yuan Qi
Abstract:
We propose Network Automatic Relevance Determination (NARD), an extension of ARD for linear probabilistic models, to simultaneously model sparse relationships between inputs $X \in \mathbb R^{d \times N}$ and outputs $Y \in \mathbb R^{m \times N}$, while capturing the correlation structure among the $Y$. NARD employs a matrix normal prior which contains a sparsity-inducing parameter to identify and discard irrelevant features, thereby promoting sparsity in the model. Algorithmically, it iteratively updates both the precision matrix and the relationship between $Y$ and the refined inputs. To mitigate the computational inefficiencies of the $\mathcal O(m^3 + d^3)$ cost per iteration, we introduce Sequential NARD, which evaluates features sequentially, and a Surrogate Function Method, leveraging an efficient approximation of the marginal likelihood and simplifying the calculation of the determinant and inverse of an intermediate matrix. Combining the Sequential update with the Surrogate Function method further reduces computational costs. The computational complexity per iteration for these three methods is reduced to $\mathcal O(m^3+p^3)$, $\mathcal O(m^3 + d^2)$, $\mathcal O(m^3+p^2)$ respectively, where $p \ll d$ is the final number of features in the model. Our methods demonstrate significant improvements in computational efficiency with comparable performance on both synthetic and real-world datasets.
Paperid:1924
Authors:Yong Wu · Yanwei Fu · Shouyan Wang · Xinwei Sun
Abstract:
Bivariate causal discovery is challenging when unmeasured confounders exist. To adjust for the bias, previous methods employed the proxy variable (i.e., negative control outcome (NCO)) to test the treatment-outcome relationship through integral equations -- and assumed that violation of this equation indicates the causal relationship. Building on this, they could establish asymptotic properties for causal hypothesis testing. However, these methods either relied on parametric assumptions or suffered from the sample-efficiency issue. Moreover, it is unclear when this underlying integral-related assumption holds, making it difficult to justify the utility in practice. To address these problems, we first consider the scenario where only NCO is available. We propose a novel non-parametric procedure, which enjoys asymptotic properties and better sample efficiency. Moreover, we find that when NCO affects the outcome, the above integral-related assumption may not hold, rendering the causal relation unidentifiable. Informed by this, we further consider the scenario when the negative control exposure (NCE) is also available. In this scenario, we construct another integral restriction aided by this proxy, which can discover causation when NCO affects the outcome. We demonstrate these findings and the effectiveness of our proposals through comprehensive numerical studies.
Paperid:1925
Authors:Jie Gao · Rajesh Jayaram · Benedikt Kolbe · Shay Sapir · Chris Schwiegelshohn · Sandeep Silwal · Erik Waingarten
Abstract:
Randomized dimensionality reduction is a widely used algorithmic technique for speeding up large-scale Euclidean optimization problems. In this paper, we study dimension reduction for a variety of maximization problems, including max-matching, max-spanning tree, as well as various measures for dataset diversity. For these problems, we show that the effect of dimension reduction is intimately tied to the *doubling dimension* $\lambda_X$ of the underlying dataset $X$---a quantity measuring intrinsic dimensionality of point sets. Specifically, the dimension required is $O(\lambda_X)$, which we also show is necessary for some of these problems. This is in contrast to classical dimension reduction results, whose dependence grows with the dataset size $|X|$. We also provide empirical results validating the quality of solutions found in the projected space, as well as speedups due to dimensionality reduction.
Paperid:1926
Authors:Rui Xue · Tong Zhao · Neil Shah · Xiaorui Liu
Abstract:
Graph neural networks (GNNs) have demonstrated remarkable success in graph representation learning and various sampling approaches have been proposed to scale GNNs to applications with large-scale graphs. A class of promising GNN training algorithms takes advantage of historical embeddings to reduce the computation and memory cost while maintaining the model expressiveness of GNNs. However, they incur significant computation bias due to the stale feature history. In this paper, we provide a comprehensive analysis of their staleness and inferior performance on large-scale problems. Motivated by our discoveries, we propose a simple yet highly effective training algorithm (REST) to effectively reduce feature staleness, which leads to significantly improved performance and convergence across varying batch sizes, especially when staleness is predominant. The proposed algorithm seamlessly integrates with existing solutions, boasting easy implementation, while comprehensive experiments underscore its superior performance and efficiency on large-scale benchmarks. Specifically, our improvements to state-of-the-art historical embedding methods result in a 2.7\% and 3.6\% performance enhancement on the ogbn-papers100M and ogbn-products dataset respectively, accompanied by notably accelerated convergence. Our code will be made publicly available upon the acceptance of this work.
Paperid:1927
Authors:Alec Helbling · Tuna Han Salih Meral · Benjamin Hoover · Pinar Yanardag · Polo Chau
Abstract:
Do the rich representations of multi-modal diffusion transformers (DiTs) exhibit unique properties that enhance their interpretability? We introduce ConceptAttention, a novel method that leverages the expressive power of DiT attention layers to generate high-quality saliency maps that precisely locate textual concepts within images. Without requiring additional training, ConceptAttention repurposes the parameters of DiT attention layers to produce highly contextualized concept embeddings, contributing the major discovery that performing linear projections in the output space of DiT attention layers yields significantly sharper saliency maps compared to commonly used cross-attention mechanisms. Remarkably, ConceptAttention even achieves state-of-the-art performance on zero-shot image segmentation benchmarks, outperforming 11 other zero-shot interpretability methods on the ImageNet-Segmentation dataset and on a single-class subset of PascalVOC. Our work contributes the first evidence that the representations of multi-modal DiT models like Flux are highly transferable to vision tasks like segmentation, even outperforming multi-modal foundation models like CLIP.
Paperid:1928
Authors:Jie Wen · Yadong Liu · Chengliang Liu · Zhanyan Tang · Yuting He · Yulong Chen · Mu Li
Abstract:
Multi-view data involves various data forms, such as multi-feature, multi-sequence and multimodal data, providing rich semantic information for downstream tasks. The inherent challenge of incomplete multi-view missing multi-label learning lies in how to effectively utilize limited supervision and insufficient data to learn discriminative representation. Starting from the sufficiency of multi-view shared information for downstream tasks, we argue that the existing contrastive learning paradigms on missing multi-view data show limited consistency representation learning ability, leading to the bottleneck in extracting multi-view shared information. In response, we propose to minimize task-independent redundant information by pursuing the maximization of cross-view mutual information. Additionally, to alleviate the hindrance caused by missing labels, we develop a dual-branch soft pseudo-label cross-imputation strategy to improve classification performance. Extensive experiments on multiple benchmarks validate our advantages and demonstrate strong compatibility with both missing and complete data.
Paperid:1929
Authors:Junyu Luo · Yuhao Tang · Yiwei Fu · Xiao Luo · Zhizhuo KOU · Zhiping Xiao · Wei Ju · Wentao Zhang · Ming Zhang
Abstract:
Unsupervised graph domain adaptation aims to transfer graph data knowledge from a labeled source domain to an unlabeled target domain while addressing distribution shift. However, existing methods often yield suboptimal results due to the entanglement of causal-spurious features and the failure of global alignment strategies. We propose \method{}~(Sparse causal discovery with Generative intervention), a novel approach that achieves stable graph representation transfer through sparse causal modeling and dynamic intervention mechanisms. Specifically, \method{} first constructs a sparse causal graph structure, leveraging mutual information bottleneck constraints to disentangle sparse, stable causal features while compressing domain-dependent spurious correlations through variational inference. To address residual spurious correlations, we innovatively design a generative intervention mechanism that breaks local spurious couplings through cross-domain feature recombination while maintaining causal feature semantic consistency via covariance constraints. Furthermore, to mitigate error accumulation in target domain pseudo-labels, we introduce a category-adaptive dynamic calibration strategy, ensuring stable discriminative learning. Extensive experiments on multiple real-world datasets demonstrate that \method{} significantly outperforms existing baselines.
Paperid:1930
Authors:Akhiad Bercovich · Tomer Ronen · Talor Abramovich · Nir Ailon · Nave Assaf · Mohammed Dabbah · Ido Galil · Amnon Geifman · Yonatan Geifman · Izhak Golan · Netanel Haber · Ehud Karpas · Roi Koren · Itay Levy · Pavlo Molchanov · Shahar Mor · Zach Moshe · Najeeb Nabwani · Omri Puny · Ran Rubin · Itamar Schen · Ido Shahaf · Oren Tropp · Omer Argov · Ran Zilberstein · Ran El-Yaniv
Abstract:
Large language models (LLMs) offer remarkable capabilities, yet their high inference costs restrict wider adoption. While increasing parameter counts improves accuracy, it also broadens the gap between state-of-the-art capabilities and practical deployability. We present Puzzle, a hardware-aware framework that accelerates the inference of LLMs while preserving their capabilities. Using neural architecture search (NAS) at large scale, Puzzle optimizes models with tens of billions of parameters. Our approach utilizes blockwise local knowledge distillation (BLD) for parallel architecture exploration and employs mixed-integer programming for precise constraint optimization. We showcase our framework’s impact via Llama-3.1-Puzzle-51B-Instruct (Puzzle-51B), a model derived from Llama-3.1-70B-Instruct that will be released publicly. Puzzle-51B achieves a 2.17x inference throughput speedup, fitting on a single NVIDIA H100 GPU while retaining 98.4% of the original model's benchmark accuracies. Notably, it is the most accurate model supporting single H100 GPU inference with large batch sizes, despite training on only 45B tokens, far fewer than the 15T used to train Llama-70B. Lastly, we show that lightweight alignment on these derived models allows them to surpass the parent model in specific capabilities. Our work establishes that powerful LLMs can be optimized for efficient deployment with only negligible loss in quality, underscoring that inference performance, not parameter count alone, should guide model selection.
Paperid:1931
Authors:Sihui Dai · Christian Cianfarani · Arjun Bhagoji · Vikash Sehwag · Prateek Mittal
Abstract:
Abstract:Robust training methods typically defend against specific attack types, such as $\ell_p$ attacks with fixed budgets, and rarely account for the fact that defenders may encounter new attacks over time. A natural solution is to adapt the defended model to new adversaries as they arise via fine-tuning, a method which we call continual robust training (CRT). However, when implemented naively, fine-tuning on new attacks degrades robustness on previous attacks. This raises the question: \textit{how can we improve the initial training and fine-tuning of the model to simultaneously achieve robustness against previous and new attacks?} We present theoretical results which show that the gap in a model's robustness against different attacks is bounded by how far each attack perturbs a sample in the model's logit space, suggesting that regularizing with respect to this logit space distance can help maintain robustness against previous attacks. Extensive experiments on 3 datasets (CIFAR-10, CIFAR-100, and ImageNette) and over 100 attack combinations demonstrate that the proposed regularization improves robust accuracy with little overhead in training time. Our findings and open-source code lay the groundwork for the deployment of models robust to evolving attacks.
Paperid:1932
Authors:Bo Xue · Dake Bu · Ji Cheng · Yuanyu Wan · Qingfu Zhang
Abstract:
Reinforcement Learning (RL) with linear transition kernels and reward functions has recently attracted growing attention due to its computational efficiency and theoretical advancements. However, prior theoretical research in RL has primarily focused on single-objective problems, resulting in limited theoretical development for multi-objective reinforcement learning (MORL). To bridge this gap, we examine MORL under lexicographic reward structures, where rewards comprise $m$ hierarchically ordered objectives. In this framework, the agent maximizes objectives sequentially, prioritizing the highest-priority objective before considering subsequent ones. We introduce the first MORL algorithm with provable regret guarantees. For any objective $i \in \{1, 2, \cdots, m\}$, our algorithm achieves a regret bound of $\widetilde{O}(\Lambda^i(\lambda) \cdot \sqrt{d^2H^4 K})$, where $\Lambda^i(\lambda) = 1 + \lambda + \cdots + \lambda^{i-1}$, $\lambda$ quantifies the trade-off between conflicting objectives, $d$ is the feature dimension, $H$ is the episode length, and $K$ is the number of episodes. Furthermore, our algorithm can be applied in the misspecified setting, where the regret bound for the $i$-th objective becomes $\widetilde{O}(\Lambda^i(\lambda)\cdot(\sqrt{d^2H^4K}+\epsilon dH^2K))$, with $\epsilon$ denoting the degree of misspecification.
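The trade-off factor $\Lambda^i(\lambda)$ above is a plain geometric sum, which makes the scaling of the bound easy to tabulate. A minimal illustrative sketch (helper names are our own, not code from the paper; logarithmic factors and constants in $\widetilde{O}(\cdot)$ are omitted):

```python
import math

def trade_off_factor(i: int, lam: float) -> float:
    """Lambda^i(lambda) = 1 + lam + ... + lam^(i-1), a geometric sum."""
    return sum(lam ** j for j in range(i))

def regret_scale(i: int, lam: float, d: int, H: int, K: int) -> float:
    """Leading term Lambda^i(lambda) * sqrt(d^2 H^4 K) of the regret bound
    for the i-th objective (constants and log factors dropped)."""
    return trade_off_factor(i, lam) * math.sqrt(d ** 2 * H ** 4 * K)
```

With $\lambda > 1$ the factor grows geometrically in the objective index $i$, so lower-priority objectives pay an exponentially larger price for the lexicographic ordering.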
Paperid:1933
Authors:Brian Mak · Jeffrey Flanigan
Abstract:
The residual stream acts as a memory bus where transformer layers both store and access features (Elhage et al., 2021). We consider changing the mechanism for retrieving and storing information in the residual stream, and replace the residual stream of the transformer with an outer product memory matrix (Kohonen, 1972; Anderson, 1972). We call this model the Residual Matrix Transformer (RMT). We find that the RMT enjoys a number of attractive properties: 1) the size of the residual stream can be scaled independently of compute and model size, improving performance, 2) the RMT can achieve the same loss as the transformer with 58% fewer FLOPS, 25% fewer parameters, and 41% fewer training tokens, and 3) the RMT outperforms the transformer on downstream evaluations. We theoretically analyze the transformer and the RMT, and show that the RMT allows for more efficient scaling of the residual stream, as well as improved variance propagation properties.
Paperid:1934
Authors:Yi Xie · Zhanke Zhou · Chentao Cao · Qiyu Niu · Tongliang Liu · Bo Han
Abstract:
Large Language Models (LLMs) exhibit strong reasoning capabilities, which can be further enhanced through multi-agent frameworks. However, existing multi-agent methods often suffer from high computational costs and lack theoretical convergence guarantees. To address these limitations, we introduce an incomplete-information perspective to enhance the scalability of multiple LLMs by modeling them with Bayesian Nash Equilibrium (BNE) and propose \textbf{E}fficient \textbf{Co}ordination via \textbf{N}ash Equilibrium (ECON), a hierarchical reinforcement learning framework. ECON guides multiple LLMs to achieve BNE by integrating distributed reasoning and centralized commitment, ensuring that each LLM independently generates optimal answers based on its own beliefs without the need for extensive inter-agent communication. Theoretically, we prove that our framework achieves a regret bound of $O\left( \nicefrac{N\sqrt{T}}{1 - \gamma} \right)$, which grows sublinearly with $T$, while multi-agent frameworks that do not attain BNE can at best achieve $O\left( \nicefrac{\delta_{\max} T}{1 - \gamma} \right)$. Empirically, our method outperforms single-LLM approaches by 10.9\% and surpasses existing multi-LLM methods by 11.2\% on average across six benchmarks covering complex reasoning and planning tasks. Additionally, scalability experiments show that our approach efficiently integrates more models, confirming its flexibility and scalability, potentially enabling larger multi-LLM ensembles.
Paperid:1935
Authors:Ohad Einav · Nir Rosenfeld
Abstract:
Machine learning models play a key role for service providers looking to gain market share in consumer markets. However, traditional learning approaches do not take into account the existence of additional providers, who compete with each other for consumers. Our work aims to study learning in this market setting, as it affects providers, consumers, and the market itself. We begin by analyzing such markets through the lens of the learning objective, and show that accuracy cannot be the only consideration. We then propose a method for classification under competition, so that a learner can maximize market share in the presence of competitors. We show that our approach benefits the providers as well as the consumers, and find that the timing of market entry and model updates can be crucial. We demonstrate the effectiveness of our approach across a range of domains, from simple distributions to noisy datasets, and show that the market as a whole remains stable by converging quickly to an equilibrium.
Paperid:1936
Authors:Yu Zhu · Chunfeng Song · Wanli Ouyang · Shan Yu · Tiejun Huang
Abstract:
Individual brains exhibit striking structural and physiological heterogeneity, yet neural circuits can generate remarkably consistent functional properties across individuals, an apparent paradox in neuroscience. While recent studies have observed preserved neural representations in motor cortex through manual alignment across subjects, the zero-shot validation of such preservation and its generalization to more cortices remain unexplored. Here we present PNBA (Probabilistic Neural-Behavioral Representation Alignment), a new framework that leverages probabilistic modeling to address hierarchical variability across trials, sessions, and subjects, with generative constraints preventing representation degeneration. By establishing reliable cross-modal representational alignment, PNBA reveals robust preserved neural representations in monkey primary motor cortex (M1) and dorsal premotor cortex (PMd) through zero-shot validation. We further establish similar representational preservation in mouse primary visual cortex (V1), reflecting a general neural basis. These findings resolve the paradox of neural heterogeneity by establishing zero-shot preserved neural representations across cortices and species, enriching neural coding insights and enabling zero-shot behavior decoding.
Paperid:1937
Authors:Qingpo Wuwu · Chonghan Gao · Tianyu Chen · Yihang Huang · Yuekai Zhang · Jianing Wang · Jianxin Li · Haoyi Zhou · Shanghang Zhang
Abstract:
Solving partial differential equations (PDEs) using neural methods has been a longstanding scientific and engineering research pursuit. Physics-Informed Neural Networks (PINNs) have emerged as a promising alternative to traditional numerical methods for solving PDEs. However, the gap between domain-specific knowledge and deep learning expertise often limits the practical application of PINNs. Previous works typically involve manually conducting extensive PINNs experiments and summarizing heuristic rules for hyperparameter tuning. In this work, we introduce \textit{PINNsAgent}, a novel surrogation framework that leverages large language models (LLMs) and utilizes PINNs as a foundation to bridge the gap between domain-specific knowledge and deep learning. Specifically, \textit{PINNsAgent} integrates (1) Physics-Guided Knowledge Replay (PGKR), which encodes the essential characteristics of PDEs and their associated best-performing PINNs configurations into a structured format, enabling efficient knowledge transfer from solved PDEs to similar problems and (2) Memory Tree Reasoning, a strategy that effectively explores the search space for optimal PINNs architectures. By leveraging LLMs and exploration strategies, \textit{PINNsAgent} enhances the automation and efficiency of PINNs-based solutions. We evaluate \textit{PINNsAgent} on 12 benchmark PDEs, demonstrating its effectiveness in automating the surrogation process and significantly improving the efficiency and accuracy of PINNs-based solutions.
Paperid:1938
Authors:Kaiying Hou · Eran Malach · Samy Jelassi · David Brandfonbrener · Sham Kakade
Abstract:
Length generalization refers to the ability to extrapolate from short training sequences to long test sequences and is a challenge for current large language models. While prior work has proposed some architecture or data format changes to achieve length generalization, these proposals typically apply to a limited set of tasks. Building on prior scratchpad and Chain-of-Thought (CoT) techniques, we propose Turing Programs, a novel CoT strategy that decomposes an algorithmic task into steps mimicking the computation of a Turing Machine. This framework is both universal, as it can accommodate any algorithmic task, and simple, requiring only copying text from the context with small modifications. We show that by using Turing Programs, we obtain robust length generalization on a range of algorithmic tasks: addition, multiplication and in-context SGD. We then demonstrate that transformers achieve length generalization on random Turing Programs, suggesting that length generalization is possible for any algorithmic task. Finally, we theoretically prove that transformers can implement Turing Programs, constructing a simple RASP (Weiss et al.) program that simulates an arbitrary Turing machine.
Paperid:1939
Authors:Rushuang Zhou · Yuanting Zhang · Yining Dong
Abstract:
Fine-tuning large-scale pre-trained models provides an effective solution to alleviate the label scarcity problem in the detection of cardiovascular diseases (CVDs) using the electrocardiogram (ECG). However, as pre-trained models scale up, the computational costs of fine-tuning and inference become unaffordable on the low-level devices deployed for clinical applications. Moreover, maintaining model performance under tight computational budgets remains a significant challenge, and a comprehensive study that addresses both issues in a joint framework is still lacking. Here, we propose a holistic method (H-Tuning) for low-cost and efficient fine-tuning of pre-trained models on downstream datasets. The inference costs of the models fine-tuned by H-Tuning are then further reduced significantly using a knowledge distillation technique. Experiments on four ECG datasets demonstrate that the proposed framework reduces GPU memory consumption during fine-tuning by 6.34 times while achieving CVDs detection performance comparable to standard fine-tuning. With the knowledge distillation technique, model inference latency and memory consumption are reduced by 19.8 times and 4.5 times, respectively. As such, the proposed joint framework allows pre-trained models to be utilized with high computational efficiency and robust performance, exploring a path toward low-cost and efficient CVDs detection.
Paperid:1940
Authors:Divya Shyamal · Jiaqi Zhang · Caroline Uhler
Abstract:
A _combinatorial intervention_, consisting of multiple treatments applied to a single unit with potential interactive effects, has substantial applications in fields such as biomedicine, engineering, and beyond. Given $p$ possible treatments, conducting all possible $2^p$ combinatorial interventions can be laborious and quickly becomes infeasible as $p$ increases. Here we introduce the _probabilistic factorial experimental design_, formalized from how scientists perform lab experiments. In this framework, the experimenter selects a dosage for each possible treatment and applies it to a group of units. Each unit independently receives a random combination of treatments, sampled from a product Bernoulli distribution determined by the dosages. Additionally, the experimenter can carry out such experiments over multiple rounds, adapting the design in an active manner. We address the optimal experimental design problem within a novel intervention model that imposes bounded-degree interactions between treatments. In the passive setting, we provide a closed-form solution for the near-optimal design. Our results prove that a dosage of $\frac{1}{2}$ for each treatment is optimal up to a factor of $1+O(\frac{\ln(n)}{n})$ for estimating any $k$-way interaction model, regardless of $k$, and imply that $O\big(kp^{3k}\ln(p)\big)$ observations are required to accurately estimate this model. For the multi-round setting, we provide a near-optimal acquisition function that can be numerically optimized. We also explore several extensions of the design problem and finally validate our findings through simulations.
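One round of the probabilistic factorial design described above is straightforward to simulate: each unit independently draws its treatment combination from a product Bernoulli distribution given by the dosages. A minimal sketch (function and variable names are our own, not from the paper):

```python
import random

def sample_round(dosages, n_units, seed=0):
    """Return an n_units x p binary matrix of treatment assignments,
    where unit u receives treatment j with probability dosages[j]."""
    rng = random.Random(seed)
    return [[1 if rng.random() < d else 0 for d in dosages]
            for _ in range(n_units)]

# Near-optimal passive design per the abstract: dosage 1/2 per treatment.
assignments = sample_round([0.5, 0.5, 0.5], n_units=100)
```

With dosage $\frac{1}{2}$ for every treatment, all $2^p$ combinations are equally likely, which is the design shown to be near-optimal in the passive setting.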
Paperid:1941
Authors:Yu Wang · Mazdak Abulnaga · Yaël Balbastre · Bruce Fischl
Abstract:
Solving sparse linear systems and/or partial differential equations (PDEs) is fundamental to computing and learning problems in science and engineering. We substantially improve the speed of \emph{direct} sparse linear solvers by orders of magnitude, violating the common assumption that direct solvers are too expensive. The runtime on images is reduced to tens of milliseconds, suitable for interactive applications in vision, graphics, and scientific computing. We ``condense'' a sparse Laplacian matrix into a dense tensor that batch-wise stores the Dirichlet-to-Neumann matrices, reducing the sparse solve to sequentially merging pairs of dense matrices that are much smaller. These small dense systems are batched and inverted in parallel, taking advantage of dense GPU BLAS kernels, highly optimized in the era of deep learning. Methods involving linear solvers can immediately benefit from our approach, including solvers for PDEs and FEM (finite element methods). In addition, our method is differentiable and hence enables efficient integration into machine learning pipelines, and serves as an efficient zero-shot baseline for learning-based PDE solvers.
Paperid:1942
Authors:Renze Lou · Hanzi Xu · Sijia Wang · Jiangshu Du · Ryo Kamoi · Xiaoxin Lu · Jian Xie · Yuxuan Sun · Yusen Zhang · Jihyun Ahn · Hongchao Fang · Zhuoyang Zou · Wenchao Ma · Xi Li · Kai Zhang · Congying Xia · Lifu Huang · Wenpeng Yin
Abstract:
Numerous studies have assessed the proficiency of AI systems, particularly large language models (LLMs), in facilitating everyday tasks such as email writing, question answering, and creative content generation. However, researchers face unique challenges and opportunities in leveraging LLMs for their own work, such as brainstorming research ideas, designing experiments, and writing or reviewing papers. In this study, we introduce AAAR-1.0, a benchmark dataset designed to evaluate LLM performance in three fundamental, expertise-intensive research tasks: (i) EquationInference, assessing the correctness of equations based on the contextual information in paper submissions; (ii) ExperimentDesign, designing experiments to validate research ideas and solutions; and (iii) PaperWeakness, identifying weaknesses in paper submissions. AAAR-1.0 differs from prior benchmarks in two key ways: first, it is explicitly research-oriented, with tasks requiring deep domain expertise; second, it is researcher-oriented, mirroring the primary activities that researchers engage in on a daily basis. An evaluation of both open-source and proprietary LLMs reveals their potential as well as limitations in conducting sophisticated research tasks. We will release AAAR-1.0 and keep iterating it into new versions.
Paperid:1943
Authors:Chenyu Li · Oscar Michel · Xichen Pan · Sainan Liu · Mike Roberts · Saining Xie
Abstract:
Large-scale pre-trained video generation models excel in content creation but are not reliable as physically accurate world simulators out of the box. This work studies the process of post-training these models for accurate world modeling through the lens of the simple, yet fundamental, physics task of modeling objects in free fall. We show that state-of-the-art video generation models struggle with this basic task, despite their visually impressive outputs. To remedy this problem, we find that fine-tuning on a relatively small number of simulated videos is effective in inducing the dropping behavior in the model, and we can further improve results through a novel reward modeling procedure we introduce. Our study also reveals key limitations of post-training in generalization and distribution modeling. Additionally, we release a benchmark for this task that may serve as a useful diagnostic tool for measuring the emergence of accurate physics modeling in video generative models.
Paperid:1944
Authors:Saehyung Lee · Seunghyun Yoon · Trung Bui · Jing Shi · Sungroh Yoon
Abstract:
Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. Our analysis reveals that existing hallucination detection methods struggle with detailed captions. We attribute this to the increasing reliance of MLLMs on their generated text, rather than the input image, as the sequence length grows. To address this issue, we propose a multi-agent approach that leverages LLM-MLLM collaboration to correct given captions. Additionally, we introduce an evaluation framework and a benchmark dataset to facilitate the systematic analysis of detailed captions. Our experiments demonstrate that the proposed evaluation method aligns better with human judgments of factuality than existing metrics. Moreover, we show that current approaches for enhancing MLLM factuality often fail in hyper-detailed image captioning tasks. In contrast, our approach significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V. Finally, we highlight a limitation of VQA-centric benchmarking by demonstrating that an MLLM's performance on VQA benchmarks may not correlate with its ability to generate detailed image captions.
Paperid:1945
Authors:Grigoriy Ksenofontov · Aleksandr Korotin
Abstract:
The Schrödinger Bridge (SB) is a powerful framework for solving generative modeling tasks such as unpaired domain translation. Most SB-related research focuses on continuous data space $\mathbb{R}^{D}$ and leaves open theoretical and algorithmic questions about applying SB methods to discrete data, e.g., on finite spaces $\mathbb{S}^{D}$. Notable examples of such sets $\mathbb{S}$ are codebooks of vector-quantized (VQ) representations of modern autoencoders, tokens in texts, categories of atoms in molecules, etc. In this paper, we provide a theoretical and algorithmic foundation for solving SB in discrete spaces using the recently introduced Iterative Markovian Fitting (IMF) procedure. Specifically, we theoretically justify the convergence of discrete-time IMF (D-IMF) to SB in discrete spaces. This enables us to develop a practical computational algorithm for SB, which we call Categorical Schrödinger Bridge Matching (CSBM). We demonstrate the performance of CSBM via a series of experiments with synthetic data and VQ representations of images.
Paperid:1946
Authors:Daniel Marczak · Simone Magistri · Sebastian Cygert · Bartłomiej Twardowski · Andrew Bagdanov · Joost van de Weijer
Abstract:
Model merging integrates the weights of multiple task-specific models into a single multi-task model. Despite recent interest in the problem, a significant performance gap between the combined and single-task models remains. In this paper, we investigate the key characteristics of task matrices -- weight update matrices applied to a pre-trained model -- that enable effective merging. We show that alignment between singular components of task-specific and merged matrices strongly correlates with performance improvement over the pre-trained model. Based on this, we propose an isotropic merging framework that flattens the singular value spectrum of task matrices, enhances alignment, and reduces the performance gap. Additionally, we incorporate both common and task-specific subspaces to further improve alignment and performance. Our proposed approach achieves state-of-the-art performance across multiple scenarios, including various sets of tasks and model scales. This work advances the understanding of model merging dynamics, offering an effective methodology to merge models without requiring additional training.
Paperid:1947
Authors:Yaoxiang Wang · Haoling Li · Xin Zhang · Jie Wu · Xiao Liu · Wenxiang Hu · Zhongxin Guo · Yangyu Huang · Ying Xin · Yujiu Yang · Jinsong Su · Qi Chen · Scarlett Li
Abstract:
Existing methods for code generation use code snippets as seed data, restricting the complexity and diversity of the synthesized data. In this paper, we introduce a novel feature tree-based synthesis framework, which revolves around hierarchical code features derived from high-level abstractions of code. The feature tree is constructed from raw data and refined iteratively to increase the quantity and diversity of the extracted features, allowing it to capture and recognize more complex patterns and relationships within the code. By adjusting the depth and breadth of the sampled subtrees, our framework provides precise control over the complexity of the generated code, enabling functionalities that range from function-level operations to multi-file scenarios. We fine-tuned widely used base models to obtain the EpiCoder series, achieving state-of-the-art performance on multiple benchmarks at both the function and file levels. In particular, empirical evidence indicates that our approach shows significant potential in synthesizing repository-level code data.
Paperid:1948
Authors:Alberto Sinigaglia · Davide Sartor · Gian Antonio Susto
Abstract:
Imposing input-output constraints in multi-layer perceptrons (MLPs) plays a pivotal role in many real-world applications. Monotonicity in particular is a common requirement in applications that need transparent and robust machine learning models. Conventional techniques for imposing monotonicity in MLPs by construction involve the use of non-negative weight constraints and bounded activation functions, which poses well-known optimization challenges. In this work, we generalize previous theoretical results, showing that MLPs with non-negative weight constraints and activations that saturate on alternating sides are universal approximators for monotonic functions. Additionally, we show an equivalence between the saturation side of the activations and the sign of the weight constraint. This connection allows us to prove that MLPs with convex monotone activations and non-positive constrained weights also qualify as universal approximators, in contrast to their non-negative constrained counterparts. These results provide theoretical grounding for the empirical effectiveness observed in previous works, while leading to possible architectural simplifications. Moreover, to further alleviate the optimization difficulties, we propose an alternative formulation that allows the network to adjust its activations according to the sign of the weights. This eliminates the requirement for weight reparameterization, easing initialization and improving training stability. Experimental evaluation reinforces the validity of the theoretical results, showing that our novel approach compares favorably to traditional monotonic architectures.
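The basic construction behind the abstract, non-negative weights composed with monotone activations yield a non-decreasing network, can be checked numerically. A toy sketch with our own names, not the authors' architecture (non-negativity is imposed here by taking absolute values of random weights):

```python
import math
import random

def monotone_mlp(x, layers):
    """Forward pass of a scalar-input MLP whose weights are non-negative
    and whose activation (tanh) is monotone, hence output is non-decreasing
    in x: a composition of non-decreasing maps."""
    h = [x]
    for W, b in layers:
        h = [math.tanh(sum(w * v for w, v in zip(row, h)) + bi)
             for row, bi in zip(W, b)]
    return sum(h)  # non-negative (unit) output weights preserve monotonicity

random.seed(0)
# Two layers (1 -> 4 -> 3) with weights forced non-negative via abs().
layers = [([[abs(random.gauss(0, 1))] for _ in range(4)],
           [random.gauss(0, 1) for _ in range(4)]),
          ([[abs(random.gauss(0, 1)) for _ in range(4)] for _ in range(3)],
           [random.gauss(0, 1) for _ in range(3)])]
ys = [monotone_mlp(x / 10, layers) for x in range(-50, 50)]
```

Sorting the inputs and checking that the outputs never decrease confirms the construction; the paper's contribution concerns which activation/sign pairings of this family are universal approximators and how to train them stably.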
Paperid:1949
Authors:Yangyi Shen · Jincheng Zhou · Beatrice Bevilacqua · Joshua Robinson · Charilaos Kanatsoulis · Jure Leskovec · Bruno Ribeiro
Abstract:
Traditional Graph Neural Networks (GNNs) cannot generalize to new graphs with node attributes different from the training ones, making zero-shot generalization across different node attribute domains an open challenge in graph machine learning. In this paper, we propose STAGE, which encodes statistical dependencies between attributes rather than individual attribute values, which may differ in test graphs. By assuming these dependencies remain invariant under changes in node attributes, STAGE achieves provable generalization guarantees for a family of domain shifts. Empirically, STAGE demonstrates strong zero-shot performance on medium-sized datasets: when trained on multiple graph datasets with different attribute spaces (varying in types and number) and evaluated on graphs with entirely new attributes, STAGE achieves a relative improvement in Hits@1 of 40% to 103% in link prediction and a 10% improvement in node classification compared to state-of-the-art baselines.
Paperid:1950
Authors:Haifeng Wen · Hong XING · Osvaldo Simeone
Abstract:
Post-hoc calibration of pre-trained models is critical for ensuring reliable inference, especially in safety-critical domains such as healthcare. Conformal Prediction (CP) offers a robust post-hoc calibration framework, providing distribution-free statistical coverage guarantees for prediction sets by leveraging held-out datasets. In this work, we address a decentralized setting where each device has limited calibration data and can communicate only with its neighbors over an arbitrary graph topology. We propose two message-passing-based approaches for achieving reliable inference via CP: quantile-based distributed conformal prediction (Q-DCP) and histogram-based distributed conformal prediction (H-DCP). Q-DCP employs distributed quantile regression enhanced with tailored smoothing and regularization terms to accelerate convergence, while H-DCP uses a consensus-based histogram estimation approach. Through extensive experiments, we investigate the trade-offs between hyperparameter tuning requirements, communication overhead, coverage guarantees, and prediction set sizes across different network topologies.
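For context, the centralized split-conformal calibration step that distributed CP methods must recover from scattered calibration data has a standard closed form. A minimal sketch (our own helper names; this is the textbook centralized baseline, not the paper's Q-DCP or H-DCP algorithm):

```python
import math

def conformal_quantile(scores, alpha):
    """The ceil((n+1)(1-alpha))-th smallest calibration nonconformity
    score; thresholding at this value yields marginal coverage >= 1-alpha."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(scores)[min(k, n) - 1]

def prediction_set(candidate_scores, qhat):
    """Labels whose nonconformity score does not exceed the threshold."""
    return [y for y, s in candidate_scores.items() if s <= qhat]
```

In the decentralized setting of the paper, no single device holds all the scores, so this quantile must be approximated by message passing, via distributed quantile regression (Q-DCP) or consensus over histograms (H-DCP).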
Paperid:1951
Authors:Quan Nguyen · Minh Vu · Truc Nguyen · My T. Thai
Abstract:
Federated Learning (FL) enables collaborative learning among clients via a coordinating server while avoiding direct data sharing, offering a perceived solution to preserve privacy. However, recent studies on Membership Inference Attacks (MIAs) have challenged this notion, showing high success rates against unprotected training data. While local differential privacy (LDP) is widely regarded as a gold standard for privacy protection in data analysis, most studies on MIAs either neglect LDP or fail to provide theoretical guarantees for attack success against LDP-protected data. To address this gap, we derive theoretical lower bounds for the success rates of low-polynomial-time MIAs that exploit vulnerabilities in fully connected or self-attention layers, regardless of the LDP mechanism used. We establish that even when data are protected by LDP, privacy risks persist, depending on the privacy budget. Practical evaluations on models like ResNet and Vision Transformer confirm considerable privacy risks, revealing that the noise required to mitigate these attacks significantly degrades models' utility.
Paperid:1952
Authors:Subhodh Kotekal
Abstract:
Given independent and identically distributed data from a compactly supported, $\alpha$-Hölder density $f$, we study estimation of the Fisher information of the Gaussian-smoothed density $f*\varphi_t$, where $\varphi_t$ is the density of $N(0, t)$. We derive the minimax rate, including the sharp dependence on $t$, and show that some simple, plug-in type estimators are optimal for $t > 0$, even though extra debiasing steps are widely employed in the literature to achieve the sharp rate in the unsmoothed ($t = 0$) case. Due to our result's sharp characterization of the scaling in $t$, plug-in estimators of the mutual information and entropy are shown to achieve the parametric rate by way of the I-MMSE and de Bruijn's identities.
Paperid:1953
Authors:Xue Zhao · Qinying Gu · Xinbing Wang · Chenghu Zhou · Nanyang Ye
Abstract:
Improving the generalization of multi-camera 3D object detection is essential for safe autonomous driving in the real world. In this paper, we consider a realistic yet more challenging scenario, which aims to improve generalization when only a single source domain is available for training, since gathering data from diverse domains and collecting annotations is time-consuming and labor-intensive. To this end, we propose the Fourier Cross-View Learning (FCVL) framework, including Fourier Hierarchical Augmentation (FHiAug), an augmentation strategy in the frequency domain to boost domain diversity, and a Fourier Cross-View Semantic Consistency Loss to facilitate the model's learning of more domain-invariant features from adjacent perspectives. Furthermore, we provide theoretical guarantees via augmentation graph theory. To the best of our knowledge, this is the first study to explore generalizable multi-camera 3D object detection with a single source. Extensive experiments on various testing domains demonstrate that our approach achieves the best performance among various domain generalization methods.
Paperid:1954
Authors:Antonia Wüst · Tim Tobiasch · Lukas Helff · Inga Ibs · Wolfgang Stammer · Devendra Dhami · Constantin Rothkopf · Kristian Kersting
Abstract:
Recently, newly developed Vision-Language Models (VLMs), such as OpenAI's o1, have emerged, seemingly demonstrating advanced reasoning capabilities across text and image modalities. However, the depth of these advances in language-guided perception and abstract reasoning remains underexplored, and it is unclear whether these models can truly live up to their ambitious promises. To assess the progress and identify shortcomings, we enter the wonderland of Bongard problems, a set of classic visual reasoning puzzles that require human-like abilities of pattern recognition and abstract reasoning. With our extensive evaluation setup, we show that while VLMs occasionally succeed in identifying discriminative concepts and solving some of the problems, they frequently falter. Surprisingly, even elementary concepts that may seem trivial to humans, such as simple spirals, pose significant challenges. Moreover, when explicitly asked to recognize ground truth concepts, they continue to falter, suggesting not only a lack of understanding of these elementary visual concepts but also an inability to generalize to unseen concepts. We compare the results of VLMs to human performance and observe that a significant gap remains between human visual reasoning capabilities and machine cognition.
Paperid:1955
Authors:Jonas Möller · Lukas Pirch · Felix Weissberg · Sebastian Baunsgaard · Thorsten Eisenhofer · Konrad Rieck
Abstract:
Linear algebra is a cornerstone of neural network inference. The efficiency of popular frameworks, such as TensorFlow and PyTorch, critically depends on backend libraries providing highly optimized matrix multiplications and convolutions. A diverse range of these backends exists across platforms, including Intel MKL, Nvidia CUDA, and Apple Accelerate. Although these backends provide equivalent functionality, subtle variations in their implementations can lead to seemingly negligible differences during inference. In this paper, we investigate these minor discrepancies and demonstrate how they can be selectively amplified by adversaries. Specifically, we introduce Chimera examples, inputs to models that elicit conflicting predictions depending on the employed backend library. These inputs can even be constructed with integer values, creating a vulnerability exploitable from real-world input domains. We analyze the prevalence and extent of the underlying attack surface and propose corresponding defenses to mitigate this threat.
Paperid:1956
Authors:Lijie Hu · Chenyang Ren · Zhengyu Hu · Hongbin Lin · Chenglong Wang · Zhen Tan · Weimin Lyu · Jingfeng ZHANG · Hui Xiong · Di Wang
Abstract:
Concept Bottleneck Models (CBMs) have garnered much attention for their ability to elucidate the prediction process through a human-understandable concept layer. However, most previous studies focused on cases where the data, including concepts, are clean. In many scenarios, we often need to remove or insert training data or new concepts in trained CBMs for various reasons, such as privacy concerns, data mislabelling, spurious concepts, and concept annotation errors. Thus, the challenge of deriving efficient editable CBMs without retraining from scratch persists, particularly in large-scale applications. To address these challenges, we propose Editable Concept Bottleneck Models (ECBMs). Specifically, ECBMs support three different levels of data removal: concept-label-level, concept-level, and data-level. ECBMs enjoy mathematically rigorous closed-form approximations derived from influence functions that obviate the need for retraining. Experimental results demonstrate the efficiency and effectiveness of our ECBMs, affirming their adaptability within the realm of CBMs.
Paperid:1957
Authors:Xiang Fu · Brandon Wood · Luis Barroso-Luque · Daniel S. Levine · Meng Gao · Misko Dzamba · Larry Zitnick
Abstract:
Machine learning interatomic potentials (MLIPs) have become increasingly effective at approximating quantum mechanical calculations at a fraction of the computational cost. However, lower errors on held-out test sets do not always translate to improved results on downstream physical property prediction tasks. In this paper, we propose testing MLIPs on their practical ability to conserve energy during molecular dynamics simulations. For models that pass this test, improved correlations are found between test errors and their performance on physical property prediction tasks. We identify choices which may lead to models failing this test, and use these observations to improve upon highly expressive models. The resulting model, eSEN, provides state-of-the-art results on a range of physical property prediction tasks, including materials stability prediction, thermal conductivity prediction, and phonon calculations.
Paperid:1958
Authors:Haotian Ni · Yake Wei · Hang Liu · Gong Chen · Chong Peng · Hao Lin · Di Hu
Abstract:
Multimodal learning integrates data from multiple modalities, yet effectively fusing them remains a significant challenge, particularly in scenarios where modality quality varies across samples. Dynamic fusion strategies, such as the attention mechanism in Transformers, aim to address this challenge by adaptively assigning proper weights to modalities based on input data. However, in this work, we observe that the dynamic property of attention-based fusion diminishes when tokens from one modality become biased. This bias triggers a self-reinforcing cycle during training, where the model progressively overemphasizes the biased modality, leading to a widening distribution gap in attention keys across modalities. This gap, in turn, deactivates the dynamic adaptability of the attention mechanism. To address this issue, we propose the Query Rebalance Rotation (QRR) algorithm, which rebalances the query to effectively disrupt the self-reinforcing cycle and reduce the key distribution gap. By restoring balance in attention allocation, QRR revives the cooperative dynamics in the multimodal Transformer, allowing it to better handle modality quality variations. Extensive experiments validate the effectiveness of QRR and show that restoring dynamic adaptability is pivotal for advancing the broader capabilities of multimodal Transformers.
Paperid:1959
Authors:Yi Cai · Thibaud Ardoin · Gerhard Wunder
Abstract:
Feature attribution explains machine decisions by quantifying each feature's contribution. While numerous approaches rely on exact gradient measurements, recent work has adopted gradient estimation to derive explanatory information under query-level access, a restrictive yet more practical accessibility assumption known as the black-box setting. Following this direction, this paper introduces GEFA (Gradient-estimation-based Explanation For All), a general feature attribution framework leveraging proxy gradient estimation. Unlike the previous attempt that focused on explaining image classifiers, the proposed explainer derives feature attributions in a proxy space, making it generally applicable to arbitrary black-box models, regardless of input type. In addition to its close relationship with Integrated Gradients, our approach, a path method built upon estimated gradients, surprisingly produces unbiased estimates of Shapley Values. Compared to traditional sampling-based Shapley Value estimators, GEFA avoids potential information waste sourced from computing marginal contributions, thereby improving explanation quality, as demonstrated in quantitative evaluations across various settings.
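For context, the sampling-based Shapley Value estimators that GEFA is compared against typically average marginal contributions over random feature orderings. A minimal sketch of such a generic baseline (not GEFA itself), using a hypothetical additive value function for which the permutation estimator is exact:

```python
import random

def shapley_permutation(value, features, n_perms=200, seed=0):
    """Estimate Shapley values by averaging each feature's marginal
    contribution value(S + {i}) - value(S) over random permutations."""
    rng = random.Random(seed)
    est = {i: 0.0 for i in features}
    for _ in range(n_perms):
        perm = features[:]
        rng.shuffle(perm)
        coalition, prev = set(), value(set())
        for i in perm:
            coalition.add(i)
            cur = value(coalition)
            est[i] += cur - prev   # marginal contribution of i
            prev = cur
    return {i: v / n_perms for i, v in est.items()}

# Hypothetical additive game: v(S) = sum of fixed per-feature weights.
# For additive games the marginal contribution of i is always w[i],
# so the permutation estimator recovers the weights exactly.
w = {"x1": 2.0, "x2": -1.0, "x3": 0.5}
phi = shapley_permutation(lambda S: sum(w[i] for i in S), list(w))
print(phi)  # {'x1': 2.0, 'x2': -1.0, 'x3': 0.5}
```

For non-additive value functions this estimator is unbiased but noisy, which is the information-waste issue the abstract contrasts GEFA against.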
Paperid:1960
Authors:Ron Tsibulsky · Daniel Nevo · Uri Shalit
Abstract:
Despite the impressive advancements in modern machine learning, achieving robustness in Domain Generalization (DG) tasks remains a significant challenge. In DG, models are expected to perform well on samples from unseen test distributions (also called domains), by learning from multiple related training distributions. Most existing approaches to this problem rely on single-valued predictions, which inherently limit their robustness. We argue that set-valued predictors could be leveraged to enhance robustness across unseen domains, while also taking into account that these sets should be as small as possible. We introduce a theoretical framework defining successful set prediction in the DG setting, focusing on meeting a predefined performance criterion across as many domains as possible, and provide theoretical insights into the conditions under which such domain generalization is achievable. We further propose a practical optimization method compatible with modern learning architectures, that balances robust performance on unseen domains with small prediction set sizes. We evaluate our approach on several real-world datasets from the WILDS benchmark, demonstrating its potential as a promising direction for robust domain generalization.
Paperid:1961
Authors:David Geissbühler · Hatef Otroshi Shahreza · Sébastien Marcel
Abstract:
Face recognition models are trained on large-scale datasets, which raise privacy and ethical concerns. Lately, the use of synthetic data to complement or replace genuine data for the training of face recognition models has been proposed. While promising results have been obtained, it remains unclear whether generative models can yield diverse enough data for such tasks. In this work, we introduce a new method, inspired by the physical motion of soft particles subjected to stochastic Brownian forces, allowing us to sample identity distributions in a latent space under various constraints. We introduce three complementary algorithms, called Langevin, Dispersion, and DisCo, aimed at generating large synthetic face datasets. With these in hand, we generate several face datasets and benchmark them by training face recognition models, showing that data generated with our method exceeds the performance of previous GAN-based datasets and achieves competitive performance with state-of-the-art diffusion-based synthetic datasets. While diffusion models are known to memorize training data, we prevent leakage in our new synthetic datasets, paving the way for more responsible synthetic datasets.
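The Langevin dynamics behind the first of the three algorithms can be illustrated in one dimension. A generic overdamped Langevin sampler targeting a standard Gaussian (an illustration of the dynamics only, not the paper's identity-space algorithm; all parameters below are arbitrary choices):

```python
import math
import random

def langevin_sample(grad_U, x0=0.0, steps=5000, dt=0.01, seed=0):
    """Overdamped Langevin dynamics: x <- x - grad_U(x)*dt + sqrt(2*dt)*noise.
    Its stationary distribution is proportional to exp(-U(x))."""
    rng = random.Random(seed)
    x, traj = x0, []
    for _ in range(steps):
        x = x - grad_U(x) * dt + math.sqrt(2 * dt) * rng.gauss(0.0, 1.0)
        traj.append(x)
    return traj

# U(x) = x^2 / 2 targets a standard Gaussian, so grad_U(x) = x.
traj = langevin_sample(lambda x: x)
burn = traj[1000:]                       # discard burn-in
mean = sum(burn) / len(burn)
var = sum((v - mean) ** 2 for v in burn) / len(burn)
print(round(mean, 2), round(var, 2))     # roughly 0 and 1
```

The paper's setting replaces the scalar `x` with identity vectors in a generator's latent space and adds inter-particle forces and constraints, but the update rule has this same drift-plus-noise structure.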
Paperid:1962
Authors:Yaowenhu · Wenxuan Tu · Yue Liu · Xinhang Wan · Junyi Yan · Taichun Zhou · Xinwang Liu
Abstract:
Deep graph clustering (DGC), which aims to separate the nodes in an attribute graph into different clusters without supervision, has seen substantial potential in various industrial scenarios like community detection and recommendation. However, real-world attribute graphs, e.g., social networks and user-item interactions, are usually large-scale and attribute-missing. To solve these two problems, we propose a novel DGC method termed Complementary Multi-View Neighborhood Differentiation (CMV-ND), which pre-processes graph structural information into multiple views in a complete but non-redundant manner. Firstly, to ensure the completeness of the structural information, we propose a Recursive Neighborhood Search that recursively explores the graph's local structure by completely expanding node neighborhoods across different hop distances. Secondly, to eliminate the redundancy between neighborhoods at different hops, we introduce a Neighborhood Differential Strategy that ensures no overlapping nodes between the differential hop representations. Then, we construct $K+1$ complementary views from the $K$ differential hop representations and the features of the target node. Last, we apply existing multi-view clustering or DGC methods to the views. Experimental results on six widely used graph datasets demonstrate that CMV-ND significantly improves the performance of various methods.
Paperid:1963
Authors:Zixuan Hu · Yichun Hu · Xiaotong Li · SHIXIANG TANG · LINGYU DUAN
Abstract:
Wild Test-Time Adaptation (WTTA) is proposed to adapt a source model to unseen domains under extreme data scarcity and multiple shifts. Previous approaches mainly focused on sample selection strategies, while overlooking the fundamental problem of the underlying optimization. We first critically analyze the widely-adopted entropy minimization framework in WTTA and uncover its significant limitations in noisy optimization dynamics that substantially hinder adaptation efficiency. Through our analysis, we identify region confidence as a superior alternative to traditional entropy; however, its direct optimization remains computationally prohibitive for real-time applications. In this paper, we introduce a novel region-integrated method, ReCAP, that bypasses the lengthy process. Specifically, we propose a dynamic region modification scheme that adapts to evolving model space changes. Subsequently, we develop a finite-to-infinite asymptotic approximation that transforms the intractable region confidence into a tractable and upper-bounded proxy. These innovations significantly unlock the overlooked potential dynamics in local regions in a concise solution. Our extensive experiments demonstrate the consistent superiority of ReCAP over existing methods across various datasets and wild scenarios. The code will be publicly available.
Paperid:1964
Authors:Sanjeev Raja · Martin Šípka · Michael Psenka · Tobias Kreiman · Michal Pavelka · Aditi Krishnapriyan
Abstract:
Transition path sampling (TPS), which involves finding probable paths connecting two points on an energy landscape, remains a challenge due to the complexity of real-world atomistic systems. Current machine learning approaches rely on expensive training procedures and under-utilize growing quantities of atomistic data, limiting scalability and generalization. Generative models of atomistic conformational ensembles sample temporally independent states from energy landscapes, but their application to TPS remains mostly unexplored. In this work, we address TPS by interpreting candidate paths as trajectories sampled from stochastic dynamics induced by the learned score function of generative models, namely denoising diffusion and flow matching. Under these dynamics, finding high-likelihood transition paths becomes equivalent to minimizing the Onsager-Machlup (OM) action functional, enabling us to repurpose pre-trained generative models for TPS in a zero-shot fashion. We demonstrate our approach on a Müller-Brown potential and several fast-folding proteins, where we obtain diverse, physically realistic transition pathways, as well as tetrapeptides, where we demonstrate successful TPS on systems not seen by the generative model during training. Our method can be easily incorporated into new generative models, making it practically relevant as models continue to scale and improve.
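For reference, for overdamped dynamics $dx = -\nabla U(x)\,dt + \sqrt{2\varepsilon}\,dW$, the Onsager-Machlup action takes, up to additive constants and a curvature (Laplacian) correction term, the familiar form

```latex
S_{\mathrm{OM}}[x] \;=\; \frac{1}{4\varepsilon} \int_0^T \left\lVert \dot{x}(t) + \nabla U\bigl(x(t)\bigr) \right\rVert^2 \, dt ,
```

so minimizing it favors paths that follow the drift $-\nabla U$ except where the transition forces them uphill.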
Paperid:1965
Authors:Ganyu Wang · Jinjie Fang · Maxwell (Juncheng) Yin · Bin Gu · Xi Chen · Boyu Wang · Yi Chang · Charles X. Ling
Abstract:
Black-Box Discrete Prompt Learning (BDPL) is a prompt-tuning method that optimizes discrete prompts without accessing model parameters or gradients, making prompt tuning on a cloud-based Large Language Model (LLM) feasible. Adapting Federated Learning (FL) to BDPL could further enhance prompt tuning performance by leveraging data from diverse sources. However, all previous research on federated black-box prompt tuning has neglected the substantial query cost associated with the cloud-based LLM service. To address this gap, we conducted a theoretical analysis of query efficiency within the context of federated black-box prompt tuning. Our findings revealed that degrading FedAvg to activate only one client per round, a strategy we call \textit{FedOne}, enables optimal query efficiency in federated black-box prompt learning. Building on this insight, we proposed the FedOne framework, a federated black-box discrete prompt learning method designed to maximize query efficiency when interacting with cloud-based LLMs. We conducted numerical experiments on various aspects of our framework, demonstrating a significant improvement in query efficiency, which aligns with our theoretical results.
Paperid:1966
Authors:Haebin Shin · ji · Xiao Liu · Yeyun Gong
Abstract:
Using large teacher models to guide the training of smaller student models has become the prevailing paradigm for efficient and effective learning. However, vocabulary mismatches between teacher and student language models pose significant challenges in language modeling, resulting in divergent token sequences and output distributions. To overcome these limitations, we propose Vocabulary-agnostic Teacher Guided Language Modeling (VocAgnoLM), a novel approach that bridges the gap caused by vocabulary mismatch through two key methods: (1) Token-level Lexical Alignment, which aligns token sequences across mismatched vocabularies, and (2) Teacher Guided Loss, which leverages the teacher model's loss to guide effective student training. We demonstrate its effectiveness in language modeling with a 1B student model using various 7B teacher models with different vocabularies. Notably, with Qwen2.5-Math-Instruct, a teacher model sharing only about 6% of its vocabulary with TinyLlama, VocAgnoLM achieves a 46% performance improvement compared to naive continual pretraining. Furthermore, we demonstrate that VocAgnoLM consistently benefits from stronger teacher models, providing a robust solution to vocabulary mismatches in language modeling.
Paperid:1967
Authors:Marvin Li · Aayush Karan · Sitan Chen
Abstract:
Large language models can exhibit undesirable and unexpected behavior in the blink of an eye. In a recent Anthropic demo, Claude switched from coding to Googling pictures of Yellowstone, and these sudden shifts in behavior have also been observed in reasoning patterns and jailbreaks. This phenomenon is not unique to autoregressive models: in diffusion models, key features of the final output are decided in narrow “critical windows” of the generation process. In this work we develop a simple, unifying theory to explain this phenomenon. We show that it emerges generically as the generation process localizes to a subpopulation of the distribution it models. While critical windows have been studied at length in diffusion models, existing theory heavily relies on strong distributional assumptions and the particulars of Gaussian diffusion. In contrast to existing work, our theory (1) applies to autoregressive and diffusion models; (2) makes no distributional assumptions; (3) quantitatively improves previous bounds even when specialized to diffusions; and (4) requires basic tools and no Girsanov or statistical-physics-based machinery.
Paperid:1968
Authors:Haoran Luo · Haihong E · Yikai Guo · Qika Lin · Xiaobao Wu · Xinyu Mu · Wenhao Liu · Meina Song · Yifan Zhu · Anh Tuan Luu
Abstract:
Knowledge Base Question Answering (KBQA) aims to answer natural language questions with a large-scale structured knowledge base (KB). Despite advancements with large language models (LLMs), KBQA still faces challenges in weak KB awareness, imbalance between effectiveness and efficiency, and high reliance on annotated data. To address these challenges, we propose KBQA-o1, a novel agentic KBQA method with Monte Carlo Tree Search (MCTS). It introduces a ReAct-based agent process for stepwise logical form generation with KB environment exploration. Moreover, it employs MCTS, a heuristic search method driven by policy and reward models, to balance agentic exploration's performance and search space. With heuristic exploration, KBQA-o1 generates high-quality annotations for further improvement by incremental fine-tuning. Experimental results show that KBQA-o1 outperforms previous low-resource KBQA methods with limited annotated data, boosting the Llama-3.1-8B model's GrailQA F1 performance to 78.5%, compared to 48.5% for the previous state-of-the-art method with GPT-3.5-turbo.
Paperid:1969
Authors:Rohan Ghuge · Vidya Muthukumar · Sahil Singla
Abstract:
We study *online multicalibration*, a framework for ensuring calibrated predictions across multiple groups in adversarial settings, across $T$ rounds. Although online calibration is typically studied in the $\ell_1$ norm, prior approaches to online multicalibration have taken the indirect approach of obtaining rates in other norms (such as $\ell_2$ and $\ell_{\infty}$) and then transferred these guarantees to $\ell_1$ at additional loss. In contrast, we propose a direct method that achieves improved and oracle-efficient rates of $\widetilde{\mathcal{O}}(T^{-1/3})$ and $\widetilde{\mathcal{O}}(T^{-1/4})$ respectively, for online $\ell_1$-multicalibration. Our key insight is a novel reduction of online $\ell_1$-multicalibration to an online learning problem with product-based rewards, which we refer to as *online linear-product optimization* ($\mathtt{OLPO}$). To obtain the improved rate of $\widetilde{\mathcal{O}}(T^{-1/3})$, we introduce a linearization of $\mathtt{OLPO}$ and design a no-regret algorithm for this linearized problem. Although this method guarantees the desired sublinear rate (nearly matching the best rate for online calibration), it becomes computationally expensive when the group family $\mathcal{H}$ is large or infinite, since it enumerates all possible groups. To address scalability, we propose a second approach to $\mathtt{OLPO}$ that makes only a polynomial number of calls to an offline optimization (*multicalibration evaluation*) oracle, resulting in *oracle-efficient* online $\ell_1$-multicalibration with a corresponding rate of $\widetilde{\mathcal{O}}(T^{-1/4})$. Our framework also extends to certain infinite families of groups (e.g., all linear functions on the context space) by exploiting a $1$-Lipschitz property of the $\ell_1$-multicalibration error with respect to $\mathcal{H}$.
Paperid:1970
Authors:Harvineet Singh · Fan Xia · Alexej Gossmann · Andrew Chuang · Julian Hong · Jean Feng
Abstract:
Machine learning (ML) models frequently experience performance degradation when deployed in new contexts. Such degradation is rarely uniform: some subgroups may suffer large performance decay while others may not. Understanding where and how large differences in performance arise is critical for designing targeted corrective actions that mitigate decay for the most affected subgroups while minimizing any unintended effects. Current approaches do not provide such detailed insight, as they either (i) explain how average performance shifts arise or (ii) identify adversely affected subgroups without insight into how this occurred. To this end, we introduce a Subgroup-scanning Hierarchical Inference Framework for performance drifT (SHIFT). SHIFT first asks "Is there any subgroup with unacceptably large performance decay due to covariate/outcome shifts?" (Where?) and, if so, dives deeper to ask "Can we explain this using more detailed variable(subset)-specific shifts?" (How?). In real-world experiments, we find that SHIFT identifies interpretable subgroups affected by performance decay, and suggests targeted actions that effectively mitigate the decay.
Paperid:1971
Authors:Dongchao Yang · Songxiang Liu · Haohan Guo · Jiankun Zhao · Yuanyuan Wang · Helin Wang · Zeqian Ju · Xubo Liu · Xueyuan Chen · Xu Tan · Xixin Wu · Helen M Meng
Abstract:
Recent advancements in audio language models have underscored the pivotal role of audio tokenization, which converts audio signals into discrete tokens, thereby facilitating the application of language model architectures to the audio domain. In this study, we introduce ALMTokenizer, a novel low-bitrate and semantically rich audio codec tokenizer for audio language models. Prior methods, such as Encodec, typically encode individual audio frames into discrete tokens without considering the use of context information across frames. Unlike these methods, we introduce a novel query-based compression strategy to capture holistic information with a set of learnable query tokens by explicitly modeling the context information across frames. This design not only enables the codec model to capture more semantic information but also encodes the audio signal with fewer token sequences. Additionally, to enhance the semantic information in audio codec models, we introduce the following: (1) a masked autoencoder (MAE) loss, (2) vector quantization based on semantic priors, and (3) an autoregressive (AR) prediction loss. As a result, ALMTokenizer achieves competitive reconstruction performance relative to state-of-the-art approaches while operating at a lower bitrate. Within the same audio language model framework, ALMTokenizer outperforms previous tokenizers in audio understanding and generation tasks.\footnote{https://almtokenizer.github.io/demo/}
Paperid:1972
Authors:Ning LU · Shengcai Liu · Jiahao Wu · Weiyu CHEN · Zhirui Zhang · Yew Soon ONG · Qi Wang · Ke Tang
Abstract:
Large language models (LLMs) have shown great potential as general-purpose AI assistants across various domains. To fully leverage this potential in specific applications, many companies provide fine-tuning API services, enabling users to upload their own data for LLM customization. However, fine-tuning services introduce a new safety threat: user-uploaded data, whether harmful or benign, can break the model’s alignment, leading to unsafe outputs. Moreover, existing defense methods struggle to address the diversity of fine-tuning datasets (e.g., varying sizes, tasks), often sacrificing utility for safety or vice versa. To address this issue, we propose Safe Delta, a safety-aware post-training defense method that adjusts the delta parameters (i.e., the parameter change before and after fine-tuning). Specifically, Safe Delta estimates the safety degradation, selects delta parameters to maximize utility while limiting overall safety loss, and applies a safety compensation vector to mitigate residual safety loss. Through extensive experiments on four diverse datasets with varying settings, our approach consistently preserves safety while ensuring that the utility gain from benign datasets remains unaffected.
Paperid:1973
Authors:Jiecheng Lu · Xu Han · Yan Sun · Shihao Yang
Abstract:
We propose a Weighted Autoregressive Varying gatE (WAVE) attention mechanism equipped with both Autoregressive (AR) and Moving-average (MA) components. It can adapt to various attention mechanisms, enhancing and decoupling their ability to capture long-range and local temporal patterns in time series data. In this paper, we first demonstrate that, for the time series forecasting (TSF) task, the previously overlooked decoder-only autoregressive Transformer model can achieve results comparable to the best baselines when appropriate tokenization and training methods are applied. Moreover, inspired by the ARMA model from statistics and recent advances in linear attention, we introduce the full ARMA structure into existing autoregressive attention mechanisms. By using an indirect MA weight generation method, we incorporate the MA term while maintaining the time complexity and parameter size of the underlying efficient attention models. We further explore how indirect parameter generation can produce implicit MA weights that align with the modeling requirements for local temporal impacts. Experimental results show that WAVE attention incorporating the ARMA structure consistently improves the performance of various AR attentions on TSF tasks, achieving state-of-the-art results.
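The ARMA structure borrowed from statistics combines an autoregressive term on past values with a moving-average term on past residuals; a minimal ARMA(1,1) one-step forecaster makes the decoupling concrete (a statistical illustration only, not the WAVE attention mechanism; the coefficients and series are hypothetical):

```python
def arma_forecast(series, phi=0.6, theta=0.3):
    """One-step-ahead ARMA(1,1) forecasts:
        y_hat[t] = phi * y[t-1] + theta * eps[t-1],
    where eps[t-1] = y[t-1] - y_hat[t-1] is the previous residual.
    The AR term carries long-range structure; the MA term corrects
    for recent local shocks."""
    preds = [0.0]                     # no forecast for the first point
    eps_prev = series[0] - preds[0]   # residual at t = 0
    for t in range(1, len(series)):
        pred = phi * series[t - 1] + theta * eps_prev
        preds.append(pred)
        eps_prev = series[t] - pred   # residual observed at time t
    return preds

series = [1.0, 1.2, 0.9, 1.1, 1.0]
print([round(p, 3) for p in arma_forecast(series)])
# [0.0, 0.9, 0.81, 0.567, 0.82]
```

In WAVE, analogous AR and MA roles are played by attention terms with indirectly generated MA weights rather than fixed scalar coefficients.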
Paperid:1974
Authors:Rylan Schaeffer · Joshua Kazdan · John Hughes · Jordan Juravsky · Sara Price · Aengus Lynch · Erik Jones · Robert Kirk · Azalia Mirhoseini · Sanmi Koyejo
Abstract:
Recent research across mathematical problem solving, proof assistant programming and multimodal jailbreaking documents a striking finding: when (multimodal) language models tackle a suite of tasks with multiple attempts per task -- succeeding if any attempt is correct -- the negative log of the average success rate scales as a power law in the number of attempts. In this work, we identify an apparent puzzle: a simple mathematical calculation predicts that on each problem, the failure rate should fall exponentially with the number of attempts. We confirm this prediction empirically, raising a question: from where does aggregate polynomial scaling emerge? We then answer this question by demonstrating that per-problem exponential scaling can be made consistent with aggregate polynomial scaling if the distribution of single-attempt success probabilities is heavy tailed, such that a small fraction of tasks with extremely low success probabilities collectively warp the aggregate success trend into a power law -- even as each problem scales exponentially on its own. We further demonstrate that this distributional perspective explains previously observed deviations from power law scaling, and provides a simple method for forecasting the power law exponent with an order of magnitude lower relative error, or equivalently, ${\sim}2-4$ orders of magnitude less inference compute. Overall, our work contributes to a better understanding of how neural language model performance improves with scaling inference compute and the development of scaling-predictable evaluations of (multimodal) language models.
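The distributional mechanism described above is easy to reproduce numerically: give every problem an exactly exponential per-problem failure rate, but draw single-attempt success probabilities from a heavy-tailed family, and the aggregate -log(average success rate) decays only polynomially in the number of attempts. A self-contained sketch (the probability distribution below is an arbitrary heavy-tailed choice, not the paper's fitted one):

```python
import math

# Heavy-tailed single-attempt success probabilities: p = u^4 with u
# uniform on (0, 1) puts a small fraction of problems at extremely
# low p. (i + 0.5)/n keeps every p strictly inside (0, 1).
n = 10_000
probs = [((i + 0.5) / n) ** 4 for i in range(n)]

def aggregate_neg_log_success(k):
    """-log of the average success rate across problems after k
    attempts; per problem, failure is (1 - p)^k, exactly exponential."""
    avg = sum(1 - (1 - p) ** k for p in probs) / n
    return -math.log(avg)

for k in (1, 10, 100, 1000):
    print(k, round(aggregate_neg_log_success(k), 3))
# Each problem's failure rate falls exponentially in k, yet for this
# p-distribution the aggregate behaves like -log(avg success) ~ k^(-1/4):
# a power law, exactly the warping effect the abstract describes.
```

The k^(-1/4) exponent is specific to the u^4 choice; heavier or lighter tails give different power-law exponents, which is what makes the exponent forecastable from the tail.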
Paperid:1975
Authors:Michael Albergo · Eric Vanden-Eijnden
Abstract:
We introduce the Non-Equilibrium Transport Sampler (NETS), an algorithm for sampling from unnormalized probability distributions. NETS builds on non-equilibrium sampling strategies that transport a simple base distribution into the target distribution in finite time, as pioneered in Neal's annealed importance sampling (AIS). In the continuous-time setting, this transport is accomplished by evolving walkers using Langevin dynamics with a time-dependent potential, while simultaneously evolving importance weights to debias their solutions following Jarzynski's equality. The key innovation of NETS is to add to the dynamics a learned drift term that offsets the need for these corrective weights by minimizing their variance through an objective that can be estimated without backpropagation and provably bounds the Kullback-Leibler divergence between the estimated and target distributions. NETS provides unbiased samples and features a tunable diffusion coefficient that can be adjusted after training to maximize the effective sample size. In experiments on standard benchmarks, high-dimensional Gaussian mixtures, and statistical lattice field theory models, NETS shows compelling performance.
Paperid:1976
Authors:Shira Vansover-Hager · Tomer Koren · Roi Livni
Abstract:
We study the out-of-sample performance of multi-pass stochastic gradient descent (SGD) in the fundamental stochastic convex optimization (SCO) model. While one-pass SGD is known to achieve an optimal $\Theta(1/\sqrt{n})$ excess population loss given a sample of size $n$, much less is understood about the multi-pass version of the algorithm which is widely used in practice. Somewhat surprisingly, we show that in the general non-smooth case of SCO, just a few epochs of SGD can already hurt its out-of-sample performance significantly and lead to overfitting. In particular, using a step size $\eta = \Theta(1/\sqrt{n})$, which gives the optimal rate after one pass, can lead to population loss as large as $\Omega(1)$ after just one additional pass. More generally, we show that the population loss from the second pass onward is of the order $\Theta(1/(\eta T) + \eta \sqrt{T})$, where $T$ is the total number of steps. These results reveal a certain phase transition in the out-of-sample behavior of SGD after the first epoch, as well as a sharp separation between the rates of overfitting in the smooth and non-smooth cases of SCO. Additionally, we extend our results to with-replacement SGD, proving that the same asymptotic bounds hold after $O(n \log n)$ steps. Finally, we also prove a lower bound of $\Omega(\eta \sqrt{n})$ on the generalization gap of one-pass SGD in dimension $d = \smash{\widetilde O}(n)$, improving on recent results of Koren et al. (2022) and Schliserman et al. (2024).
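To make the stated $\Theta(1/(\eta T) + \eta \sqrt{T})$ rate concrete, one can balance its two terms over the step size (an algebraic aside on the displayed rate, not an additional claim from the paper):

```latex
\frac{d}{d\eta}\left(\frac{1}{\eta T} + \eta\sqrt{T}\right)
= -\frac{1}{\eta^2 T} + \sqrt{T} = 0
\;\Longrightarrow\;
\eta_\star = T^{-3/4},
\qquad
\frac{1}{\eta_\star T} + \eta_\star\sqrt{T} = 2\,T^{-1/4},
```

so no fixed step size drives the multi-pass population loss below order $T^{-1/4}$ under this bound.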
Paperid:1977
Authors:Audrey Huang · Adam Block · Qinghua Liu · Nan Jiang · Akshay Krishnamurthy · Dylan Foster
Abstract:
Recent work on inference-time alignment has established benefits of increasing inference-time computation in language models, but naively scaling compute through techniques like Best-of-N sampling can cause performance to degrade due to reward hacking. Toward a theoretical understanding of how to best leverage additional computation, we formalize inference-time alignment as improving a pre-trained policy’s responses for a prompt of interest, given access to an imperfect reward model. We analyze the performance of inference-time alignment algorithms in terms of (i) response quality, and (ii) compute, and provide new results that highlight the importance of the pre-trained policy’s coverage over high-quality responses for performance and compute scaling: (1) We show that Best-of-N alignment with an ideal N can achieve optimal performance under stringent notions of coverage, but provably suffers from reward hacking when N is large, and fails to achieve tight guarantees under more realistic coverage conditions; (2) We introduce InferenceTimePessimism, a new algorithm which mitigates reward hacking through deliberate use of inference-time compute, implementing pessimism in the face of uncertainty; we prove that its performance is optimal and scaling-monotonic, i.e., does not degrade as N increases. We complement our theoretical results with experiments that demonstrate the practicality of our algorithm across a variety of tasks and models.
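Best-of-N sampling as analyzed above is simple to state in code: draw N responses from the base policy and return the one an imperfect reward model scores highest. A minimal sketch with hypothetical Gaussian true qualities and a noisy proxy reward (the reward hacking the paper proves requires more structure than this toy; here the noisy proxy merely dilutes the gain from larger N):

```python
import random

def best_of_n(n, rng):
    """Sample n candidates; each has a true quality and a proxy score
    (true quality + reward-model error). Select by the proxy, as an
    imperfect reward model would, and report the selected true quality."""
    cands = []
    for _ in range(n):
        true_q = rng.gauss(0.0, 1.0)
        proxy = true_q + rng.gauss(0.0, 2.0)   # noisy reward model
        cands.append((proxy, true_q))
    proxy_best, true_of_best = max(cands)       # argmax over proxy score
    return true_of_best

rng = random.Random(0)
trials = 2000
avgs = {}
for n in (1, 4, 64):
    avgs[n] = sum(best_of_n(n, rng) for _ in range(trials)) / trials
    print(n, round(avgs[n], 2))
# With noise std twice the quality std, the top proxy score at large n
# is mostly reward-model error, so average true quality grows only
# slowly in n despite the 64x increase in compute.
```

Capping the true quality or correlating the error with it would turn this dilution into the outright degradation that motivates InferenceTimePessimism.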
Paperid:1978
Authors:Benjamin Leblanc · Mathieu Bazinet · Nathaniel D'Amours · Alexandre Drouin · Pascal Germain
Abstract:
Both the PAC-Bayesian and Sample Compress learning frameworks have been shown instrumental for deriving tight (non-vacuous) generalization bounds for neural networks. We leverage these results in a meta-learning scheme, relying on a hypernetwork that outputs the parameters of a downstream predictor from a dataset input. The originality of our approach lies in the investigated hypernetwork architectures that encode the dataset before decoding the parameters: (1) a PAC-Bayesian encoder that expresses a posterior distribution over a latent space, (2) a Sample Compress encoder that selects a small sample of the dataset input along with a message from a discrete set, and (3) a hybrid between both approaches motivated by a new Sample Compress theorem handling continuous messages. The latter theorem exploits the pivotal information transiting at the encoder-decoder junction in order to compute generalization guarantees for each downstream predictor obtained by our meta-learning scheme.
Paperid:1979
Authors:Yuhan Ye · Ying Cui · Jingyi Wang
Abstract:
We propose an online adaptive sampling algorithm for solving stochastic nonsmooth difference-of-convex (DC) problems under time-varying distributions. At each iteration, the algorithm relies solely on data generated from the current distribution and employs distinct adaptive sampling rates for the convex and concave components of the DC function, a novel design guided by our theoretical analysis. We show that, under proper conditions on the convergence of distributions, the algorithm converges subsequentially to DC critical points almost surely. Furthermore, the sample size requirement of our proposed algorithm matches the results achieved in the smooth case or when a measurable subgradient selector is available, both under static distributions. A key element of this analysis is the derivation of a novel $O(\sqrt{p/n})$ pointwise convergence rate (modulo logarithmic factors) for the sample average approximation of subdifferential mappings, where $p$ is the dimension of the variable and $n$ is the sample size -- a result of independent interest. Numerical experiments confirm that the proposed algorithm is both efficient and effective for addressing stochastic nonsmooth problems.
Paperid:1980
Authors:Yunxia Wang · Fuyuan CAO · Kui Yu · Jiye Liang
Abstract:
Federated causal structure learning aims to discover causal relationships between variables from data stored on individual clients, with privacy concerns. Most existing methods assume identical variable sets across clients and present federated strategies for aggregating local updates on the central server. However, in practice, variables observed by clients are often not identical but overlapping, and non-overlapping variables may introduce spurious dependencies. Moreover, aggregating local updates typically reflects only the overall quality of a local graph, ignoring the varying importance of relationships between variables within the graph. In this paper, we consider federated causal structure learning with non-identical variable sets, aiming to design an effective strategy for aggregating “correct” and “good” relationships between variables during collaborative learning. Specifically, we first construct theories to detect spurious dependencies, extracting “correct” relationships for each local graph. Furthermore, we define stable relationships as those that are both “correct” and “good” across multiple graphs, and finally, we propose a two-level priority selection strategy to handle spurious dependencies and aggregate local updates, obtaining a global causal graph over the integrated variables. Experimental results on synthetic, benchmark and real-world data demonstrate the effectiveness of our method.
Paperid:1981
Authors:Zongzhao Li · Jiacheng Cen · Bing Su · Wenbing Huang · Tingyang Xu · Yu Rong · Deli Zhao
Abstract:
Accurately predicting 3D structures and dynamics of physical systems remains a critical challenge in scientific applications, due not only to the need for $\mathrm{E}(3)$-equivariance but also to the requirement to integrate broader external knowledge. Although existing approaches that rely on geometric Graph Neural Networks (GNNs) effectively enforce $\mathrm{E}(3)$-equivariance, they often fall short in leveraging broader external information. Conversely, while the direct application of Large Language Models (LLMs) can incorporate such knowledge, they lack spatial reasoning and equivariance properties. In our proposed EquiLLM, we adopt an architecture separation strategy that employs a pretrained LLM as a sophisticated invariant feature processor and a geometric GNN as an equivariant encoder to handle equivariant features. An equivariant adapter then produces the final invariant \& equivariant output and ensures $\mathrm{E}(3)$-equivariance while leveraging rich domain knowledge from LLMs. Experimental results demonstrate that EquiLLM delivers significant improvements across molecular dynamics simulation, human motion simulation, and antibody design, highlighting its promising generalizability.
Paperid:1982
Authors:Yongjian Zhong · Liao Zhu · Hieu Vu · Bijaya Adhikari
Abstract:
Subgraph neural networks have recently gained prominence for various subgraph-level predictive tasks. However, existing methods either \emph{1)} apply simple standard pooling over graph convolutional networks, failing to capture essential subgraph properties, or \emph{2)} rely on rigid subgraph definitions, leading to suboptimal performance. Moreover, these approaches fail to model long-range dependencies both between and within subgraphs—a critical limitation, as many real-world networks contain subgraphs of varying sizes and connectivity patterns. In this paper, we propose a novel implicit subgraph neural network, the first of its kind, designed to capture dependencies across subgraphs. Our approach also integrates label-aware subgraph-level information. We formulate implicit subgraph learning as a bilevel optimization problem and develop a provably convergent algorithm that requires fewer gradient estimations than standard bilevel optimization methods. We evaluate our approach on real-world networks against state-of-the-art baselines, demonstrating its effectiveness and superiority.
Paperid:1983
Authors:Carlota Parés Morlans · Michelle Yi · Claire Chen · Sarah A Wu · Rika Antonova · Tobias Gerstenberg · Jeannette Bohg
Abstract:
Tasks that involve complex interactions between objects with unknown dynamics make planning before execution difficult. Instead, these tasks require agents to iteratively improve their actions after actively exploring causes and effects in the environment. For these types of tasks, we propose a method that leverages Bayesian optimization to reason about causality via a Physics-Informed Kernel, helping to guide an efficient search for the best next action. Experimental results on the Virtual Tools and Phyre physical reasoning benchmarks show that our method surpasses state-of-the-art results, requiring fewer actions to reach the goal.
Paperid:1984
Authors:Daeho Um · Sunoh Kim · Jiwoong Park · Jongin Lim · Seong Jin Ahn · Seulki Park
Abstract:
In this paper, we address learning tasks on graphs with missing features, enhancing the applicability of graph neural networks to real-world graph-structured data. We identify a critical limitation of existing imputation methods based on feature propagation: they produce channels in which values are nearly identical across nodes, and these low-variance channels contribute very little to performance in graph learning tasks. To overcome this issue, we introduce synthetic features that target the root cause of low-variance channel production, thereby increasing variance in these channels. By preventing propagation-based imputation methods from generating meaningless feature values shared across all nodes, our synthetic feature propagation scheme mitigates significant performance degradation, even under extreme missing rates. Extensive experiments demonstrate the effectiveness of our approach across various graph learning tasks with missing features, ranging from low to extremely high missing rates. Additionally, we provide both empirical evidence and theoretical proof to validate the low-variance problem.
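As a toy illustration of the propagation-based imputation that this abstract critiques (not the proposed synthetic-feature scheme), the sketch below iteratively replaces each missing entry by the average of its neighbors while clamping observed entries; the data layout is an illustrative assumption:

```python
def feature_propagation(adj, x, known, iters=50):
    """Impute missing scalar node features by neighbor averaging.

    adj:   adjacency list (adj[i] = neighbors of node i)
    x:     per-node feature, None where missing
    known: set of node ids whose features are observed (kept fixed)
    """
    vals = [xi if xi is not None else 0.0 for xi in x]
    for _ in range(iters):
        new = []
        for i, nbrs in enumerate(adj):
            if i in known:
                new.append(vals[i])  # clamp observed features
            else:
                new.append(sum(vals[j] for j in nbrs) / len(nbrs))
        vals = new
    return vals
```

On a path graph 0-1-2 with the middle value missing, the imputed value converges to the neighbor average 0.5; with many missing nodes, imputed entries drift toward a shared value, which is exactly the low-variance effect described above.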
Paperid:1985
Authors:Arka Pal · Rahul Thomas · Louai Zahran · Erica Choi · Akilesh Potti · Micah Goldblum
Abstract:
Recent advances in Large Language Models (LLMs) have led to widespread adoption of third-party inference services, raising critical privacy concerns. In this work, we introduce a novel reconstruction technique that can recover original prompts from hidden states with nearly perfect accuracy across multiple state-of-the-art LLMs in the increasingly important open-weights setting. Although the attack is conceptually simple, it has not -- to the best of our knowledge -- previously been described nor shown to work practically. Furthermore, our attack remains effective against various permutation and noise-based defenses, challenging assumptions about the security of previously proposed schemes. To address these vulnerabilities, we propose Cascade, a multi-party inference scheme that leverages sharding in the sequence dimension to retain privacy of the user input. Through theoretical analysis and empirical evaluation, we demonstrate that Cascade is secure against both our attack as well as previous methods, while maintaining computational and communication efficiency. Our findings highlight the importance of rigorous security analysis in privacy-preserving LLM inference and offer practical solutions for secure deployment.
Paperid:1986
Authors:Yuhang Li · Ruokai Yin · Donghyun Lee · Shiting Xiao · Priyadarshini Panda
Abstract:
We introduce GPTQv2, a novel fine-tuning-free quantization method for compressing large-scale transformer architectures. Unlike the previous GPTQ method, which independently calibrates each layer, we always match the quantized layer's output to the exact output in the full-precision model, resulting in a scheme that we call asymmetric calibration. Such a scheme can effectively reduce the quantization error accumulated in previous layers. We analyze this problem using optimal brain compression to derive a closed-form solution. The new solution explicitly minimizes the quantization error as well as the asymmetry alignment error. Furthermore, we utilize various techniques to parallelize the solution calculation, including channel parallelization, neuron decomposition, and Cholesky reformulation for matrix fusion. As a result, GPTQv2 is easy to implement, using only 20 more lines of code than GPTQ while improving its performance under low-bit quantization. Remarkably, on a single GPU, we quantize a 405B language transformer as well as EVA-02, the top-ranked vision transformer that achieves 90\% pretraining ImageNet accuracy. Code is provided in the Supplementary Material.
Paperid:1987
Authors:Pengkang Guo · Bruno Correia · Pierre Vandergheynst · Daniel Probst
Abstract:
Machine learning for protein modeling faces significant challenges due to proteins' inherently dynamic nature, yet most graph-based machine learning methods rely solely on static structural information. Recently, the growing availability of molecular dynamics trajectories provides new opportunities for understanding the dynamic behavior of proteins; however, computational methods for utilizing this dynamic information remain limited. We propose a novel graph representation that integrates both static structural information and dynamic correlations from molecular dynamics trajectories, enabling more comprehensive modeling of proteins. By applying relational graph neural networks (RGNNs) to process this heterogeneous representation, we demonstrate significant improvements over structure-based approaches across three distinct tasks: atomic adaptability prediction, binding site detection, and binding affinity prediction. Our results validate that combining static and dynamic information provides complementary signals for understanding protein-ligand interactions, offering new possibilities for drug design and structural biology applications.
Paperid:1988
Authors:Toon Vanderschueren · Tim Verdonck · M van der Schaar · Wouter Verbeke
Abstract:
Estimating causal effects is crucial in domains like healthcare, economics, and education. Despite advances in machine learning (ML) for estimating conditional average treatment effects (CATE), the practical adoption of these methods remains limited, due to the complexities of implementing, tuning, and validating them. To address these challenges, we formalize the search for an optimal ML pipeline for CATE estimation as a counterfactual Combined Algorithm Selection and Hyperparameter (CASH) optimization. We introduce AutoCATE, the first end-to-end, automated solution for CATE estimation. Unlike prior approaches that address only parts of this problem, AutoCATE integrates evaluation, estimation, and ensembling in a unified framework. AutoCATE enables comprehensive comparisons of different protocols, yielding novel insights into CATE estimation and a final configuration that outperforms commonly used strategies. To facilitate broad adoption and further research, we release AutoCATE as an open-source software package.
Paperid:1989
Authors:Yunzhen Feng · Ariel Kwiatkowski · Kunhao Zheng · Yaqi Duan · Julia Kempe
Abstract:
As large language models increasingly drive real-world applications, aligning them with human values becomes paramount. Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique, translating preference data into reward models when oracle human values remain inaccessible. In practice, RLHF mostly relies on approximate reward models, which may not consistently guide the policy toward maximizing the underlying human values. We propose Policy-Interpolated Learning for Aligned Feedback (PILAF), a novel response sampling strategy for preference labeling that explicitly aligns preference learning with maximizing the underlying oracle reward. PILAF is theoretically grounded, demonstrating optimality from both an optimization and a statistical perspective. The method is straightforward to implement and demonstrates strong performance in iterative and online RLHF settings where feedback curation is critical.
Paperid:1990
Authors:Shantanu Acharya · Fei Jia · Boris Ginsburg
Abstract:
Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.
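The two phases described above can be sketched with a minimal single-head toy (a hypothetical simplification, not the paper's implementation: tokens double as their own keys and values, and phase 1 runs one local attention round per block):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(q, keys, vals):
    # single-query scaled dot-product attention
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    w = softmax(scores)
    return [sum(wi * v[j] for wi, v in zip(w, vals)) for j in range(d)]

def star_attention(blocks, query):
    # Phase 1: each "host" encodes its context block with blockwise-local
    # attention only -- no cross-block communication is needed.
    caches = [[attend(tok, blk, blk) for tok in blk] for blk in blocks]
    # Phase 2: the query attends globally to all cached tokens.
    cached = [kv for cache in caches for kv in cache]
    return attend(query, cached, cached)
```

Phase 1 is embarrassingly parallel across hosts, which is where the efficiency gain comes from; only phase 2 touches the full cache.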
Paperid:1991
Authors:Simon Markus Geisler · Tom Wollschläger · M. Hesham Abdalla · Vincent Cohen-Addad · Johannes Gasteiger · Stephan Günnemann
Abstract:
To circumvent the alignment of large language models (LLMs), current optimization-based adversarial attacks usually craft adversarial prompts by maximizing the likelihood of a so-called affirmative response. An affirmative response is a manually designed start of a harmful answer to an inappropriate request. While it is often easy to craft prompts that yield a substantial likelihood for the affirmative response, the attacked model actually frequently does not complete the response in a harmful manner. Such objectives do not adapt to the attacked model and essentially ignore the fact that LLMs output a distribution over responses. If low attack success under these objectives is taken as a measure of robustness, the true robustness might be grossly overestimated. To alleviate these flaws, we propose an adaptive and semantic optimization objective over a population of responses that builds on the REINFORCE policy-gradient formalism. We utilize the state-of-the-art Greedy Coordinate Gradient (GCG) and Projected Gradient Descent (PGD) jailbreak algorithms for our novel adversarial optimization procedure and reveal that current LLMs are even more fragile than previously anticipated. For example, our objective doubles the attack success rate (ASR) on Llama3, and on its circuit-breaker defended version, it achieves an ASR of 61%.
Paperid:1992
Authors:Kunlun Xu · Xu Zou · Gang Hua · Jiahuan Zhou
Abstract:
Domain Incremental Learning (DIL) tackles the critical challenge of learning from non-stationary data streams across domains while retaining and utilizing past knowledge. Although prompt-based methods effectively store multi-domain knowledge in prompt parameters and obtain advanced performance through cross-domain prompt fusion, we reveal an intrinsic limitation: component-wise misalignment between domain-specific prompts leads to conflicting knowledge integration and degraded predictions. This arises from the random positioning of knowledge components within prompts, where irrelevant component fusion introduces interference. To address this, we propose Componential Prompt-Knowledge Alignment (KA-Prompt), a novel prompt-based DIL method that introduces component-aware prompt-knowledge alignment during training, significantly improving both the learning and inference capacity of the model. KA-Prompt operates in two phases: (1) Initial Componential Structure Configuring, where a set of old prompts containing knowledge relevant to the new domain are mined via greedy search, which is then exploited to initialize new prompts to achieve reusable knowledge transfer and establish intrinsic alignment between new and old prompts. (2) Online Alignment Preservation, which dynamically identifies the target old prompts and applies adaptive componential consistency constraints as new prompts evolve. Extensive experiments on DIL benchmarks show KA-Prompt obtains consistent improvements over state-of-the-art methods. Our source code will be publicly available.
Paperid:1993
Authors:KAIJUN LIU · Sijie Ruan · Liang Zhang · Cheng Long · Shuliang Wang · Liang Yu
Abstract:
Recovering human trajectories from incomplete or missing data is crucial for many mobility-based urban applications, e.g., urban planning, transportation, and location-based services. Existing methods mainly rely on recurrent neural networks or attention mechanisms. Though promising, they encounter limitations in capturing complex spatial-temporal dependencies in low-sampling trajectories. Recently, diffusion models have shown potential in content generation. However, most of the proposed methods generate content in continuous numerical representations, which cannot be directly adapted to human location trajectory recovery. In this paper, we introduce a conditional diffusion-based trajectory recovery method, namely DiffMove. It first transforms locations in trajectories into an embedding space, in which embedding denoising is performed, and then missing locations are recovered by an embedding decoder. DiffMove not only improves accuracy by introducing high-quality generative methods into trajectory recovery, but also carefully models the transition, periodicity, and temporal patterns in human mobility. Extensive experiments based on two representative real-world mobility datasets are conducted, and the results show significant improvements (an average of 11% in recall) over the best baselines.
Paperid:1994
Authors:Arwen Bradley · Preetum Nakkiran · David Berthelot · James Thornton · Joshua M Susskind
Abstract:
We study the theoretical foundations of composition in diffusion models, with a particular focus on out-of-distribution extrapolation and length-generalization. Prior work has shown that composing distributions via linear score combination can achieve promising results, including length-generalization in some cases (Du et al., 2023; Liu et al., 2022). However, our theoretical understanding of how and why such compositions work remains incomplete. In fact, it is not even entirely clear what it means for composition to "work". This paper starts to address these fundamental gaps. We begin by precisely defining one possible desired result of composition, which we call projective composition. Then, we investigate: (1) when linear score combinations provably achieve projective composition, (2) whether reverse-diffusion sampling can generate the desired composition, and (3) the conditions under which composition fails. Finally, we connect our theoretical analysis to prior empirical observations where composition has either worked or failed, for reasons that were unclear at the time.
Paperid:1995
Authors:Tiange Liu · Nikola Surjanovic · Miguel Biron-Lattes · Alexandre Bouchard-Côté · Trevor Campbell
Abstract:
Many common Markov chain Monte Carlo (MCMC) kernels can be formulated using a deterministic involutive proposal with a step size parameter. Selecting an appropriate step size is often a challenging task in practice; and for complex multiscale targets, there may not be one choice of step size that works well globally. In this work, we address this problem with a novel class of involutive MCMC methods---AutoStep MCMC---that selects an appropriate step size at each iteration, adapted to the local geometry of the target distribution. We prove that AutoStep MCMC is $\pi$-invariant and has other desirable properties under mild assumptions on the target distribution $\pi$ and involutive proposal. Empirical results examine the effect of various step size selection design choices, and show that AutoStep MCMC is competitive with state-of-the-art methods in terms of effective sample size per unit cost on a range of challenging target distributions.
Paperid:1996
Authors:Chao Yang · Shuting Cui · Yang Yang · Shuang Li
Abstract:
Understanding human mental states—such as intentions and desires—is crucial for natural AI-human collaboration. However, this is challenging because human actions occur irregularly over time, and the underlying mental states that drive these actions are unobserved. To tackle this, we propose a novel framework that combines a logic-informed temporal point process (TPP) with amortized variational Expectation-Maximization (EM). Our key innovation is integrating logic rules as priors to guide the TPP’s intensity function, allowing the model to capture the interplay between actions and mental events while reducing dependence on large datasets. To handle the intractability of mental state inference, we introduce a discrete-time renewal process to approximate the posterior. By jointly optimizing model parameters, logic rules, and inference networks, our approach infers entire mental event sequences and adaptively predicts future actions. Experiments on both synthetic and real-world datasets show that our method outperforms existing approaches in accurately inferring mental states and predicting actions, demonstrating its effectiveness in modeling human cognitive processes.
Paperid:1997
Authors:Filip Ekström Kelvinius · Zheng Zhao · Fredrik Lindsten
Abstract:
A recent line of research has exploited pretrained generative diffusion models as priors for solving Bayesian inverse problems. We contribute to this research direction by designing a sequential Monte Carlo method for linear-Gaussian inverse problems which builds on ``decoupled diffusion'', where the generative process is designed such that larger updates to the sample are possible. The method is asymptotically exact and we demonstrate the effectiveness of our Decoupled Diffusion Sequential Monte Carlo (DDSMC) algorithm on both synthetic data and image reconstruction tasks. Further, we demonstrate how the approach can be extended to discrete data.
Paperid:1998
Authors:Shenyuan Gao · Siyuan Zhou · Yilun Du · Jun Zhang · Chuang Gan
Abstract:
World models aim to learn action-controlled prediction models and have proven essential for the development of intelligent agents. However, most existing world models rely heavily on substantial action-labeled data and costly training, making it challenging to adapt to novel environments with various actions through limited interactions. This limitation hinders their applicability to broader domains. To overcome this challenge, we propose AdaWorld, an innovative world model learning approach that enables efficient adaptation. The key idea is to incorporate action information during the pretraining of world models. This is achieved by extracting latent actions from videos in a self-supervised manner, capturing the most critical transitions between frames. We then develop an autoregressive world model that conditions on these latent actions. This new learning paradigm enables highly adaptable world models, facilitating efficient transfer and learning of new actions even with limited demonstrations and interactions. We conduct comprehensive experiments across multiple environments to verify the efficacy of AdaWorld. Our model achieves superior performance in both simulation quality and visual planning. The code and model will be made public.
Paperid:1999
Authors:Xinnuo Xu · Rachel Lawrence · Kshitij Dubey · Atharva Pandey · Fabian Falck · Risa Ueno · Aditya Nori · Rahul Sharma · Amit Sharma · Javier Gonzalez
Abstract:
Recent Large Language Models (LLMs) have reported high accuracy on reasoning benchmarks. However, it is still unclear whether the observed results arise from true “reasoning” or from statistical recall of the training set. Inspired by the ladder of causation (Pearl, 2009) and its three levels (associations, interventions and counterfactuals), this paper introduces RE-IMAGINE: a framework to characterize a hierarchy of reasoning ability in LLMs, alongside an automated pipeline to generate problem variations at different levels of the hierarchy. By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone. Moreover, the framework is general and can work across reasoning domains, including math, code, and logic. We demonstrate our framework on four widely-used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations. These assessments indicate a degree of reliance on statistical recall for past performance, and open the door to further research targeting skills across the reasoning hierarchy.
Paperid:2000
Authors:Shahaf E. Finder · Ron Shapira Weber · Moshe Eliasof · Oren Freifeld · Eran Treister
Abstract:
Message-Passing Neural Networks (MPNNs) have become a cornerstone for processing and analyzing graph-structured data. However, their effectiveness is often hindered by phenomena such as over-squashing, where long-range dependencies or interactions are inadequately captured and expressed in the MPNN output. This limitation mirrors the challenges of the Effective Receptive Field (ERF) in Convolutional Neural Networks (CNNs), where the theoretical receptive field is underutilized in practice. In this work, we show and theoretically explain the limited ERF problem in MPNNs. Furthermore, inspired by recent advances in ERF augmentation for CNNs, we propose an Interleaved Multiscale Message-Passing Neural Networks (IM-MPNN) architecture to address these problems in MPNNs. Our method incorporates a hierarchical coarsening of the graph, enabling message-passing across multiscale representations and facilitating long-range interactions without excessive depth or parameterization. Through extensive evaluations on benchmarks such as the Long-Range Graph Benchmark (LRGB), we demonstrate substantial improvements over baseline MPNNs in capturing long-range dependencies while maintaining computational efficiency.
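To make the coarsening idea concrete, here is a toy sketch (not the IM-MPNN architecture; the fixed cluster partition and global mixing rule are illustrative assumptions): one fine-level mean-aggregation pass has a one-hop receptive field, while a single coarse-level pass already lets information reach distant nodes.

```python
def message_pass(adj, x):
    # one round of mean-aggregation message passing (one-hop receptive field)
    return [(x[i] + sum(x[j] for j in adj[i])) / (1 + len(adj[i]))
            for i in range(len(x))]

def coarse_pass(adj, x, clusters):
    # clusters: node lists partitioning the graph (a fixed toy coarsening)
    # pool node features into cluster features, mix at the coarse level
    # (here: a single global average for simplicity), then broadcast back
    pooled = [sum(x[v] for v in c) / len(c) for c in clusters]
    mixed = sum(pooled) / len(pooled)
    out = list(x)
    for c in clusters:
        for v in c:
            out[v] = 0.5 * (x[v] + mixed)  # interleave fine and coarse signals
    return out
```

On a 4-node path with a feature spike at node 0, a single fine pass leaves node 3 untouched, whereas the coarse pass delivers a nonzero signal there, which is the long-range effect the abstract targets.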
Paperid:2001
Authors:Weiwei Liu
Abstract:
High-dimensional inference addresses scenarios where the dimension of the data approaches, or even surpasses, the sample size. In these settings, the regularized $M$-estimator is a common technique for inferring parameters. Negahban et al. (2009) establish a unified framework for deriving convergence rates in the context of high-dimensional scaling, demonstrating that estimation errors are confined within a restricted set, and revealing fast convergence rates. The key assumption underlying their work is the convexity of the loss function. However, many loss functions in high-dimensional contexts are nonconvex. This leads to the question: if the loss function is nonconvex, do estimation errors still fall within a restricted set? If so, can we recover convergence rates of the estimation error in nonconvex situations? This paper provides affirmative answers to these critical questions.
Paperid:2002
Authors:Yuchen Wang · Xuefeng Bai · Xinyang Chen · Xiucheng Li · Weili Guan · Liqiang Nie
Abstract:
Adapting vision-language models (VLMs) to downstream tasks with pseudolabels has gained increasing attention. A major obstacle is that the pseudolabels generated by VLMs tend to be imbalanced, leading to inferior performance. While existing methods have explored various strategies to address this, the underlying causes of imbalance remain insufficiently investigated. To fill this gap, we delve into imbalanced pseudolabels and identify two primary contributing factors: concept mismatch and concept confusion. To mitigate these two issues, we propose a novel framework incorporating concept alignment and confusion-aware calibrated margin mechanisms. The core of our approach lies in enhancing underperforming classes and promoting balanced predictions across categories, thus mitigating imbalance. Extensive experiments on six benchmark datasets with three learning paradigms demonstrate that the proposed method effectively enhances the accuracy and balance of pseudolabels, achieving a relative improvement of 6.29\% over the SoTA method. Our code is available at https://anonymous.4open.science/r/CAP-C642/
Paperid:2003
Authors:Meir Yossef Levi · Guy Gilboa
Abstract:
Contrastive Language-Image Pre-Training (CLIP) is highly instrumental in machine learning applications within a large variety of domains. We investigate the geometry of this embedding, which is still not well understood, and show that text and image reside on linearly separable ellipsoid shells, not centered at the origin. We explain the benefits of having this structure, which allows instances to be better embedded according to their uncertainty during contrastive training. Frequent concepts in the dataset yield more false negatives, inducing greater uncertainty. A new notion of conformity is introduced, which measures the average cosine similarity of an instance to any other instance within a representative data set. We show this measure can be accurately estimated by simply computing the cosine similarity to the modality mean vector. Furthermore, we find that CLIP's modality gap optimizes the matching of the conformity distributions of image and text.
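The conformity measure and its mean-vector shortcut can be sketched directly from their definitions (a minimal illustration with 2-D unit vectors; the paper's finding is that the shortcut closely tracks the exact average):

```python
import math

def cos(u, v):
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def conformity_exact(x, data):
    # definition: average cosine similarity of x to every instance in the set
    return sum(cos(x, y) for y in data) / len(data)

def conformity_approx(x, data):
    # shortcut: cosine similarity to the modality mean vector
    d = len(x)
    mean = [sum(y[j] for y in data) / len(data) for j in range(d)]
    return cos(x, mean)
```

For unit-norm embeddings the two quantities differ only by the norm of the mean vector, so they rank instances identically; the exact version costs O(n) per query while the shortcut is O(1) after a single mean computation.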
Paperid:2004
Authors:Xiyuan Wei · Ming Lin · Fanjiang Ye · Fengguang Song · Liangliang Cao · My T. Thai · Tianbao Yang
Abstract:
Learning with a reference model (LEAR) leverages a pretrained model to guide and enhance the training of a target model, often through strategic data selection or weighting. While effective in various contexts, including the training of large foundation models, its underlying principles remain insufficiently understood, leading to heuristic approaches that may not achieve optimal performance. In this work, we propose a theory-driven framework for learning with a reference model, rooted in distributionally robust optimization (DRO). Through a generalization analysis of DRO, we provide theoretical insights into why this approach improves generalization compared to training without a reference model. To the best of our knowledge, this is the first time such theoretical insights have been provided for LEAR, which significantly enhances our understanding and practice of LEAR. Building on these insights and the connection between contrastive learning and DRO, we introduce a novel method for contrastive language-image pretraining (CLIP) with a reference model, termed DRRho-CLIP. Our extensive experiments verify the theoretical insights and also show better performance of our method than existing heuristic approaches.
Paperid:2005
Authors:Alex Crane · Thomas Stanley · Blair D. Sullivan · Nate Veldt
Abstract:
We consider a framework for clustering edge-colored hypergraphs, where the goal is to cluster (equivalently, to color) objects based on the primary type of multiway interactions they participate in. One well-studied objective is to color nodes to minimize the number of unsatisfied hyperedges---hyperedges containing a node whose color does not match the hyperedge color. We motivate and present advances for two variants of this simple minimization problem. The first is new algorithms for maximizing satisfied edges, which is the same at optimality but much more challenging to approximate, and was previously studied only in graphs. We develop the first approximation algorithm for hypergraphs, and then refine it to improve the best-known approximation factor for graphs. The second direction considers new objective functions that incorporate notions of balance and fairness in terms of colors, providing new hardness results, approximations, and fixed-parameter tractability results.
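The baseline objective is easy to state in code; below is a minimal evaluator for the number of unsatisfied hyperedges under a given node coloring (the data layout is an illustrative assumption, not the paper's formulation):

```python
def unsatisfied(hyperedges, coloring):
    """Count hyperedges containing a node whose color differs from the edge color.

    hyperedges: list of (edge_color, node_list) pairs
    coloring:   dict mapping node id -> assigned color
    """
    return sum(1 for color, nodes in hyperedges
               if any(coloring[v] != color for v in nodes))
```

A hyperedge is satisfied only when every node in it carries the hyperedge's color, which is what makes the maximization variant mentioned above so much harder to approximate than the minimization one.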
Paperid:2006
Authors:Chaochen Gao · Xing W · Zijia Lin · Debing Zhang · Songlin Hu
Abstract:
Large language models (LLMs) with extended context windows have made significant strides, yet training them remains a challenge due to the scarcity of long documents. Existing methods tend to synthesize long-context data but lack a clear mechanism to reinforce long-range dependency modeling. To address this limitation, we propose NExtLong, a novel framework for synthesizing long-context data through Negative document Extension. NExtLong decomposes a document into multiple meta-chunks and extends the context by interleaving hard negative distractors retrieved from pretraining corpora. This approach compels the model to discriminate long-range dependent context from distracting content, enhancing its ability to model long-range dependencies. Extensive experiments demonstrate that NExtLong achieves significant performance improvements on the HELMET and RULER benchmarks compared to existing long-context synthesis approaches and leading models trained on non-synthetic long documents. These findings highlight NExtLong's ability to reduce reliance on non-synthetic long documents, making it an effective framework for developing advanced long-context LLMs.
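The interleaving step can be sketched as follows (a toy version: the real framework retrieves hard negatives from pretraining corpora, which is replaced here by a supplied distractor list; the function name and padding token are illustrative):

```python
def nextlong_extend(meta_chunks, distractors, per_chunk=2):
    """Interleave each meta-chunk with distractor chunks.

    The resulting sequence forces a model to track the true dependency
    chain (the meta-chunks) across intervening distracting content.
    """
    out, pool = [], iter(distractors)
    for chunk in meta_chunks:
        out.append(chunk)
        for _ in range(per_chunk):
            out.append(next(pool, "<pad>"))
    return out
```

The distance between consecutive meta-chunks grows with `per_chunk`, so the synthesized sequence length, and hence the long-range dependency span, is directly controllable.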
Paperid:2007
Authors:Juan Perdomo
Abstract:
Social predictions do not passively describe the future; they actively shape it. They inform actions and change individual expectations in ways that influence the likelihood of the predicted outcome and the distribution of observed features. Given this impact, to what extent is accurate social prediction possible? This question was discussed throughout the 20th century by authors like Merton, Morgenstern, Simon, and others who considered it a central issue in social science methodology. In this work, we provide a modern answer to this old problem. Using recent ideas from performative prediction and outcome indistinguishability, we establish that one can always efficiently predict (binary) social events accurately, regardless of how predictions influence data. While achievable, we also show that these predictions are often undesirable, highlighting the limitations of previous desiderata.
Paperid:2008
Authors:Marino Kühne · Panagiotis D. Grontas · Giulia De Pasquale · Giuseppe Belgioioso · Florian Dörfler · John Lygeros
Abstract:
Although social networks have expanded the range of ideas and information accessible to users, they are also criticized for amplifying the polarization of user opinions. Given the inherent complexity of these phenomena, existing approaches to counteract these effects typically rely on handcrafted algorithms and heuristics. We propose an elegant solution: we act on the network weights that model user interactions on social networks (e.g., frequency of communication), to optimize a performance metric (e.g., polarization reduction), while users’ opinions follow the classical Friedkin-Johnsen model. Our formulation gives rise to a challenging large-scale optimization problem with non-convex constraints, for which we develop a gradient-based algorithm. Our scheme is simple, scalable, and versatile, as it can readily integrate different, potentially non-convex, objectives. We demonstrate its merit by: (i) rapidly solving complex social network intervention problems with 3 million variables based on the Reddit and DBLP datasets; (ii) significantly outperforming competing approaches in terms of both computation time and disagreement reduction.
Paperid:2009
Authors:jiayue Liu · Zhengyang Zhou · Zhongchao Yi · Qihe Huang · Kuo Yang · Xu Wang · Yang Wang
Abstract:
Discovering regularities in spatiotemporal systems can benefit various scientific and social planning tasks. Current spatiotemporal learners usually train an independent model on specific source data, which leads to limited transferability across sources, where even correlated tasks require new designs and training. The key to increasing cross-domain knowledge is to enable collective intelligence and model evolution. In this paper, inspired by neuroscience theories, we theoretically derive the increased information boundary achieved by learning cross-domain collective intelligence and propose a Synaptic EVOlutional spatiotemporal network, SynEVO, which breaks model independence and enables cross-domain knowledge to be shared and aggregated. Specifically, we first re-order the sample groups to imitate human curriculum learning, and devise two complementary learners, an elastic common container and a task-independent extractor, to allow model growth and task-wise commonality and personality disentanglement. Then an adaptive dynamic coupler with a new difference metric determines whether the extracted personal representation should be incorporated into the common container to achieve model evolution under various domains. Experiments show that SynEVO improves generalization capacity by up to 42\% under cross-domain scenarios, and SynEVO provides a paradigm of NeuroAI for knowledge transfer and adaptation.
Paperid:2010
Authors:Mengzhu Wang · houcheng su · Jiao Li · Chuan Li · Nan Yin · Li Shen · Jingcai Guo
Abstract:
Semi-supervised learning (SSL) has made notable advancements in medical image segmentation (MIS), particularly in scenarios with limited labeled data, significantly enhancing data utilization efficiency. Previous methods primarily focus on complex training strategies to utilize unlabeled data but neglect the importance of graph structural information. Different from existing methods, we propose a graph-based clustering for semi-supervised medical image segmentation (GraphCL) by jointly modeling graph data structure in a unified deep model. The proposed GraphCL model enjoys several advantages. Firstly, to the best of our knowledge, this is the first work to model the data structure information for semi-supervised medical image segmentation (SSMIS). Secondly, to obtain clustered features across different graphs, we integrate both pairwise affinities between local image features and raw features as inputs. Extensive experimental results on three standard benchmarks show that the proposed GraphCL algorithm outperforms state-of-the-art semi-supervised medical image segmentation methods.
Paperid:2011
Authors:Zhipeng Wei · Yuqi Liu · N. Benjamin Erichson
Abstract:
Jailbreaking techniques trick Large Language Models (LLMs) into producing restricted outputs, posing a serious threat. One line of defense is to use another LLM as a Judge to evaluate the harmfulness of generated text. However, we reveal that these Judge LLMs are vulnerable to token segmentation bias, an issue that arises when delimiters alter the tokenization process, splitting words into smaller subtokens. This disrupts the embeddings of the entire sequence, reducing detection accuracy and allowing harmful content to be misclassified as safe. In this paper, we introduce Emoji Attack, a novel strategy that amplifies existing jailbreak prompts by exploiting token segmentation bias. Our method leverages in-context learning to systematically insert emojis into text before it is evaluated by a Judge LLM, inducing embedding distortions that significantly lower the likelihood of detecting unsafe content. Unlike traditional delimiters, emojis also introduce semantic ambiguity, making them particularly effective in this attack. Through experiments on state-of-the-art Judge LLMs, we demonstrate that Emoji Attack substantially reduces the "unsafe" prediction rate, bypassing existing safeguards.
Paperid:2012
Authors:Dapeng Jiang · Xiangzhe Kong · Jiaqi Han · Mingyu Li · Rui Jiao · Wenbing Huang · Stefano Ermon · Jianzhu Ma · Yang Liu
Abstract:
Cyclic peptides, characterized by geometric constraints absent in linear peptides, offer enhanced biochemical properties, presenting new opportunities to address unmet medical needs. However, designing target-specific cyclic peptides remains underexplored due to limited training data. To bridge this gap, we propose CP-Composer, a novel generative framework that enables zero-shot cyclic peptide generation via composable geometric constraints. Our approach decomposes complex cyclization patterns into unit constraints, which are incorporated into a diffusion model through geometric conditioning on nodes and edges. During training, the model learns from unit constraints and their random combinations in linear peptides, while at inference, novel constraint combinations required for cyclization are imposed as input. Experiments show that our model, despite being trained on linear peptides, is capable of generating diverse target-binding cyclic peptides, reaching success rates from 38\% to 84\% on different cyclization strategies.
Paperid:2013
Authors:Jonas Gehring · Kunhao Zheng · Jade Copet · Vegard Mella · Taco Cohen · Gabriel Synnaeve
Abstract:
Large language models (LLMs) deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum. Crucially, such LLMs need to ground their generations in any feedback obtained to reliably achieve desired outcomes. We propose an end-to-end reinforcement learning method for teaching models to leverage execution feedback in the realm of code synthesis, where state-of-the-art LLMs struggle to improve code iteratively compared to independent sampling. We benchmark on competitive programming tasks, where we achieve new state-of-the-art results with both small (8B parameters) and large (70B) models while reducing the number of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps.
Paperid:2014
Authors:Shaofeng Yin · Jialong Wu · Siqiao Huang · Xingjian Su · he · Jianye Hao · Mingsheng Long
Abstract:
Heterogeneity in sensors and actuators across environments poses a significant challenge to building large-scale pre-trained world models on top of such low-dimensional sensor information. In this work, we explore pre-training world models for heterogeneous environments by addressing key transfer barriers in both data diversity and model flexibility. We introduce UniTraj, a unified dataset comprising over one million trajectories from 80 environments, designed to scale data while preserving critical diversity. Additionally, we propose TrajWorld, a novel architecture capable of flexibly handling varying sensor and actuator information and capturing environment dynamics in-context. Pre-training TrajWorld on UniTraj demonstrates significant improvements in transition prediction and achieves a new state-of-the-art for off-policy evaluation. To the best of our knowledge, this work, for the first time, demonstrates the transfer benefits of world models across heterogeneous and complex control environments. The dataset and pre-trained model will be released to support future research.
Paperid:2015
Authors:Tao Zhang · Zhenhai Liu · Feipeng Qi · Yongjun Jiao · Tailin Wu
Abstract:
Multiphysics simulation, which models the interactions between multiple physical processes, and multi-component simulation of complex structures are critical in fields like nuclear and aerospace engineering. Previous studies often rely on numerical solvers or machine learning-based surrogate models to solve or accelerate these simulations. However, multiphysics simulations typically require integrating multiple specialized solvers, each responsible for evolving a specific physical process, into a coupled program, which introduces significant development challenges. Furthermore, existing numerical algorithms face difficulties when dealing with highly complex large-scale structures in multi-component simulations. Here we propose compositional Multiphysics and Multi-component Simulation with Diffusion models (MultiSimDiff) to overcome these challenges. During diffusion-based training, MultiSimDiff learns energy functions modeling the conditional probability of one physical process/component conditioned on other processes/components. At inference, MultiSimDiff generates coupled multiphysics solutions and multi-component structures by sampling from the joint probability distribution, achieved by composing the learned energy functions in a structured way. We test our method on three tasks. In the reaction-diffusion and nuclear thermal coupling problems, MultiSimDiff successfully predicts the coupling solution using decoupled data, while the surrogate model fails in the more complex second problem. For the thermal and mechanical analysis of the prismatic fuel element, MultiSimDiff trained for single-component prediction accurately predicts a larger structure with 64 components, reducing the relative error by 40.3% compared to the surrogate model.
Paperid:2016
Authors:Samuel Holt · Max Ruiz Luyten · Antonin Berthon · M van der Schaar
Abstract:
Constructing robust and flexible simulators is essential for safely asking “what if?” questions and assessing policy decisions in domains such as healthcare, logistics, and epidemic planning. Yet existing approaches often rely either on purely data-driven models, which cannot extrapolate beyond the historical distributions observed in the available sparse or partially observed datasets, or attempt to generate simulators solely via Large Language Models (LLMs), risking the compound effect of subtle inaccuracies and misalignment with empirical evidence. We propose a hybrid framework for automatic environment creation that integrates explicit domain knowledge, encoded and refined through an LLM, with a gradient-free optimization step to estimate fixed parameters aligned with empirical evidence. Concretely, an LLM-guided search loop identifies the simulator's structural components, such as submodules capturing different processes, causal relationships, and inter-component couplings. A gradient-free optimization procedure then fits these submodules to real-world data, accommodating non-differentiable, discrete, or stochastic elements. This design harnesses common-sense and domain-specific priors to mitigate data inefficiency and out-of-distribution brittleness, while remaining grounded in reality, being flexible to diverse requirements, and admitting system-level interventions. Our method provides a blueprint for building more reliable, causally informed, and uncertainty-aware simulators, paving the way for better decision-making in complex real-world systems.
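As a toy illustration of the gradient-free calibration step described above, the sketch below fits a simulator's fixed parameters to data by plain random search; the simulator, the bounds, and the MSE objective are illustrative assumptions, not the paper's actual components.

```python
import numpy as np

def random_search_fit(simulate, data, bounds, n_trials=2000, rng=0):
    """Gradient-free calibration by random search: sample parameter vectors
    uniformly from `bounds` and keep the one minimizing the mean squared
    error between the simulator output and the observed data."""
    rng = np.random.default_rng(rng)
    lo, hi = np.array(bounds, dtype=float).T
    best_params, best_err = None, np.inf
    for _ in range(n_trials):
        params = rng.uniform(lo, hi)
        err = float(np.mean((simulate(params) - data) ** 2))
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err
```

Because only forward simulations are needed, the same loop works for non-differentiable, discrete, or stochastic simulator components, which is the point of avoiding gradients here.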
Paperid:2017
Authors:Rui Luo · Zhixin Zhou
Abstract:
Conformal prediction provides a robust framework for generating prediction sets with finite-sample coverage guarantees, independent of the underlying data distribution. However, existing methods typically rely on a single conformity score function, which can limit the efficiency and informativeness of the prediction sets. In this paper, we present a novel approach that enhances conformal prediction for multi-class classification by optimally averaging multiple conformity score functions. Our method involves assigning weights to different score functions and employing various data splitting strategies. Additionally, our approach bridges concepts from conformal prediction and model averaging, offering a more flexible and efficient tool for uncertainty quantification in classification tasks. We provide a comprehensive theoretical analysis grounded in Vapnik–Chervonenkis (VC) theory, establishing finite-sample coverage guarantees and demonstrating the efficiency of our method. Empirical evaluations on benchmark datasets show that our weighted averaging approach consistently outperforms single-score methods by producing smaller prediction sets without sacrificing coverage.
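To make the idea of averaging conformity scores concrete, here is a minimal split-conformal sketch that combines two common scores with a fixed weight; the particular scores, the weight `w`, and the data-splitting scheme are illustrative assumptions, not the optimal weighting the paper derives.

```python
import numpy as np

def weighted_conformal_sets(probs_cal, y_cal, probs_test, w=0.5, alpha=0.1):
    """Split-conformal prediction sets from a weighted average of two
    conformity scores (illustrative choices):
      s1 = 1 - probability of the true class
      s2 = margin between the top class and the true class
    Higher scores mean "less conforming"."""
    n = len(y_cal)
    s1 = 1.0 - probs_cal[np.arange(n), y_cal]
    s2 = probs_cal.max(axis=1) - probs_cal[np.arange(n), y_cal]
    scores = w * s1 + (1.0 - w) * s2

    # Finite-sample-corrected quantile of the calibration scores.
    level = min(1.0, np.ceil((n + 1) * (1.0 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")

    sets = []
    for p in probs_test:
        s_all = w * (1.0 - p) + (1.0 - w) * (p.max() - p)
        sets.append(np.where(s_all <= q)[0])  # labels passing the cutoff
    return sets
```

Any convex combination of valid scores is again a valid score, so coverage is preserved for any fixed `w`; choosing `w` well is what shrinks the sets.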
Paperid:2018
Authors:Alon Arad · Saharon Rosset
Abstract:
Accurate and reliable probability predictions are essential for multi-class supervised learning tasks, where well-calibrated models enable rational decision-making. While isotonic regression has proven effective for binary calibration, its extension to multi-class problems via one-vs-rest calibration often produces suboptimal results, limiting its practical adoption. In this work, we propose novel isotonic normalization-aware techniques for multi-class calibration, grounded in natural and intuitive assumptions expected by practitioners. Unlike prior approaches, our methods inherently account for probability normalization by either incorporating normalization directly into the optimization process (NA-FIR) or modeling the problem as a cumulative bivariate isotonic regression (SCIR). Empirical evaluations on a variety of text and image classification datasets across different model architectures reveal that our approach consistently improves log loss and expected calibration error (ECE) metrics. These findings underscore the potential of our approach to enhance a-parametric multi-class calibration practices, offering an adaptable solution for real-world applications.
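For readers unfamiliar with the binary building block the abstract starts from, here is a minimal pool-adjacent-violators (PAV) sketch of isotonic regression; it illustrates the standard non-decreasing least-squares fit, not the proposed NA-FIR or SCIR methods.

```python
def pav(y):
    """Pool-adjacent-violators: least-squares non-decreasing fit to y."""
    out = []  # stack of [block_mean, block_count]
    for v in y:
        out.append([v, 1])
        # Merge adjacent blocks that violate monotonicity.
        while len(out) > 1 and out[-2][0] > out[-1][0]:
            m2, c2 = out.pop()
            m1, c1 = out.pop()
            out.append([(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2])
    return [m for m, c in out for _ in range(c)]
```

In binary calibration, `y` would be 0/1 labels sorted by classifier score; the fitted block means are the calibrated probabilities. The one-vs-rest extension runs this per class, which is exactly where the normalization problem the abstract targets arises.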
Paperid:2019
Authors:Junbiao Cui · Qin Yue · Jianqing Liang · Jiye Liang
Abstract:
Classification is a cornerstone of machine learning research. Most existing classifiers implicitly assume that the concepts corresponding to classes can be precisely defined. This notion diverges from the widely accepted understanding in cognitive science, which posits that real-world concepts are often inherently ambiguous. To bridge this gap, we propose a Human Cognition-Inspired Hierarchical Fuzzy Learning Machine (HC-HFLM), which leverages a novel hierarchical alignment loss to integrate rich class knowledge from the human knowledge system into the learning process. We further theoretically prove that minimizing this loss can align the hierarchical structure derived from data with that contained in the class knowledge, resulting in clear semantic meaning and high interpretability. Systematic experiments verify that the proposed method achieves significant gains in interpretability and generalization performance.
Paperid:2020
Authors:Seunghan Lee · Taeyoung Park · Kibok Lee
Abstract:
Channel identification (CID) refers to the ability to distinguish among individual channels in time series (TS) modeling. Motivated by the observation that TS models without CID often apply identical weights across channels, we highlight the importance of CID and propose Channel Normalization (CN), a simple yet effective normalization strategy that enhances CID by assigning distinct affine transformation parameters to each channel. We further extend CN in two ways: 1) Adaptive CN (ACN) dynamically adjusts parameters based on the input TS, improving adaptability in TS models, and 2) Prototypical CN (PCN) introduces a set of learnable prototypes instead of per-channel parameters, enabling applicability to datasets with an unknown or varying number of channels and facilitating use in TS foundation models. We demonstrate the effectiveness of CN and its variants by applying them to various TS models, achieving significant performance gains for both non-CID and CID models. In addition, we analyze the success of our approach from an information theory perspective. Code is available at anonymous.4open.science/r/channel_normalization-9F1C/.
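A minimal sketch of the core CN idea as stated in the abstract (normalize each channel, then apply a distinct per-channel affine transform); the exact placement within a TS model and the ACN/PCN variants are beyond this illustration.

```python
import numpy as np

def channel_norm(x, gamma, beta, eps=1e-5):
    """Channel Normalization sketch for a (channels, time) array:
    z-normalize each channel over time, then apply a distinct affine
    transform (gamma_c, beta_c) per channel, so channels are no longer
    treated interchangeably."""
    mean = x.mean(axis=1, keepdims=True)   # per-channel mean over time
    std = x.std(axis=1, keepdims=True)
    x_hat = (x - mean) / (std + eps)
    return gamma[:, None] * x_hat + beta[:, None]
```

In a trained model, `gamma` and `beta` would be learnable vectors with one entry per channel, which is what lets the model tell channels apart.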
Paperid:2021
Authors:Mingzhao Yang · Shangchao Su · Bin Li · Xiangyang Xue
Abstract:
In recent years, One-shot Federated Learning (OSFL) methods based on Diffusion Models (DMs) have garnered increasing attention due to their remarkable performance. However, most of these methods require the deployment of foundation models on client devices, which significantly raises the computational requirements and reduces their adaptability to heterogeneous client models compared to traditional FL methods. In this paper, we propose FedLMG, a heterogeneous one-shot Federated learning method with Local Model-Guided diffusion models. Briefly speaking, in FedLMG, clients do not need access to any foundation models but only train and upload their local models, which is consistent with traditional FL methods. On the clients, we employ classification loss and BN loss to capture the broad category features and detailed contextual features of the client distributions. On the server, based on the uploaded client models, we utilize backpropagation to guide the server’s DM in generating synthetic datasets that comply with the client distributions, which are then used to train the aggregated model. By using the locally trained client models as a medium to transfer client knowledge, our method significantly reduces the computational requirements on client devices and effectively adapts to scenarios with heterogeneous clients. Extensive quantitative and visualization experiments on three large-scale real-world datasets, along with theoretical analysis, demonstrate that the synthetic datasets generated by FedLMG exhibit comparable quality and diversity to the client datasets, which leads to an aggregated model that outperforms all compared methods and even the performance ceiling, further elucidating the significant potential of utilizing DMs in FL.
Paperid:2022
Authors:Michalis Titsias
Abstract:
Sparse variational Gaussian processes (GPs) construct tractable posterior approximations to GP models. At the core of these methods is the assumption that the true posterior distribution over training function values ${\bf f}$ and inducing variables ${\bf u}$ is approximated by a variational distribution that incorporates the conditional GP prior $p({\bf f} | {\bf u})$ in its factorization. While this assumption is considered fundamental, we show that for model training we can relax it through the use of a more general variational distribution $q({\bf f} | {\bf u})$ that depends on $N$ extra parameters, where $N$ is the number of training examples. In GP regression, we can analytically optimize the evidence lower bound over the extra parameters and express a tractable collapsed bound that is tighter than the previous bound. The new bound is also amenable to stochastic optimization and its implementation requires minor modifications to existing sparse GP code. Further, we also describe extensions to non-Gaussian likelihoods. On several datasets we demonstrate that our method can reduce bias when learning the hyperparameters and can lead to better predictive performance.
Paperid:2023
Authors:Jintao Tong · Ran Ma · Yixiong Zou · Guangyao Chen · Yuhua Li · Ruixuan Li
Abstract:
Cross-domain few-shot segmentation (CD-FSS) first pre-trains the model on a source-domain dataset with sufficient samples and then transfers it to target-domain datasets where only a few training samples are available for efficient finetuning. There are two major challenges in this task: (1) the domain gap and (2) finetuning with scarce data. To solve these challenges, we revisit adapter-based methods and discover an intriguing insight not explored in previous works: the adapter not only helps the fine-tuning of downstream tasks but also naturally serves as a domain information decoupler. Then, we delve into this finding for an interpretation, and we find that the model's inherent structure can lead to a natural decoupling of domain information. Building upon this insight, we propose the Domain Feature Navigator (DFN), which is a structure-based decoupler, instead of the loss-based ones used in current works, to capture domain-specific information, thereby directing the model's attention towards domain-agnostic knowledge. Moreover, to prevent potential excessive overfitting of DFN during the source-domain training, we further design the SAM-SVN method to constrain DFN from learning sample-specific knowledge. On target domains, we freeze the model and fine-tune the DFN to learn knowledge specific to target domains. Extensive experiments demonstrate that our method significantly surpasses the state-of-the-art method in CD-FSS by 2.69% and 4.68% average MIoU in 1-shot and 5-shot scenarios, respectively.
Paperid:2024
Authors:Xuqian Xue · Yiming Lei · Qi Cai · Hongming Shan · Junping Zhang
Abstract:
While contrastive multi-view clustering has achieved remarkable success, it implicitly assumes a balanced class distribution. However, real-world multi-view data primarily exhibits imbalanced class distributions. Consequently, existing methods suffer substantial performance degradation due to their inability to perceive and model such imbalance. To address this challenge, we present the first systematic study of imbalanced multi-view clustering, focusing on two fundamental problems: (i) perceiving the class imbalance distribution, and (ii) mitigating the representation degradation of minority samples. We propose PROTOCOL, a novel PaRtial Optimal TranspOrt-enhanced COntrastive Learning framework for imbalanced multi-view clustering. First, for class imbalance perception, we map multi-view features into a consensus space and reformulate imbalanced clustering as a partial optimal transport problem, augmented with progressive mass constraints and weighted KL divergence for class distributions. Second, we develop a partial optimal transport (POT)-enhanced class-rebalanced contrastive learning at both feature and class levels, incorporating logit adjustment and class-sensitive learning to enhance minority sample representations. Extensive experiments on five datasets demonstrate that PROTOCOL significantly improves clustering performance on imbalanced multi-view data, filling a critical research gap in this field.
Paperid:2025
Authors:Yiheng Xu · Zekun Wang · Junli Wang · Dunjie Lu · Tianbao Xie · Amrita Saha · Doyen Sahoo · Tao Yu · Caiming Xiong
Abstract:
Automating GUI tasks remains challenging due to reliance on textual representations, platform-specific action spaces, and limited reasoning capabilities. We introduce Aguvis, a unified vision-based framework for autonomous GUI agents that directly operates on screen images, standardizes cross-platform interactions, and incorporates structured reasoning via inner monologue. To enable this, we construct Aguvis Data Collection, a large-scale dataset with multimodal grounding and reasoning annotations, and develop a two-stage training pipeline that separates GUI grounding from planning and reasoning. Experiments show that Aguvis achieves state-of-the-art performance across offline and real-world online benchmarks, marking the first fully autonomous vision-based GUI agent that operates without closed-source models. We open-source all datasets, models, and training recipes to advance future research.
Paperid:2026
Authors:Youssef Allouah · Rachid Guerraoui · John Stephan
Abstract:
Resilience against malicious parties and data privacy are essential for trustworthy distributed learning, yet achieving both with good utility typically requires the strong assumption of a trusted central server. This paper shows that a significantly weaker assumption suffices: each pair of workers shares a randomness seed unknown to others. In a setting where malicious workers may collude with an untrusted server, we propose CafCor, an algorithm that integrates robust gradient aggregation with correlated noise injection, leveraging shared randomness between workers. We prove that CafCor achieves strong privacy-utility trade-offs, significantly outperforming local differential privacy (DP) methods, which do not make any trust assumption, while approaching central DP utility, where the server is fully trusted. Empirical results on standard benchmarks validate CafCor's practicality, showing that privacy and robustness can coexist in distributed systems without sacrificing utility or trusting the server.
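The shared-seed mechanism described above can be illustrated with pairwise canceling noise: each worker pair derives the same noise from its shared seed, one worker adds it and the other subtracts it, so honest contributions cancel in aggregate while each individual message stays perturbed. The noise scale and distribution here are illustrative, not CafCor's calibration.

```python
import numpy as np

def pairwise_correlated_noise(n_workers, dim, seeds):
    """Each pair (i, j) shares a seed: worker i adds the derived noise and
    worker j subtracts it, so the noise cancels in the honest sum while each
    worker's individual gradient message stays perturbed."""
    noise = np.zeros((n_workers, dim))
    for (i, j), seed in seeds.items():
        z = np.random.default_rng(seed).normal(size=dim)
        noise[i] += z
        noise[j] -= z
    return noise
```

In a real protocol, each worker would generate its share locally from the seeds it holds; the server never needs to see the seeds, which is what removes the trusted-server assumption.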
Paperid:2027
Authors:Kyurae Kim · Zuheng Xu · Jacob Gardner · Trevor Campbell
Abstract:
The performance of sequential Monte Carlo (SMC) samplers heavily depends on the tuning of the Markov kernels used in the path proposal. For SMC samplers with unadjusted Markov kernels, standard tuning objectives, such as the Metropolis-Hastings acceptance rate or the expected squared jump distance, are no longer applicable. While stochastic gradient-based end-to-end optimization algorithms have been explored for tuning SMC samplers, they often incur excessive training costs, even for tuning just the kernel step sizes. In this work, we propose a general adaptation framework for tuning the Markov kernels in SMC samplers by minimizing the incremental Kullback-Leibler (KL) divergence between the proposal and target paths. For step size tuning, we provide a gradient- and tuning-free algorithm that is generally applicable for kernels such as Langevin Monte Carlo (LMC). We further demonstrate the utility of our approach by providing a tailored scheme for tuning kinetic LMC used in SMC samplers. Our implementations are able to obtain a full schedule of tuned parameters at the cost of a few vanilla SMC runs, which is a fraction of gradient-based approaches.
Paperid:2028
Authors:Manon Revel · Smitha Milli · Tyler Lu · Jamelle Watson-Daniels · Maximilian Nickel
Abstract:
Online comment sections, such as those on news sites or social media, have the potential to foster informal public deliberation. However, this potential is often undermined by the frequency of toxic or low-quality exchanges in these settings. Algorithmic ranking is increasingly deployed to facilitate higher-quality discussions, e.g., by using civility classifiers or prosocial ranking. Yet, these interventions can also inadvertently undermine another key aspect of deliberation: representation of diverse views. We seek to remedy this problem by introducing guarantees of representation into these methods. We adopt a notion of justified representation (JR) from the social choice literature and incorporate a JR constraint into the comment ranking setting. Our theoretical and empirical results show that, in natural settings where user opinions cluster into a few groups, enforcing JR substantially improves representation and inclusivity of diverse views with minimal impact on other measures of conversational quality or on user engagement.
Paperid:2029
Authors:Yang Shen · Xiu-Shen Wei · Yifan Sun · YuXin Song · Tao Yuan · Jian Jin · He-Yang Xu · Yazhou Yao · Errui Ding
Abstract:
Computer Vision (CV) has yet to fully achieve the zero-shot task generalization observed in Natural Language Processing (NLP), despite following many of the milestones established in NLP, such as large transformer models, extensive pre-training, and the auto-regression paradigm, among others. In this paper, we rethink the reality that CV adopts discrete and terminological task definitions (e.g., "image segmentation"), and conjecture that this is a key barrier hampering zero-shot task generalization. Our hypothesis is that without truly understanding previously seen tasks, due to these terminological definitions, deep models struggle to generalize to novel tasks. To verify this, we introduce Explanatory Instructions, which provide an intuitive way to define CV task objectives through detailed linguistic transformations from input images to outputs. We create a large-scale dataset comprising 12 million "image input $\to$ explanatory instruction $\to$ output" triplets, and train an auto-regressive-based vision-language model (AR-based VLM) that takes both images and explanatory instructions as input. By learning to follow these instructions, the AR-based VLM achieves instruction-level zero-shot capabilities for previously seen tasks and demonstrates strong zero-shot generalization for unseen CV tasks. Code and dataset will be open-sourced.
Paperid:2030
Authors:Samantha Chen · Pankaj Agarwal · Yusu Wang
Abstract:
The ability to acquire high-resolution, large-scale geospatial data at an unprecedented rate using LiDAR and other related technologies has intensified the need for scalable algorithms for terrain analysis, including \emph{shortest-path-distance} (SPD) queries on large-scale terrain digital elevation models (DEMs). In this paper, we present a \emph{neural data structure} for efficiently answering SPD queries approximately on a large terrain DEM, based on the recently proposed neural geodesic field (NeuroGF) framework \cite{ZhangHAWH23}, the state-of-the-art neural data structure for estimating geodesic distance. In particular, we propose a decoupled-NeuroGF data structure combined with an efficient two-stage mixed-training strategy, which significantly reduces computational bottlenecks and enables efficient training on terrain DEMs at a scale not feasible before. We demonstrate the efficacy of our approach by performing detailed experiments on both synthetic and real data sets. For instance, we can train a small model with around 70,000 parameters on a terrain DEM with 16 million nodes in a matter of hours that can answer SPD queries with 1\% relative error in at most 10ms per query.
Paperid:2031
Authors:Sebastian Bordt · Suraj Srinivas · Valentyn Boreiko · Ulrike Luxburg
Abstract:
The leakage of benchmark data into the training data has emerged as a significant challenge for evaluating the capabilities of large language models (LLMs). In this work, we challenge the common assumption that small-scale contamination renders benchmark evaluations invalid. First, we experimentally quantify the magnitude of benchmark overfitting by scaling along three dimensions: the number of model parameters (up to 1.6B), the number of times an example is seen (up to 144), and the number of training tokens (up to 40B). If model and data follow the Chinchilla scaling laws, minor contamination indeed leads to overfitting. At the same time, even contamination repeated 144 times can be forgotten if the training data is scaled beyond five times Chinchilla, a regime characteristic of many modern LLMs. Continual pre-training of OLMo-7B corroborates these results. Next, we study the impact of the weight decay parameter on example forgetting, showing that empirical forgetting occurs faster than the cumulative weight decay. This allows us to gauge the degree of example forgetting in large-scale training runs, indicating that many LLMs, including Llama 3 405B, have forgotten the data seen at the beginning of training.
Paperid:2032
Authors:Yifan Zhang · Zijian Wei · Haojie Ren · Changliang Zou
Abstract:
Online multiple hypothesis testing has attracted a lot of attention in many applications, e.g., weekly A/B testing and stock market price monitoring. The state-of-the-art generalized $\alpha$-investing (GAI) algorithms can control the online false discovery rate (FDR) on p-values only under specific dependence structures, a situation that rarely occurs in practice. The e-LOND algorithm (Xu & Ramdas, 2024) utilizes e-values to achieve online FDR control under arbitrary dependence but suffers from a significant loss in power, as its testing levels are derived from pre-specified descent sequences. To address these limitations, we propose e-GAI, a novel framework based on valid e-values. e-GAI ensures provable online FDR control under arbitrary dependence while improving power by dynamically allocating the testing levels. These testing levels are updated not only based on the number of previous rejections and the prior costs, but also by assigning less $\alpha$-wealth to each rejection from a risk-aversion perspective. Within the e-GAI framework, we introduce two new online FDR procedures, e-LORD and e-SAFFRON, analogues of the LORD and SAFFRON procedures, two popular algorithms in the GAI framework. Both simulated and real data experiments demonstrate the advantages of e-LORD and e-SAFFRON in FDR control and power.
Paperid:2033
Authors:XIAOXUAN HAN · Songlin Yang · Wei Wang · Yang Li · JING DONG
Abstract:
Text-to-image (T2I) diffusion models have raised concerns about generating inappropriate content, such as "nudity". Despite efforts to erase undesirable concepts through unlearning techniques, these unlearned models remain vulnerable to adversarial inputs that can regenerate such content. Therefore, establishing robust adversarial defenses is crucial for safeguarding unlearned diffusion models. However, existing defense methods struggle to balance adversarial robustness with the generative capabilities of T2I models (i.e., model utility). To address these challenges, we propose a novel inference-time defense strategy to mitigate the impact of adversarial inputs. Specifically, we first reformulate the challenge of ensuring robustness in unlearned diffusion models as a robust regression problem. We then extend naive median smoothing for regression robustness, which employs isotropic Gaussian noise, to a generalized median smoothing framework incorporating anisotropic noise. Based on this, we propose a token-wise Adaptive Median Smoothing method that dynamically adjusts noise intensity according to the relevance of tokens to target concepts. Furthermore, to improve inference efficiency, we explore implementations of this adaptive method at the text-encoding stage. This adaptive method enhances adversarial robustness while preserving model utility and inference efficiency, as demonstrated by extensive experiments showing that it efficiently boosts the robustness of unlearned models and maintains their original generative capabilities, outperforming baseline defense approaches.
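The underlying primitive, randomized median smoothing of a regression function, can be sketched generically: perturb the input with Gaussian noise and take the median of the outputs, with a per-dimension noise scale standing in for the anisotropic variant. This is a minimal illustration under assumed names, not the paper's token-wise method.

```python
import numpy as np

def median_smooth(f, x, sigma, n_samples=501, seed=0):
    """Randomized median smoothing: median of f over Gaussian
    perturbations of x. `sigma` may be a scalar (isotropic) or a
    per-dimension array (anisotropic)."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, 1.0, size=(n_samples, x.size)) * sigma
    outputs = np.array([f(x + eps) for eps in noise])
    return np.median(outputs, axis=0)

# toy regression with an adversarial-style spike at exactly x = 0
f = lambda z: float(z.sum() + (100.0 if abs(z[0]) < 1e-3 else 0.0))
x = np.zeros(2)
y = median_smooth(f, x, sigma=np.array([0.5, 0.5]))
# the median is robust: the measure-zero spike barely moves it
```

Because the median, unlike the mean, ignores a small fraction of corrupted outputs, the smoothed prediction stays close to the benign value even when some perturbed inputs trigger the spike.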
Paperid:2034
Authors:Tianyu Zhang · Andrew Williams · Phillip Wozny · Kai-Hendrik Cohrs · Koen Ponse · Marco Jiralerspong · Soham Phade · Sunil Srinivasa · Lu Li · Yang Zhang · Prateek Gupta · Erman Acar · Irina Rish · Yoshua Bengio · Stephan Zheng
Abstract:
Global cooperation on climate change mitigation is essential to limit temperature increases while supporting long-term, equitable economic growth and sustainable development. Achieving such cooperation among diverse regions, each with different incentives, in a dynamic environment shaped by complex geopolitical and economic factors, without a central authority, is a profoundly challenging game-theoretic problem. This article introduces RICE-N, a multi-region integrated assessment model that simulates the global climate, economy, and climate negotiations and agreements. RICE-N uses multi-agent reinforcement learning (MARL) to encourage agents to develop strategic behaviors based on environmental dynamics and the actions of others. We present two negotiation protocols: (1) Bilateral Negotiation, an exemplary protocol, and (2) Basic Club, inspired by Climate Clubs and the carbon border adjustment mechanism (Nordhaus, 2015; Comissions, 2022). We compare their impact against a no-negotiation baseline with various mitigation strategies, showing that both protocols significantly reduce temperature growth at the cost of a minor drop in production while ensuring a more equitable distribution of emission reduction costs.
Paperid:2035
Authors:Lisha Chen · Quan Xiao · Ellen Fukuda · Xinyi Chen · Kun Yuan · Tianyi Chen
Abstract:
Multi-objective learning under user-specified preference is common in real-world problems such as portfolio management and molecule design. In this work, we frame such a problem as a semivectorial bilevel optimization problem, whose goal is to optimize a pre-defined preference function, subject to the constraint that the model parameters are weakly Pareto optimal or Pareto stationary. To solve this problem, we propose a two-step reformulation. First, we convert the multi-objective constraints to a single-objective constraint through a merit function with an easy-to-evaluate gradient. Second, we use a penalty-based reformulation of the bilevel optimization problem. We theoretically establish the relations between the solutions of the different formulations. We then propose algorithms to solve the reformulated optimization problem and establish their convergence guarantees. To evaluate the performance of the proposed algorithm, we test the method on various synthetic and real-world problems. The results demonstrate the effectiveness of the proposed method in finding preference-guided optimal solutions to the multi-objective problem.
Paperid:2036
Authors:Zheng Fang · Lichuan Xiang · Xu Cai · Kaicheng Zhou · Hongkai Wen
Abstract:
ControlNet offers a powerful way to guide diffusion-based generative models, yet most implementations rely on ad-hoc heuristics to choose which network blocks to control, an approach that varies unpredictably across tasks. To address this gap, we propose FlexControl, a novel framework that copies all diffusion blocks during training and employs a trainable gating mechanism to dynamically select which blocks to activate at each denoising step. By introducing a computation-aware loss, we encourage control blocks to activate only when they benefit generation quality. By eliminating manual block selection, FlexControl enhances adaptability across diverse tasks and streamlines the design pipeline, trained end-to-end with the computation-aware loss. Through comprehensive experiments on both UNet (e.g., SD1.5) and DiT (e.g., SD3.0), we show that our method outperforms existing ControlNet variants in key aspects of interest. As evidenced by both quantitative and qualitative evaluations, FlexControl preserves or enhances image fidelity while also reducing computational overhead by selectively activating the most relevant blocks. These results underscore the potential of a flexible, data-driven approach for controlled diffusion and open new avenues for efficient generative model design.
Paperid:2037
Authors:Suyuan Zhao · YIZHEN LUO · Ganbo Yang · Yan Zhong · Hao Zhou · Zaiqing Nie
Abstract:
Spatial Transcriptomics (ST) technologies provide biologists with rich insights into single-cell biology by preserving the spatial context of cells. Building foundation models for ST can significantly enhance the analysis of vast and complex data sources, unlocking new perspectives on the intricacies of biological tissues. However, modeling ST data is inherently challenging due to the need to extract multi-scale information from tissue slices containing vast numbers of cells. This process requires integrating macro-scale tissue morphology, micro-scale cellular microenvironment, and gene-scale gene expression profiles. To address this challenge, we propose SToFM, a multi-scale Spatial Transcriptomics Foundation Model. SToFM first performs multi-scale information extraction on each ST slice to construct a set of ST sub-slices that aggregate macro-, micro-, and gene-scale information. Then an SE(2) Transformer is used to obtain high-quality cell representations from the sub-slices. Additionally, we construct SToCorpus-88M, the largest high-resolution spatial transcriptomics corpus for pretraining. SToFM achieves outstanding performance on a variety of downstream tasks, such as tissue region semantic segmentation and cell type annotation, demonstrating its comprehensive understanding of ST data by capturing and integrating multi-scale information.
Paperid:2038
Authors:Joanna Waczyńska · Tomasz Szczepanik · Piotr Borycki · Slawomir Tadeja · Thomas Bohné · Przemysław Spurek
Abstract:
Implicit Neural Representations (INRs) approximate discrete data through continuous functions and are commonly used for encoding 2D images. Traditional image-based INRs employ neural networks to map pixel coordinates to RGB values, capturing shapes, colors, and textures within the network's weights. Recently, 2D Gaussian Splatting (GS) has been proposed as an alternative, using Gaussian functions instead of neural networks; it achieves a quality and compression ratio similar to classical INR models but does not allow image modification. In contrast, our work introduces a novel method, MiraGe, which uses mirror reflections to perceive 2D images in 3D space and employs flat-controlled Gaussians for precise 2D image editing. Our approach improves rendering quality and allows realistic image modifications, including human-inspired perception of photos in the 3D world. Thanks to modeling images in 3D space, we obtain the illusion of 3D-based modification in 2D images. We also show that our Gaussian representation can easily be combined with a physics engine to produce physics-based modification of 2D images. Consequently, MiraGe allows for better quality than the standard approach and for natural modification of 2D images.
Paperid:2039
Authors:Shibo Jie · Yehui Tang · Kai Han · Zhi-Hong Deng · Jing Han
Abstract:
Transformer-based large language models (LLMs) have already achieved remarkable results on long-text tasks, but limited GPU memory (VRAM) struggles to accommodate the linearly growing demand for the key-value (KV) cache as the sequence length increases, which has become a bottleneck for applying LLMs to long sequences. Existing KV cache compression methods evict, merge, or quantize the KV cache to reduce its size. However, compression results in irreversible information forgetting, potentially affecting the accuracy of subsequent decoding. In this paper, we propose SpeCache, which takes full advantage of the large and easily expandable CPU memory to offload the complete KV cache, and dynamically fetches KV pairs back in each decoding step based on their importance, measured by a low-precision copy of the KV cache kept in VRAM. To avoid inference latency caused by CPU-GPU communication, SpeCache speculatively predicts the KV pairs that the next token might attend to, allowing us to prefetch them before the next decoding step and thus parallelize prefetching and computation. Experiments on LongBench and Needle-in-a-Haystack benchmarks verify that SpeCache effectively reduces VRAM usage while avoiding information forgetting for long sequences without re-training, even with a 10x KV cache compression ratio.
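The core scoring idea, ranking offloaded KV pairs with a cheap low-precision key copy and fetching only the top candidates, can be sketched generically. All names, shapes, and the int8 scheme below are illustrative assumptions, not SpeCache's actual implementation.

```python
import numpy as np

def topk_prefetch(q, keys_lowprec, k):
    """Score offloaded KV pairs with a low-precision key copy and
    return indices of the top-k pairs to prefetch from CPU memory."""
    scores = keys_lowprec @ q            # approximate attention logits
    return np.argsort(scores)[-k:][::-1]

rng = np.random.default_rng(0)
keys = rng.normal(size=(128, 16)).astype(np.float32)   # full-precision (offloaded)
# small int8 copy kept in VRAM purely for importance scoring
keys_int8 = np.round(keys * 127 / np.abs(keys).max()).astype(np.int8)
q = rng.normal(size=16).astype(np.float32)

approx = set(topk_prefetch(q, keys_int8.astype(np.float32), 8))
exact = set(np.argsort(keys @ q)[-8:])
overlap = len(approx & exact)
# low-precision scoring recovers most of the truly important pairs
```

The point of the sketch is that a heavily quantized copy preserves the ranking of attention scores well enough to decide what to fetch, so the full-precision cache can live in cheap CPU memory.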
Paperid:2040
Authors:Pinzheng Wang · Juntao Li · Zecheng Tang · Haijia Gui · Min zhang
Abstract:
Large language models (LLMs) have demonstrated considerable reasoning abilities in various tasks such as mathematics and coding. However, recent studies indicate that even the best models lack true comprehension of their reasoning processes. In this paper, we explore how self-play can enhance the rationality of models in the reasoning process without supervision from humans or superior models. We design a $\textit{\textbf{C}ritic-\textbf{D}iscernment \textbf{G}ame}~(\textbf{CDG})$ in which a prover first provides a solution to a given problem and is subsequently challenged by critiques of its solution. These critiques either aim to assist or mislead the prover. The objective of the prover is to maintain the correct answer when faced with misleading comments, while correcting errors in response to constructive feedback. Our experiments on tasks involving mathematical reasoning, stepwise error detection, self-correction, and long-chain reasoning demonstrate that CDG training can significantly improve the ability of well-aligned LLMs to comprehend their reasoning process.
Paperid:2041
Authors:Wei Huang · Haotong Qin · Yangdong Liu · Yawei Li · Qinshuo Liu · Xianglong Liu · Luca Benini · Michele Magno · Shiming Zhang · XIAOJUAN QI
Abstract:
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). However, while uniform-precision quantization is computationally efficient, it often compromises model performance. To address this, we propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths group-wise with high accuracy. Our approach leverages the observation that important weights follow a structured distribution and introduces two key components: 1) Salience-Determined Bit Allocation adaptively assigns bit-widths to groups within each layer based on their salience; and 2) Salience-Weighted Quantizer Calibration optimizes quantizer parameters by incorporating element-level salience, retaining essential information. With its structured group-wise partitioning, SliM-LLM provides a hardware-friendly solution that matches the efficiency of uniform quantization methods while significantly improving accuracy. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths. For example, a 2-bit quantized LLaMA-7B model reduces memory usage by nearly 6x compared to the floating-point baseline, decreases perplexity by 48% compared to state-of-the-art gradient-free PTQ methods, and maintains GPU inference speed. Additionally, the extended version, SliM-LLM+, which incorporates gradient-based quantization, further reduces perplexity by 35.1%.
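The basic mechanics of group-wise mixed precision can be illustrated with vanilla uniform symmetric quantization, assigning a different bit-width to each group. This is a generic sketch, not SliM-LLM's salience metric or calibration; the helper name and group layout are assumptions for illustration.

```python
import numpy as np

def quantize_groups(w, bits_per_group, group_size=4):
    """Uniform symmetric quantization with one bit-width per group:
    a mixed-precision scheme gives more-salient groups more bits,
    which shrinks their quantization error."""
    out = np.empty_like(w)
    for g, bits in enumerate(bits_per_group):
        chunk = w[g * group_size:(g + 1) * group_size]
        scale = np.abs(chunk).max() / (2 ** (bits - 1) - 1)
        out[g * group_size:(g + 1) * group_size] = (
            np.round(chunk / scale) * scale)
    return out

rng = np.random.default_rng(0)
w = rng.normal(size=8)
# treat the first group as "salient": give it 8 bits instead of 2
salient_first = quantize_groups(w, bits_per_group=[8, 2])
uniform = quantize_groups(w, bits_per_group=[2, 2])
err_salient = np.abs(w[:4] - salient_first[:4]).max()
err_uniform = np.abs(w[:4] - uniform[:4]).max()
# the 8-bit group reconstructs its weights far more accurately
```

Keeping the bit-width constant within each structured group, as the sketch does, is what makes such schemes hardware-friendly compared with per-element precision.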
Paperid:2042
Authors:Anna Soligo · Pietro Ferraro · David Boyle
Abstract:
Interpretability is crucial for ensuring RL systems align with human values. However, it remains challenging to achieve in complex decision-making domains. Existing methods frequently attempt interpretability at the level of fundamental model units, such as neurons or decision nodes: an approach which scales poorly to large models. Here, we instead propose an approach to interpretability at the level of functional modularity. We show how encouraging sparsity and locality in network weights leads to the emergence of functional modules in RL policy networks. To detect these modules, we develop an extended Louvain algorithm which uses a novel `correlation alignment' metric to overcome the limitations of standard network analysis techniques when applied to neural network architectures. Applying these methods to 2D and 3D Minigrid environments reveals the consistent emergence of distinct navigational modules for different axes, and we further demonstrate how these functions can be validated through direct interventions on network weights prior to inference.
Paperid:2043
Authors:Jake Snell · Thomas Griffiths
Abstract:
As machine learning-based prediction systems are increasingly used in high-stakes situations, it is important to understand how such predictive models will perform upon deployment. Distribution-free uncertainty quantification techniques such as conformal prediction provide guarantees about the loss black-box models will incur even when the details of the models are hidden. However, such methods are based on frequentist probability, which unduly limits their applicability. We revisit the central aspects of conformal prediction from a Bayesian perspective and thereby illuminate the shortcomings of frequentist guarantees. We propose a practical alternative based on Bayesian quadrature that provides interpretable guarantees and offers a richer representation of the likely range of losses to be observed at test time.
Paperid:2044
Authors:Swarnadeep Saha · Xian Li · Marjan Ghazvininejad · JASON WESTON · Tianlu Wang
Abstract:
LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-by-step reasoning process that underlies the final evaluation of a response. However, due to the lack of human-annotated CoTs for evaluation, the required components and structure of effective reasoning traces remain understudied. Consequently, previous approaches often (1) constrain reasoning traces to hand-designed components, such as a list of criteria, reference answers, or verification questions, and (2) structure them such that planning is intertwined with the reasoning for evaluation. In this work, we propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge that first generates an unconstrained evaluation plan, followed by its execution, and then the final judgment. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions, leading to better final verdicts. Our method achieves a new state-of-the-art performance for generative reward models on RewardBench (with a score of $93.9$), despite being trained on a smaller number of synthetically generated preference pairs. Additional experiments on other benchmarks like RM-Bench, JudgeBench, and FollowBenchEval further highlight the utility of both planning and reasoning for building robust LLM-as-a-Judge reasoning models.
Paperid:2045
Authors:Jiannian Wang · Yao Lu · Guangming Lu
Abstract:
Image steganography ensures secure information transmission and storage by concealing secret messages within images. Recently, diffusion models have been incorporated into the generative image steganography task, with text prompts employed to guide the entire process. However, existing methods are plagued by three problems: (1) the restricted control exerted by text prompts causes generated stego images to resemble the secret images and appear unnatural, raising severe detection risk; (2) inconsistent intermediate states between Denoising Diffusion Implicit Models and their inversion, coupled with the limited control of text prompts, degrade the revealed secret images; (3) the descriptive text of images (i.e., text prompts) is also deployed as the key, which incurs significant security risks for both the keys and the secret images. To tackle these drawbacks, we systematically propose SSHR, which combines reference images with adaptive keys to govern the entire process, enhancing the naturalness and imperceptibility of stego images. Additionally, we methodically construct an Exact Reveal Process to improve the quality of the revealed secret images. Furthermore, adaptive Reference-Secret Image Related Symmetric Keys are generated to enhance the security of both the keys and the concealed secret images. Various experiments indicate that our model outperforms existing methods in terms of recovery quality and secret image security.
Paperid:2046
Authors:Jennifer Brennan · Sébastien Lahaie · Adel Javanmard · Nick Doudchenko · Jean Pouget-Abadie
Abstract:
In experimental causal inference, we distinguish between two sources of uncertainty: design uncertainty, due to the treatment assignment mechanism, and sampling uncertainty, when the sample is drawn from a superpopulation. This distinction matters in settings with small fixed samples and heterogeneous treatment effects, as in geographical experiments. The standard bootstrap procedure most often used by practitioners primarily estimates sampling uncertainty, and the causal bootstrap procedure, which accounts for design uncertainty, was developed for the completely randomized design and the difference-in-means estimator, whereas non-standard designs and estimators are often used in these low-power regimes. We address this gap by proposing an integer program which numerically computes the worst-case copula used as an input to the causal bootstrap method in a wide range of settings. Specifically, we prove the asymptotic validity of our approach for unconfounded, conditionally unconfounded, and individualistic with bounded confoundedness assignments, and generalize to any linear-in-treatment and quadratic-in-treatment estimators. We demonstrate the refined confidence intervals achieved through simulations of small geographical experiments.
Paperid:2047
Authors:Rui Gao · Weiwei Liu
Abstract:
Continual learning has attracted increasing research attention in recent years due to its promising experimental results in real-world applications. In this paper, we study calibration in continual learning, i.e., reliably quantifying the uncertainty of model predictions. Conformal prediction (CP) provides a general framework for model calibration, outputting prediction intervals or sets with a theoretical high-coverage guarantee as long as the samples are exchangeable. However, the tasks in continual learning are learned in sequence, which violates the principle that data should be exchangeable. Meanwhile, the model learns the current task with limited or no access to data from previous tasks, which is not conducive to constructing the calibration set. To address these issues, we propose a CP-based method for model uncertainty quantification in continual learning (CPCL), which also reveals the connection between prediction interval length and forgetting. We analyze the oracle prediction interval in continual learning and theoretically prove the asymptotic coverage guarantee of CPCL. Finally, extensive experiments on simulated and real data empirically verify the validity of our proposed method.
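The standard exchangeability-based guarantee that CPCL extends can be sketched with textbook split conformal prediction: the empirical quantile of calibration residuals yields an interval half-width with finite-sample coverage. This is the generic baseline, not CPCL itself.

```python
import numpy as np

def split_conformal_halfwidth(residuals_calib, alpha=0.1):
    """Split conformal prediction: the adjusted (1-alpha) quantile of
    calibration residuals |y - f(x)| gives a half-width q such that
    [f(x)-q, f(x)+q] covers exchangeable test points w.p. >= 1-alpha."""
    n = len(residuals_calib)
    level = np.ceil((n + 1) * (1 - alpha)) / n   # finite-sample correction
    return np.quantile(residuals_calib, min(level, 1.0), method="higher")

rng = np.random.default_rng(0)
calib = np.abs(rng.normal(size=1000))  # calibration residuals
test = np.abs(rng.normal(size=1000))   # exchangeable test residuals
q = split_conformal_halfwidth(calib, alpha=0.1)
coverage = float(np.mean(test <= q))
# empirical coverage lands near the nominal 90% level
```

When tasks arrive sequentially, the calibration and test residuals are no longer exchangeable, which is exactly the failure mode the abstract describes.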
Paperid:2048
Authors:Adya Agrawal · Leo de Castro · Daniel Escudero · Antigoni Polychroniadou · Manuela Veloso
Abstract:
As large language models (LLMs) become more powerful, the computation required to run these models is increasingly outsourced to a third-party cloud. While this saves clients' computation, it risks leaking the clients' LLM queries to the cloud provider. Fully homomorphic encryption (FHE) presents a natural solution to this problem: simply encrypt the query and evaluate the LLM homomorphically on the cloud machine. The result remains encrypted and can only be learned by the client who holds the secret key. In this work, we present a GPU-accelerated implementation of FHE and use this implementation to evaluate an encrypted GPT-2 forward pass, with runtimes over $200\times$ faster than the CPU baseline. We also present novel and extensive experimental analysis of approximations of LLM activation functions to maintain accuracy while achieving this performance.
Paperid:2049
Authors:Qirui Wu · Shizhou Zhang · De Cheng · Yinghui Xing · di xu · Peng Wang · Yanning Zhang
Abstract:
Catastrophic forgetting is a critical challenge for incremental object detection (IOD). Most existing methods treat the detector monolithically, relying on instance replay or knowledge distillation without analyzing component-specific forgetting. Through a dissection of Faster R-CNN, we reveal a key insight: catastrophic forgetting is predominantly localized to the RoI Head classifier, while regressors retain robustness across incremental stages. This finding challenges conventional assumptions, motivating us to develop a framework termed NSGP-RePRE. Regional Prototype Replay (RePRE) mitigates classifier forgetting via replay of two types of prototypes: coarse prototypes represent class-wise semantic centers of RoI features, while fine-grained prototypes model intra-class variations. Null Space Gradient Projection (NSGP) is further introduced to eliminate prototype-feature misalignment by updating the feature extractor in directions orthogonal to the subspace of old inputs via gradient projection, aligning RePRE with incremental learning dynamics. Our simple yet effective design allows NSGP-RePRE to achieve state-of-the-art performance on the Pascal VOC and MS COCO datasets under various settings. Our work not only advances IOD methodology but also provides pivotal insights for catastrophic forgetting mitigation in IOD. Code will be available soon.
Paperid:2050
Authors:Siyuan Xu · Minghui Zhu
Abstract:
This paper studies meta-reinforcement learning with human-in-the-loop adaptation. It aims to pre-train a meta-model that can achieve few-shot adaptation for new tasks from human preference queries without relying on reward signals. To solve the problem, we propose the framework \textit{adaptation via Preference-Order-preserving EMbedding} (POEM). In the meta-training, the framework learns a task encoder, which maps tasks to a preference-order-preserving task embedding space, and a decoder, which maps the embeddings to the task-specific policies. In the human-in-the-loop adaptation, the task encoder facilitates efficient task embedding inference for new tasks from the preference queries and then obtains the task-specific policy. We provide a theoretical guarantee for the convergence of the adaptation process to the task-specific optimal policy and experimentally demonstrate its state-of-the-art performance with substantial improvement over baselines.
Paperid:2051
Authors:Jinyu Wang · Jingjing Fu · Rui Wang · Lei Song · Jiang Bian
Abstract:
Recent advancements in Retrieval-Augmented Generation (RAG) systems have significantly enhanced the capabilities of large language models (LLMs) by incorporating external knowledge retrieval. However, sole reliance on retrieval is often inadequate for mining deep, domain-specific knowledge and for performing logical reasoning over specialized datasets. To tackle these challenges, we present an approach designed to extract, comprehend, and utilize domain knowledge while constructing a coherent rationale. At the heart of our approach lie four pivotal components: a knowledge atomizer that extracts atomic questions from raw data, a query proposer that generates subsequent questions to facilitate the original inquiry, an atomic retriever that locates knowledge based on atomic knowledge alignments, and an atomic selector that determines which follow-up questions to pose guided by the retrieved information. Through this approach, we implement a knowledge-aware task decomposition strategy that adeptly extracts multifaceted knowledge from segmented data and iteratively builds the rationale in alignment with the initial query and the acquired knowledge. We conduct comprehensive experiments to demonstrate the efficacy of our approach across various benchmarks, particularly those requiring multi-hop reasoning steps. The results indicate a significant enhancement in performance, up to 12.6\% over the second-best method, underscoring the potential of the approach in complex, knowledge-intensive applications.
Paperid:2052
Authors:Angel Villar-Corrales · Sven Behnke
Abstract:
Predicting future scene representations is a crucial task for enabling robots to understand and interact with the environment. However, most existing methods rely on video sequences and simulations with precise action annotations, limiting their ability to leverage the large amount of available unlabeled video data. To address this challenge, we propose PlaySlot, an object-centric video prediction model that infers object representations and latent actions from unlabeled video sequences. It then uses these representations to forecast future object states and video frames. PlaySlot allows generating multiple possible futures conditioned on latent actions, which can be inferred from video dynamics, provided by a user, or generated by a learned action policy, thus enabling versatile and interpretable world modeling. Our results show that PlaySlot outperforms both stochastic and object-centric baselines for video prediction across different environments. Furthermore, we show that our inferred latent actions can be used to learn robot behaviors sample-efficiently from unlabeled video demonstrations. Code to reproduce our results will be open-sourced upon publication.
Paperid:2053
Authors:Lei Zhang · Jiaxi Yang · Min Yang · Jian Yang · Mouxiang Chen · Jiajun Zhang · Zeyu Cui · Binyuan Hui · Junyang Lin
Abstract:
We introduce UnitFlow, a novel data synthesis framework grounded in Test-Driven Development (TDD). Unlike existing software engineering data that rely on human-submitted issues, UnitFlow automatically infers incremental development steps directly from unit tests, which inherently encapsulate high-level requirements. The core of UnitFlow is the construction of a Runtime Dependency Graph (RDG), which precisely captures function interactions, enabling the generation of a structured, step-by-step development schedule. At each step, UnitFlow produces a partial codebase, the corresponding unit tests, and the necessary code modifications, resulting in fully verifiable TDD tasks. With this approach, we generated 16,061 training instances and 2,020 test instances from real-world GitHub projects, creating the UnitFlow-Eval benchmark. Our experiments show that fine-tuning open models on this dataset significantly improves performance in TDD-based coding. To facilitate further research and advancements in TDD-based software engineering, we open-source all code, datasets, models, and Docker images.
Paperid:2054
Authors:Xinsong Ma · Jie Wu · Weiwei Liu
Abstract:
Out-of-distribution (OOD) detection is a crucial task in reliable and safety-critical applications. Previous studies primarily focus on developing score functions while neglecting the design of decision rules based on these scores. A recent work (Ma et al., 2024) is the first to highlight this issue and proposes the generalized BH (g-BH) algorithm to address it. The g-BH algorithm relies on empirical p-values, with the calibrated set playing a central role in their computation. However, the impact of the calibrated set on the performance of the g-BH algorithm has not been thoroughly investigated. This paper aims to uncover the underlying mechanisms relating them. Theoretically, we demonstrate that the conditional expectation of the true positive rate (TPR) given the calibrated set for the g-BH algorithm follows a beta distribution, which depends on the prescribed level and the size of the calibrated set. This indicates that a small calibrated set tends to degrade the performance of the g-BH algorithm. To address this limitation, we propose a novel ensemble g-BH (eg-BH) algorithm which integrates various empirical p-values for decision making while controlling the false discovery rate (FDR) at the prescribed level. Finally, extensive experimental results validate our theoretical findings and demonstrate the superiority of our method over the g-BH algorithm on small calibrated sets.
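For readers unfamiliar with the base procedure that g-BH generalizes, the vanilla Benjamini-Hochberg step-up rule on p-values can be sketched in a few lines. This is the textbook BH procedure, not the paper's g-BH or eg-BH algorithm.

```python
import numpy as np

def bh_rejections(pvals, alpha=0.1):
    """Benjamini-Hochberg step-up: with sorted p-values p_(1)<=...<=p_(m),
    reject the k smallest where k = max{ i : p_(i) <= alpha * i / m }.
    Returns the set of rejected hypothesis indices."""
    m = len(pvals)
    order = np.argsort(pvals)
    sorted_p = np.asarray(pvals)[order]
    below = np.nonzero(sorted_p <= alpha * np.arange(1, m + 1) / m)[0]
    k = below[-1] + 1 if below.size else 0
    return set(order[:k].tolist())

pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.50, 0.74, 0.90]
rej = bh_rejections(pvals, alpha=0.1)
# note the step-up character: p=0.039 exceeds its own threshold
# (0.1 * 3/8 = 0.0375) but is still rejected because p=0.041 <= 0.1 * 4/8
```

In the g-BH setting the p-values above are replaced by empirical p-values computed from a calibrated set, which is why the size of that set drives the procedure's power.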
Paperid:2055
Authors:Lin Zhao · Yushu Wu · Xinru Jiang · Jianyang Gu · Yanzhi Wang · Xiaolin Xu · Pu Zhao · Xue Lin
Abstract:
Recent deep learning models demand larger datasets, driving the need for dataset distillation to create compact, cost-efficient datasets while maintaining performance. Due to the powerful image generation capability of diffusion models, they have been introduced to this field for generating distilled images. In this paper, we systematically investigate issues present in current diffusion-based dataset distillation methods, including inaccurate distribution matching, distribution deviation with random noise, and separate sampling. Building on this, we propose D$^3$HR, a novel diffusion-based framework to generate distilled datasets with high representativeness. Specifically, we adopt DDIM inversion to map the latents of the full dataset from a low-normality latent domain to a high-normality Gaussian domain, preserving information and ensuring structural consistency to generate representative latents for the distilled dataset. Furthermore, we propose an efficient sampling scheme to better align the representative latents with the high-normality Gaussian distribution. Our comprehensive experiments demonstrate that D$^3$HR can achieve higher accuracy across different model architectures compared with state-of-the-art baselines in dataset distillation.
Paperid:2056
Authors:Sagar Shrestha · Xiao Fu
Abstract:
Diversified distribution matching (DDM) finds a unified translation function mapping a diverse collection of conditional source distributions to their target counterparts. DDM was proposed to resolve content misalignment issues in unpaired domain translation, achieving translation identifiability. However, DDM has only been implemented using GANs due to its constraints on the translation function. GANs are often unstable to train and do not provide transport trajectory information, yet such trajectories are useful in applications such as single-cell evolution analysis and robot route planning. This work introduces diversified flow matching (DFM), an ODE-based framework for DDM. Adapting flow matching (FM) to enforce a unified translation function as in DDM is challenging, as FM learns the translation function's velocity rather than the translation function itself. A custom bilevel optimization-based training loss, a nonlinear interpolant, and a structural reformulation are proposed to address these challenges, offering a tangible implementation. To our knowledge, DFM is the first ODE-based approach guaranteeing translation identifiability. Experiments on synthetic and real-world datasets validate the proposed method.
Paperid:2057
Authors:Sophie Hanna Langbein · Niklas Koenen · Marvin N. Wright
Abstract:
Deep learning survival models often outperform classical methods in time-to-event predictions, particularly in personalized medicine, but their "black box" nature hinders broader adoption. We propose a framework for gradient-based explanation methods tailored to survival neural networks, extending their use beyond regression and classification. We analyze the implications of their theoretical assumptions for time-dependent explanations in the survival setting and propose effective visualizations incorporating the temporal dimension. Experiments on synthetic data show that gradient-based methods capture the magnitude and direction of local and global feature effects, including time dependencies. We introduce GradSHAP(t), a gradient-based counterpart to SurvSHAP(t), which outperforms SurvSHAP(t) and SurvLIME in a computational speed vs. accuracy trade-off. Finally, we apply these methods to medical data with multi-modal inputs, revealing relevant tabular features and visual patterns, as well as their temporal dynamics.
Paperid:2058
Authors:Anissa Alloula · Charles Jones · Ben Glocker · Bartlomiej W. Papiez
Abstract:
Despite the constant development of new bias mitigation methods for machine learning, no method consistently succeeds, and a fundamental question remains unanswered: when and why do bias mitigation techniques fail? In this paper, we hypothesise that a key reason may be linked to an often-overlooked but crucial step shared by many bias mitigation methods: the definition of subgroups. To investigate this, we conduct a comprehensive evaluation of state-of-the-art bias mitigation methods across multiple semi-synthetic image classification tasks, systematically varying subgroup definitions, including coarse, fine-grained, intersectional, and noisy subgroups. Our findings reveal that subgroup choice significantly impacts performance, with certain groupings paradoxically leading to worse outcomes than no mitigation at all. Through theoretical analysis, we explain these discrepancies and uncover a counter-intuitive insight that, in some cases, improving fairness with respect to a particular set of subgroups is best achieved by using a different set of subgroups for mitigation. Our work highlights the necessity of carefully considering subgroup definition before applying bias mitigation and offers new perspectives for improving the robustness and fairness of machine learning models.
Paperid:2059
Authors:Soheil Behnezhad · Moses Charikar · Vincent Cohen-Addad · Alma Ghafari · Weiyun ma
Abstract:
We study the classic correlation clustering problem. Given $n$ objects and a complete labeling of the object pairs as either “similar” or “dissimilar”, the goal is to partition the objects into arbitrarily many clusters while minimizing disagreements with the labels. The classic Pivot algorithm, due to [Ailon et al., STOC'05], obtains a 3-approximation for this problem. Over the years, this algorithm has been successfully implemented in various settings. The downside of the Pivot algorithm is that its approximation factor of 3 is tight. While better approximations have been achieved in some settings, the corresponding algorithms are often hard to implement in these settings. For example, [Behnezhad et al., FOCS'19] showed that the output of Pivot can be maintained in polylog time per update in a dynamic setting, a bound that was improved to constant by [Dalirrooyfard et al., ICML'24]. But obtaining a better approximation remains open. In this paper, we present Modified Pivot, an algorithm that locally improves the output of Pivot. Our Modified Pivot algorithm can be implemented just as efficiently as Pivot in various settings. Our experiments show that the output of Modified Pivot on average makes less than 77\% of the mistakes made by Pivot. More surprisingly, we prove theoretically that Modified Pivot has approximation ratio $3-\epsilon_0$ for some absolute constant $\epsilon_0 > 0$. This, e.g., leads to a better than 3 approximation in the dynamic setting in polylog time, improving the 3-approximation obtained by [Behnezhad et al., FOCS'19] and [Dalirrooyfard et al., ICML'24].
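The baseline this abstract improves on, the Pivot algorithm of Ailon et al., is short enough to sketch in full. This is a plain rendition of the classic routine, not the paper's Modified Pivot; the `similar` callback and function names are illustrative assumptions.

```python
import random

def pivot(nodes, similar, rng=random.Random(0)):
    # Pivot (Ailon et al., STOC'05): repeatedly pick a uniformly random pivot and
    # cluster it together with all remaining nodes labeled "similar" to it.
    # In expectation this gives a 3-approximation for correlation clustering.
    nodes = list(nodes)
    clusters = []
    while nodes:
        p = rng.choice(nodes)
        cluster = [v for v in nodes if v == p or similar(p, v)]
        clusters.append(sorted(cluster))
        taken = set(cluster)
        nodes = [v for v in nodes if v not in taken]
    return clusters
```

On a labeling consistent with disjoint cliques, any pivot choice recovers the cliques exactly; the 3-approximation guarantee only matters when the labels are inconsistent.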
Paperid:2060
Authors:YAN SHEN · Ruihai Wu · Yubin Ke · Xinyuan Song · Zeyi Li · Xiaoqi Li · Hongwei Fan · Haoran Lu · Hao Dong
Abstract:
Shape assembly, the process of combining parts into a complete whole, is a crucial skill for robots with broad real-world applications. Among the various assembly tasks, geometric assembly—where broken parts are reassembled into their original form (e.g., reconstructing a shattered bowl)—is particularly challenging. This requires the robot to recognize geometric cues for grasping, assembly, and subsequent bimanual collaborative manipulation on varied fragments. In this paper, we exploit the geometric generalization of point-level affordance, learning affordance aware of bimanual collaboration in geometric assembly with long-horizon action sequences. To address the evaluation ambiguity caused by geometry diversity of broken parts, we introduce a real-world benchmark featuring geometric variety and global reproducibility. Extensive experiments demonstrate the superiority of our approach over both previous affordance-based and imitation-based methods.
Paperid:2061
Authors:Zihan Huang · Wei Fang · Tong Bu · Peng Xue · Zecheng Hao · Wenxuan Liu · Yuanhong Tang · Zhaofei Yu · Tiejun Huang
Abstract:
Spiking Neural Networks (SNNs) exhibit significant potential due to their low energy consumption. Converting Artificial Neural Networks (ANNs) to SNNs is an efficient way to achieve high-performance SNNs. However, many conversion methods are based on rate coding, which requires numerous spikes and longer time-steps compared to directly trained SNNs, leading to increased energy consumption and latency. This article introduces differential coding for ANN-to-SNN conversion, a novel coding scheme that reduces spike counts and energy consumption by transmitting changes in rate information rather than rates directly, and explores its application across various layers. Additionally, the threshold iteration method is proposed to optimize thresholds based on activation distribution when converting Rectified Linear Units (ReLUs) to spiking neurons. Experimental results on various Convolutional Neural Networks (CNNs) and Transformers demonstrate that the proposed differential coding significantly improves accuracy while reducing energy consumption, particularly when combined with the threshold iteration method, achieving state-of-the-art performance.
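The core idea of transmitting changes in rate rather than rates themselves can be illustrated with a toy encoder/decoder pair. This is a hedged sketch of the general differential-coding principle, not the paper's spiking scheme; the function names are hypothetical.

```python
def differential_encode(rates):
    # Send the change in rate between consecutive time steps instead of the
    # rate itself (illustrative sketch of the differential-coding idea).
    diffs, prev = [], 0
    for r in rates:
        diffs.append(r - prev)
        prev = r
    return diffs

def differential_decode(diffs):
    # The receiver integrates the differences to recover the original rates.
    rates, acc = [], 0
    for d in diffs:
        acc += d
        rates.append(acc)
    return rates
```

When rates vary slowly, the total magnitude transmitted (a proxy for spike count here) is much smaller than the rates themselves, which is the intuition behind the claimed energy savings.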
Paperid:2062
Authors:Dayang Wang · Srivathsa Pasumarthi Venkata · Ajit Shankaranarayanan · Greg Zaharchuk
Abstract:
Contrast-enhanced MRI enhances pathological visualization but often necessitates Pre-Contrast images for accurate quantitative analysis and comparative assessment. However, Pre-Contrast images are frequently unavailable due to time, cost, or safety constraints, or they may suffer from degradation, making alignment challenging. This limitation hinders clinical diagnostics and the performance of tools requiring combined image types. To address this challenge, we propose a novel staged, physics-grounded learning framework with a hyperintensity prior to synthesize Pre-Contrast images directly from Post-Contrast MRIs. The proposed method can generate high-quality Pre-Contrast images, thus enabling comprehensive diagnostics while reducing the need for additional imaging sessions, costs, and patient risks. To the best of our knowledge, this is the first Pre-Contrast synthesis model capable of generating images that may be interchangeably used with standard-of-care Pre-Contrast images. Extensive evaluations across multiple datasets, sites, anatomies, and downstream tasks demonstrate the model’s robustness and clinical applicability, positioning it as a valuable tool for contrast-enhanced MRI workflows.
Paperid:2063
Authors:Jie Bao · Hanwei Zhang · Zhixin Zhou · Chuangyin Dang · Rui Luo
Abstract:
As deep learning models are increasingly deployed in high-risk applications, robust defenses against adversarial attacks and reliable performance guarantees become paramount. Moreover, accuracy alone does not provide sufficient assurance or reliable uncertainty estimates for these models. In this study, we combine adversarial training with Conformal Prediction (CP) and formulate adversarial training within the CP framework as a bi-level optimization problem, where an attacker and a defender pursue opposing objectives to maximize and minimize model uncertainty, respectively. We develop an attack method that does not require coverage guarantees and integrate it with a conformal training-based defense strategy. This algorithm minimizes the size of the prediction sets under adversarial perturbations while maintaining high coverage probabilities. Experimental evaluations on the CIFAR-10 and CIFAR-100 datasets demonstrate that our attack method induces greater uncertainty compared to baseline approaches, while the defensive model significantly enhances robustness against various adversarial attacks and maintains reliable prediction intervals. Our findings highlight the effectiveness of integrating adversarial training with conformal prediction to develop trustworthy and resilient deep learning models for safety-critical domains.
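The conformal prediction machinery the abstract builds on is, in its simplest split form, a one-function procedure. This is a minimal sketch of standard split conformal prediction for classification, not the paper's bi-level adversarial scheme; names and the `1 - p` nonconformity score are illustrative assumptions.

```python
import numpy as np

def conformal_prediction_set(cal_scores, test_probs, alpha=0.1):
    # Split conformal prediction for classification. cal_scores are
    # nonconformity scores 1 - p_model(true label) on a held-out calibration
    # set; the prediction set keeps every label whose score is at most the
    # finite-sample-corrected (1 - alpha) quantile of the calibration scores.
    n = len(cal_scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    qhat = np.quantile(cal_scores, level, method="higher")
    return [k for k, p in enumerate(test_probs) if 1 - p <= qhat]
```

The attacker in the abstract tries to perturb inputs so these sets grow; the defender trains so they stay small while keeping the coverage guarantee.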
Paperid:2064
Authors:Samar Khanna · Medhanie Irgau · David Lobell · Stefano Ermon
Abstract:
Parameter-efficient fine-tuning (PEFT) techniques such as low-rank adaptation (LoRA) can effectively adapt large pre-trained foundation models to downstream tasks using only a small fraction (0.1%-10%) of the original trainable weights. An under-explored question of PEFT is in extending the pre-training phase without supervised labels; that is, can we adapt a pre-trained foundation model to a new domain via efficient self-supervised pre-training on this new domain? In this work, we introduce ExPLoRA, a highly effective technique to improve transfer learning of pre-trained vision transformers (ViTs) under domain shifts. Initializing a ViT with pre-trained weights on large, natural-image datasets such as from DinoV2 or MAE, ExPLoRA continues the unsupervised pre-training objective on a new domain, unfreezing 1-2 pre-trained ViT blocks and tuning all other layers with LoRA. We then fine-tune the resulting model only with LoRA on this new domain for supervised learning. Our experiments demonstrate state-of-the-art results on satellite imagery, even outperforming fully pre-trained and fine-tuned ViTs. Using the DinoV2 training objective, we demonstrate up to 7.5% improvement in linear probing top-1 accuracy on downstream tasks while using <10% of the number of parameters that are used in prior fully-tuned state-of-the-art approaches. Our ablation studies confirm the efficacy of our approach over other baselines, including PEFT and unfreezing more ViT blocks.
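The LoRA building block the abstract assumes can be sketched in a few lines of numpy: a frozen pretrained weight plus a trainable low-rank update. This is the standard LoRA parameterization, not the ExPLoRA training loop; the class name, hyperparameters, and numpy framing are illustrative assumptions.

```python
import numpy as np

class LoRALinear:
    """Frozen pretrained weight W plus a trainable low-rank update
    (alpha / r) * B @ A, as in standard LoRA (sketch, not ExPLoRA itself)."""

    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                        # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(r, d_in))   # trainable down-projection
        self.B = np.zeros((d_out, r))                     # zero-init up-projection
        self.scale = alpha / r

    def __call__(self, x):
        # Effective weight is W + scale * B @ A; only A and B would be trained.
        return x @ (self.W + self.scale * self.B @ self.A).T
```

Because `B` is zero-initialized, the adapted layer starts out identical to the pretrained one, which is what lets ExPLoRA resume the pre-training objective from the original weights.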
Paperid:2065
Authors:Tinglin Huang · Tianyu Liu · Mehrtash Babadi · Wengong Jin · ZHITAO YING
Abstract:
Spatial transcriptomics (ST) has emerged as a powerful technology for bridging histology imaging with gene expression profiling. However, its application has been limited by low throughput and the need for specialized experimental facilities. Prior work sought to predict ST from whole-slide histology images to accelerate this process, but they suffer from two major limitations. First, they do not explicitly model cell-cell interaction as they factorize the joint distribution of whole-slide ST data and predict the gene expression of each spot independently. Second, their encoders struggle with memory constraints due to the large number of spots (often exceeding 10,000) in typical ST datasets. Herein, we propose STFlow, a flow matching generative model that considers cell-cell interaction by modeling the joint distribution of gene expression of an entire slide. It also employs an efficient slide-level encoder with local spatial attention, enabling whole-slide processing without excessive memory overhead. On the recently curated HEST-1k and STImage-1K4M benchmarks, STFlow substantially outperforms state-of-the-art baselines and achieves over 18% relative improvements over the pathology foundation models.
Paperid:2066
Authors:Tomer Meir · Uri Shalit · Malka Gorfine
Abstract:
Tailoring treatments to individual needs is a central goal in fields such as medicine. A key step toward this goal is estimating Heterogeneous Treatment Effects (HTE)—the way treatments impact different subgroups. While crucial, HTE estimation is challenging with survival data, where time until an event (e.g., death) is key. Existing methods often assume complete observation, an assumption violated in survival data due to right-censoring, leading to bias and inefficiency. Cui et al. (2023) proposed a doubly-robust method for HTE estimation in survival data under no hidden confounders, combining a causal survival forest with an augmented inverse-censoring weighting estimator. However, we find it struggles under heavy censoring, which is common in rare-outcome problems such as Amyotrophic lateral sclerosis (ALS). Moreover, most current methods cannot handle instrumental variables, which are a crucial tool in the causal inference arsenal. We introduce Multiple Imputation for Survival Treatment Response (MISTR), a novel, general, and non-parametric method for estimating HTE in survival data. MISTR uses recursively imputed survival trees to handle censoring without directly modeling the censoring mechanism. Through extensive simulations and analysis of two real-world datasets—the AIDS Clinical Trials Group Protocol 175 and the Illinois unemployment dataset—we show that MISTR outperforms prior methods under heavy censoring in the no-hidden-confounders setting, and extends to the instrumental variable setting. To our knowledge, MISTR is the first non-parametric approach for HTE estimation with unobserved confounders via instrumental variables.
Paperid:2067
Authors:Chuqi CHEN · Yahong Yang · Yang Xiang · Wenrui Hao
Abstract:
Solving partial differential equations (PDEs) using neural networks has become a central focus in scientific machine learning. Training neural networks for sharp interface problems is particularly challenging due to certain parameters in the PDEs that introduce near-singularities in the loss function. In this study, we overcome this challenge by introducing a novel method based on homotopy dynamics to effectively manipulate these parameters. From a theoretical perspective, we analyze the effects of these parameters on training difficulty in sharp interface problems and establish the convergence of the proposed homotopy dynamics method. Experimentally, we demonstrate that our approach significantly accelerates convergence and improves the accuracy of capturing sharp interfaces. These findings present an efficient optimization strategy leveraging homotopy dynamics, offering a robust framework to extend the applicability of neural networks for solving PDEs with sharp interfaces.
Paperid:2068
Authors:Rongzhen Wang · Yan Zhang · Chenyu Zheng · Chongxuan Li · Guoqiang Wu
Abstract:
The success of large generative models has driven a paradigm shift, leveraging massive multi-source data to enhance model capabilities. However, the interaction among these sources remains theoretically underexplored. This paper provides a first attempt to fill this gap by rigorously analyzing multi-source training in conditional generative modeling, where each condition represents a distinct data source. Specifically, we establish a general distribution estimation error bound in average total variation distance for conditional maximum likelihood estimation based on the bracketing number. Our result shows that when source distributions share certain similarities and the model is expressive enough, multi-source training guarantees a sharper bound than single-source training. We further instantiate the general theory on conditional Gaussian estimation and deep generative models including autoregressive and flexible energy-based models, by characterizing their bracketing numbers. The results highlight that the number of sources and similarity among source distributions improve the advantage of multi-source training. Simulations and real-world experiments validate our theory.
Paperid:2069
Authors:Shuai Yi · Yixiong Zou · Yuhua Li · Ruixuan Li
Abstract:
Vision Transformer (ViT) has achieved remarkable success due to its large-scale pretraining on general domains, but it still faces challenges when applied to distant downstream domains that have only scarce training data, which gives rise to the Cross-Domain Few-Shot Learning (CDFSL) task. Inspired by Self-Attention's insensitivity to token orders, we find an interesting phenomenon neglected in current works: disrupting the continuity of image tokens (i.e., making pixels not smoothly transited across patches) in ViT leads to a noticeable performance decline in the general (source) domain but only a marginal decrease in downstream target domains. This questions the role of image tokens' continuity in ViT's generalization under large domain gaps. In this paper, we delve into this phenomenon for an interpretation. We find continuity aids ViT in learning larger spatial patterns, which are harder to transfer than smaller ones, enlarging domain distances. Meanwhile, it implies only smaller patterns within each patch could be transferred under extreme domain gaps. Based on this interpretation, we further propose a simple yet effective method for CDFSL that better disrupts the continuity of image tokens, encouraging the model to rely less on large patterns and more on smaller ones. Extensive experiments show the effectiveness of our method in reducing domain gaps and outperforming state-of-the-art works. Our codes will be released.
Paperid:2070
Authors:Christophe Vauthier · Anna Korba · Quentin Mérigot
Abstract:
In this paper, we investigate the properties of the Sliced Wasserstein Distance (SW) when employed as an objective functional. The SW metric has gained significant interest in the optimal transport and machine learning literature, due to its ability to capture intricate geometric properties of probability distributions while remaining computationally tractable, making it a valuable tool for various applications, including generative modeling and domain adaptation. Our study aims to provide a rigorous analysis of the critical points arising from the optimization of the SW objective. By computing explicit perturbations, we establish that stable critical points of SW cannot concentrate on segments. This stability analysis is crucial for understanding the behaviour of optimization algorithms for models trained using the SW objective. Furthermore, we investigate the properties of the SW objective, shedding light on the existence and convergence behavior of critical points. We illustrate our theoretical results through numerical experiments.
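The computational tractability the abstract mentions comes from SW's Monte Carlo form: project both distributions onto random directions and compare sorted one-dimensional projections. This is the standard estimator for equal-size point clouds, not the paper's analysis machinery; the function name and defaults are illustrative.

```python
import numpy as np

def sliced_wasserstein2(X, Y, n_proj=500, seed=0):
    # Monte Carlo estimate of SW_2 between two equal-size point clouds in R^d:
    # for each random direction on the sphere, the 1-D Wasserstein-2 distance
    # is computed by matching sorted projections.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)
        px, py = np.sort(X @ theta), np.sort(Y @ theta)
        total += np.mean((px - py) ** 2)
    return np.sqrt(total / n_proj)
```

The estimator is zero on identical clouds and grows with any rigid translation, which is the kind of perturbation behaviour the paper's critical-point analysis studies in far greater generality.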
Paperid:2071
Authors:Zhengyi Li · Yue Guan · Kang Yang · Yu Feng · Ning Liu · Yu Yu · Jingwen Leng · Minyi Guo
Abstract:
The wide deployment of the generative pretrained transformer (GPT) has raised privacy concerns for both clients and servers. While cryptographic primitives can be employed for secure GPT inference to protect the privacy of both parties, they introduce considerable performance overhead. To accelerate secure inference, this study proposes a public decoding and secure verification approach that utilizes public GPT models, motivated by the observation that securely decoding one token and securely decoding multiple tokens take similar latency. The client uses the public model to generate a set of tokens, which are then securely verified by the private model for acceptance. The efficiency of our approach depends on the acceptance ratio of tokens proposed by the public model, which we improve from two aspects: (1) a private sampling protocol optimized for cryptographic primitives and (2) model alignment using knowledge distillation. Our approach improves the efficiency of secure decoding while maintaining the same level of privacy and generation quality as standard secure decoding. Experiments demonstrate a $2.1\times \sim 6.0\times$ speedup compared to standard decoding across three pairs of public-private models and different network conditions.
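The draft-then-verify loop described here mirrors speculative sampling: a cheap (public) model proposes tokens and the expensive (private) model accepts or rejects them. The sketch below shows the standard plaintext acceptance rule only; the paper runs the verification under cryptographic protocols, and the function name and dict-based distributions are illustrative assumptions.

```python
import random

def verify_draft(tokens, draft_probs, target_probs, rng=random.Random(0)):
    # Standard speculative-sampling acceptance rule: accept draft token t with
    # probability min(1, p_target(t) / p_draft(t)); stop at the first rejection.
    # (A full implementation would then resample the rejected position from the
    # residual target distribution.)
    accepted = []
    for t, q, p in zip(tokens, draft_probs, target_probs):
        if rng.random() < min(1.0, p[t] / q[t]):
            accepted.append(t)
        else:
            break
    return accepted
```

When the public and private models agree exactly, every draft token is accepted, which is why the paper's distillation-based alignment directly improves the acceptance ratio and hence the speedup.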
Paperid:2072
Authors:Côme Fiegel · Pierre Menard · Tadashi Kozuno · Michal Valko · Vianney Perchet
Abstract:
We study the problem of learning in zero-sum matrix games with repeated play and bandit feedback. Specifically, we focus on developing uncoupled algorithms that guarantee, without communication between players, convergence of the last-iterate to a Nash equilibrium. Although the non-bandit case has been studied extensively, this setting has only been explored recently, with a bound of $\mathcal{O}(T^{-1/8})$ on the exploitability gap. We show that, for uncoupled algorithms, guaranteeing convergence of the policy profiles to a Nash equilibrium is detrimental to performance, with the best attainable rate being $\mathcal{O}(T^{-1/4})$ in contrast to the usual $\mathcal{O}(T^{-1/2})$ rate for convergence of the average iterates. We then propose two algorithms that achieve this optimal rate. The first algorithm leverages a straightforward tradeoff between exploration and exploitation, while the second employs a regularization technique based on a two-step mirror descent approach.
Paperid:2073
Authors:Wonyoung Kim · Sungwoo PARK · Garud Iyengar · Assaf Zeevi · Min-hwan Oh
Abstract:
We introduce a novel linear bandit problem with partially observable features, resulting in partial reward information and spurious estimates. Without properly addressing the latent part, the regret can grow linearly in the decision horizon $T$, as the latent features' influence on rewards is unknown. To tackle this, we propose a novel analysis to handle the latent features and an algorithm that achieves sublinear regret. The core of our algorithm involves (i) augmenting basis vectors orthogonal to the observed feature space, and (ii) introducing an efficient doubly robust estimator. Our approach achieves a regret bound of $\tilde{O}(\sqrt{(d + d_h)T})$, where $d$ is the dimension of observed features, and $d_h$ is the \emph{unknown} dimension of the subspace of the unobserved features. Notably, our algorithm requires no prior knowledge of the unobserved feature space, which may expand as more features become hidden. Numerical experiments confirm that our algorithm outperforms both non-contextual multi-armed bandits and linear bandit algorithms that depend solely on observed features.
Paperid:2074
Authors:Lincan Cai · Jingxuan Kang · Shuang Li · Wenxuan Ma · Binhui Xie · Zhida Qin · Jian Liang
Abstract:
Pretrained vision-language models (VLMs), e.g., CLIP, demonstrate impressive zero-shot capabilities on downstream tasks. Prior research highlights the crucial role of visual augmentation techniques, like random cropping, in alignment with fine-grained class descriptions generated by large language models (LLMs), significantly enhancing zero-shot performance by incorporating multi-view information. However, the inherent randomness of these augmentations can inevitably introduce background artifacts and cause models to overly focus on local details, compromising global semantic understanding. To address these issues, we propose an Attention-Based Selection (ABS) method from local details to global context, which applies attention-guided cropping in both raw images and feature space, supplementing global semantic information through strategic feature selection. Additionally, we introduce a soft matching technique to effectively filter LLM descriptions for better alignment. ABS achieves state-of-the-art performance on out-of-distribution generalization and zero-shot classification tasks. Notably, ABS is training-free and even rivals few-shot and test-time adaptation methods.
Paperid:2075
Authors:Anuran Makur · Japneet Singh
Abstract:
In this work, we develop a hypothesis testing framework to determine whether pairwise comparison data is generated by an underlying *generalized Thurstone model* $\mathcal{T}_F$ for a given choice function $F$. While prior work has predominantly focused on parameter estimation and uncertainty quantification for such models, we address the fundamental problem of hypothesis testing of $\mathcal{T}_F$ models. We introduce a minimax separation distance to quantify the gap between general pairwise comparison models and the class of $\mathcal{T}_F$ models and derive upper and lower bounds on the critical threshold that depend on the topology of the observation graph. For the special case of a complete observation graph, this threshold scales as $\Theta((nk)^{1/2})$, where $n$ is the number of agents and $k$ is the number of comparisons per pair. Furthermore, we propose a hypothesis test based on our separation distance, construct confidence intervals, and establish time-uniform bounds on type I and II errors using reverse martingale techniques. Additionally, we derive minimax lower bounds via information-theoretic methods and validate our results through experiments on synthetic and real-world datasets.
Paperid:2076
Authors:Taoyang Qin · Ke-Jia CHEN · Zheng Liu
Abstract:
Spectral Graph Neural Networks process graph signals using the spectral properties of the normalized graph Laplacian matrix. However, the frequent occurrence of repeated eigenvalues limits the expressiveness of spectral GNNs. To address this, we propose a higher-dimensional sheaf Laplacian matrix, which not only encodes the graph's topological information but also increases the upper bound on the number of distinct eigenvalues. The sheaf Laplacian matrix is derived from carefully designed perturbations of the block form of the normalized graph Laplacian, yielding a perturbed sheaf Laplacian (PSL) matrix with more distinct eigenvalues. We provide a theoretical analysis of the expressiveness of spectral GNNs equipped with the PSL and establish perturbation bounds for the eigenvalues. Extensive experiments on benchmark datasets for node classification demonstrate that incorporating the perturbed sheaf Laplacian enhances the performance of spectral GNNs.
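The repeated-eigenvalue limitation is easy to exhibit numerically: on the complete graph $K_4$, the normalized Laplacian has only two distinct eigenvalues, so any polynomial spectral filter can realize at most two distinct frequency responses. This demo motivates the problem only; it is not the paper's sheaf-Laplacian construction.

```python
import numpy as np

def normalized_laplacian(A):
    # L = I - D^{-1/2} A D^{-1/2} for an adjacency matrix A with no isolated nodes.
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    return np.eye(len(A)) - d_inv_sqrt @ A @ d_inv_sqrt

# Complete graph K4: eigenvalue 4/3 repeats three times alongside the trivial 0,
# so the spectrum collapses to just two distinct values.
A = np.ones((4, 4)) - np.eye(4)
eigs = np.linalg.eigvalsh(normalized_laplacian(A))
n_distinct = len(np.unique(np.round(eigs, 6)))
```

The PSL construction in the paper perturbs a block-lifted version of this matrix precisely to break such multiplicities and raise the number of distinct eigenvalues.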
Paperid:2077
Authors:Zichen Liu · Xu Zou · Gang Hua · Jiahuan Zhou
Abstract:
Visual prompting techniques are widely used to efficiently fine-tune pretrained Vision Transformers (ViTs) by learning a small set of shared prompts for all tokens. However, existing methods overlook the unique roles of different tokens in conveying discriminative information and interact with all tokens using the same prompts, thereby limiting the representational capacity of ViT. This often leads to indistinguishable and biased prompt-extracted features, hindering performance. To address this issue, we propose a plug-and-play Token Coordinated Prompt Attention (TCPA) module, which assigns specific coordinated prompts to different tokens for attention-based interactions. Firstly, recognizing the distinct functions of CLS and image tokens—global information aggregation and local feature extraction—we disentangle the prompts into CLS Prompts and Image Prompts, which interact exclusively with CLS tokens and image tokens through attention mechanisms. This enhances their respective discriminative abilities. Furthermore, as different image tokens correspond to distinct image patches and contain diverse information, we employ a matching function to automatically assign coordinated prompts to individual tokens. This enables more precise and targeted attention interactions, significantly improving the diversity and representational capacity of the extracted features. Extensive experiments across various benchmarks and state-of-the-art methods demonstrate that TCPA significantly enhances the diversity and discriminative power of the extracted features. The code will be released publicly.
Paperid:2078
Authors:Alireza Amiribavandpour · Xinting Huang · Mark Rofin · Michael Hahn
Abstract:
Chain-of-thought reasoning and scratchpads have emerged as critical tools for enhancing the computational capabilities of transformers. While theoretical results show that polynomial-length scratchpads can extend transformers' expressivity from $TC^0$ to $PTIME$, their required length remains poorly understood. Empirical evidence suggests that transformers need scratchpads even for many problems in $TC^0$, such as Parity or Multiplication, challenging optimistic bounds derived from circuit complexity. In this work, we initiate the study of systematic lower bounds for the number of CoT steps across different algorithmic problems, in the hard-attention regime. We study a variety of algorithmic problems, and provide bounds that are tight up to logarithmic factors. Overall, these results contribute to emerging understanding of the power and limitations of chain-of-thought reasoning.
Paperid:2079
Authors:Kunda Yan · Min Zhang · Sen Cui · Qu Zikun · Bo Jiang · Feng Liu · Changshui Zhang
Abstract:
Model merging aims to integrate the strengths of multiple fine-tuned models into a unified model while preserving task-specific capabilities. Existing methods, represented by task arithmetic, are typically classified into global- and local-aware methods. However, global-aware methods inevitably cause parameter interference, while local-aware methods struggle to maintain the effectiveness of task-specific details in the merged model. To address these limitations, we propose a Consensus-Aware Localized Merging (CALM) method which incorporates localized information aligned with global task consensus, ensuring its effectiveness post-merging. CALM consists of three key components: (1) class-balanced entropy minimization sampling, providing a more flexible and reliable way to leverage unsupervised data; (2) an efficiency-aware framework, selecting a small set of tasks for sequential merging with high scalability; (3) a consensus-aware mask optimization, aligning localized binary masks with global task consensus and merging them conflict-free. Experiments demonstrate the superiority and robustness of our CALM, significantly outperforming existing baseline methods. Code is available at https://anonymous.4open.science/r/CALM.
Paperid:2080
Authors:Jing Huang · Junyi Tao · Thomas Icard · Diyi Yang · Christopher Potts
Abstract:
Does understanding model internals help us to predict the correctness of model outputs? In the present work, we provide a positive answer to this question. Through a diverse set of language modeling tasks—including symbol manipulation, knowledge retrieval, and instruction following—we show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior. Specifically, we propose two methods that leverage causal mechanisms for predicting correctness. Both outperform methods that rely on non-causal features, especially under distribution shifts, where such estimates are most crucial. Our work thus highlights a novel and significant application for internal causal analysis of language models.
Paperid:2081
Authors:Talor Abramovich · Meet Udeshi · Minghao Shao · Kilian Lieret · Haoran Xi · Kimberly Milner · Sofija Jancheska · John Yang · Carlos Jimenez · Farshad Khorrami · Prashanth Krishnamurthy · Brendan Dolan-Gavitt · Muhammad Shafique · Karthik Narasimhan · Ramesh Karri · Ofir Press
Abstract:
Although language model (LM) agents have demonstrated increased performance in multiple domains, including coding and web browsing, their success in cybersecurity has been limited. We present EnIGMA, an LM agent for autonomously solving Capture The Flag (CTF) challenges. We introduce new tools and interfaces to improve the agent's ability to find and exploit security vulnerabilities, focusing on interactive terminal programs. These novel Interactive Agent Tools enable LM agents, for the first time, to run interactive utilities, such as a debugger and a server connection tool, which are essential for solving these challenges. Empirical analysis on 390 CTF challenges across four benchmarks demonstrates that these new tools and interfaces substantially improve our agent's performance, achieving state-of-the-art results on NYU CTF, Intercode-CTF, and CyBench. Finally, we analyze data leakage, developing new methods to quantify it and identifying a new phenomenon we term soliloquizing, where the model self-generates hallucinated observations without interacting with the environment.
Paperid:2082
Authors:Weilun Feng · Chuanguang Yang · Haotong Qin · Xiangqi Li · Yu Wang · Zhulin An · Libo Huang · Boyu Diao · Zixiang Zhao · Yongjun Xu · Michele Magno
Abstract:
Diffusion transformers (DiT) have demonstrated exceptional performance in video generation. However, their large number of parameters and high computational complexity limit their deployment on edge devices. Quantization can reduce storage requirements and accelerate inference by lowering the bit-width of model parameters. Yet, existing quantization methods for image generation models do not generalize well to video generation tasks. We identify two primary challenges: the loss of information during quantization and the misalignment between optimization objectives and the unique requirements of video generation. To address these challenges, we present **Q-VDiT**, a quantization framework specifically designed for video DiT models. From the quantization perspective, we propose the *Token-aware Quantization Estimator* (TQE), which compensates for quantization errors in both the token and feature dimensions. From the optimization perspective, we introduce *Temporal Maintenance Distillation* (TMD), which preserves the spatiotemporal correlations between frames and enables the optimization of each frame with respect to the overall video context. Our W3A6 Q-VDiT achieves a scene consistency score of 23.40, setting a new benchmark and outperforming the current state-of-the-art quantization methods by **1.9$\times$**.
Paperid:2083
Authors:Manwen Liao · Yan Zhu · Weitian Zhang · Yuxiang Yang
Abstract:
Characterizing quantum states is essential for advancing many quantum technologies. Recently, deep neural networks have been applied to learn quantum states by generating implicit representations that map them into classical vectors. Despite their success in predicting properties of the states, these representations remain a black box, lacking insights into strategies for experimental reconstruction. In this work, we aim to open this black box by developing explicit representations of quantum states through the generation of surrogate preparation circuits for property estimation, using a reinforcement learning agent with a local fidelity reward function. Relying solely on measurement data from a few neighboring qubits, our agent accurately recovers properties of target states. Specifically, we design a quantum measurement feature aggregation block which is used to extract global features of quantum states from local measurement data. We also theoretically analyze the global fidelity the agent can achieve when it learns a good local approximation. Extensive experiments demonstrate the effectiveness of our framework in learning various quantum states of up to 100 qubits, including those generated by shallow Instantaneous Quantum Polynomial circuits, evolved by Ising Hamiltonians, and many-body ground states. The learned circuit representations can be further applied to Hamiltonian learning as a downstream task, using a simple linear model.
Paperid:2084
Authors:Yu Wang · Dmitry Krotov · Yuanzhe Hu · Yifan Gao · Wangchunshu Zhou · Julian McAuley · Dan Gutfreund · Rogerio Feris · Zexue He
Abstract:
Equipping large language models (LLMs) with latent-space memory has attracted increasing attention as they can extend the context window of existing language models. However, retaining information from the distant past remains a challenge. For example, MemoryLLM (Wang et al., 2024a), as a representative work with latent-space memory, compresses past information into hidden states across all layers, forming a memory pool of 1B parameters. While effective for sequence lengths up to 16k tokens, it struggles to retain knowledge beyond 20k tokens. In this work, we address this limitation by introducing M+, a memory-augmented model based on MemoryLLM that significantly enhances long-term information retention. M+ integrates a long-term memory mechanism with a co-trained retriever, dynamically retrieving relevant information during text generation. We evaluate M+ on diverse benchmarks, including long-context understanding and knowledge retention tasks. Experimental results show that M+ significantly outperforms MemoryLLM and recent strong baselines, extending knowledge retention from under 20k to over 160k tokens with similar GPU memory overhead.
Paperid:2085
Authors:Yiming Sun · Xin Li · Pengfei Zhu · Qinghua Hu · Dongwei Ren · Huiying Xu · Xinzhong Zhu
Abstract:
Multimodal image fusion integrates information from different imaging modalities and is crucial for applications such as rescue and surveillance. However, multimodal images often suffer from degradations, including noise, blur, and haze in visible images, as well as stripe noise in infrared images, which degrade fusion quality and performance. To address these challenges, we propose the Task-Gated Multi-Expert Collaboration Network (TG-ECNet), a novel framework for degraded multimodal image fusion. The network incorporates a task-gated router, which consists of degradation-aware gating in the encoder and fusion-aware gating in the decoder. Degradation-aware gating dynamically selects processing pathways based on degradation types, while fusion-aware gating aggregates effective features from multiple modalities to achieve higher-quality fusion. Building on this task-gated router, we design a multi-expert collaborative framework that bridges image restoration and fusion tasks through a two-stage training strategy. This strategy balances the two tasks and minimizes interference during optimization by decoupling the learning processes, ultimately enabling all-in-one image restoration and fusion. Experimental results on multiple benchmark datasets for degraded image restoration and fusion demonstrate that TG-ECNet significantly improves fusion performance under various degradations, enhancing robustness for real-world applications.
Paperid:2086
Authors:Judy Hanwen Shen · Ellen Vitercik · Anders Wikum
Abstract:
The field of algorithms with predictions incorporates machine learning advice in the design of online algorithms to improve real-world performance. While this theoretical framework often assumes uniform reliability across all predictions, modern machine learning models can now provide instance-level uncertainty estimates. In this paper, we propose calibration as a principled and practical tool to bridge this gap, demonstrating the benefits of calibrated advice through two case studies: the ski rental and online job scheduling problems. For ski rental, we design an algorithm that achieves optimal prediction-dependent performance and prove that, in high-variance settings, calibrated advice offers more effective guidance than alternative methods for uncertainty quantification. For job scheduling, we demonstrate that using a calibrated predictor leads to significant performance improvements over existing methods. Evaluations on real-world data validate our theoretical findings, highlighting the practical impact of calibration for algorithms with predictions.
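For intuition about the ski-rental setting this abstract builds on, a prediction-guided strategy can be sketched in a few lines. This is not the paper's calibrated algorithm but the standard trust-parameterized scheme of Purohit et al. (2018); the function name and the `lam` parameter are our own labels.

```python
import math

def ski_rental_cost(buy_price: int, predicted_days: int, true_days: int,
                    lam: float = 0.5) -> int:
    """Cost of a prediction-guided ski-rental strategy.

    Renting costs 1 per day; buying costs `buy_price` once. The trust
    parameter `lam` in (0, 1] interpolates between fully following the
    prediction (lam -> 0) and the classic break-even rule (lam = 1).
    """
    if predicted_days >= buy_price:
        buy_day = math.ceil(lam * buy_price)      # prediction says long season: buy early
    else:
        buy_day = math.ceil(buy_price / lam)      # prediction says short season: delay buying
    if true_days >= buy_day:
        return (buy_day - 1) + buy_price          # rented until the purchase day, then bought
    return true_days                              # season ended while still renting
```

With a perfect long-season prediction and full trust the cost approaches the offline optimum `min(buy_price, true_days)`; a calibrated predictor, as studied in the paper, would additionally tell us how much to trust each instance-level prediction when choosing `lam`.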
Paperid:2087
Authors:Feiran Li · Qianqian Xu · Shilong Bao · Zhiyong Yang · Xiaochun Cao · Qingming Huang
Abstract:
Concept erasing has recently emerged as an effective paradigm to prevent text-to-image diffusion models from generating visually undesirable or even harmful content. However, current removal methods rely heavily on manually crafted text prompts, making it challenging to achieve high erasure performance (efficacy) while minimizing the impact on other benign concepts (usability), as illustrated in Fig. 1. In this paper, we attribute these limitations to the inherent gap between the text and image modalities, which makes it hard to transfer the intricately entangled concept knowledge from text prompts to the image generation process. To address this, we propose a novel solution that directly integrates visual supervision into the erasure process, introducing the first text-image Collaborative Concept Erasing (Co-Erasing) framework. Specifically, Co-Erasing describes the concept jointly by text prompts and the corresponding undesirable images induced by the prompts, and then reduces the generation probability of the target concept through negative guidance. This approach effectively bypasses the knowledge gap between text and image, significantly enhancing erasure efficacy. Additionally, we design a text-guided image concept refinement strategy that directs the model to focus on visual features most relevant to the specified text concept, minimizing disruption to other benign concepts. Finally, comprehensive experiments suggest that Co-Erasing significantly outperforms state-of-the-art erasure approaches with a better trade-off between efficacy and usability.
Paperid:2088
Authors:Kaixuan Huang · Jiacheng Guo · Zihao Li · Xiang Ji · Jiawei Ge · Wenzhe Li · Yingqing Guo · Tianle Cai · Hui Yuan · Runzhe Wang · Yue Wu · Ming Yin · Shange Tang · Yangsibo Huang · Chi Jin · Xinyun Chen · Chiyuan Zhang · Mengdi Wang
Abstract:
A genuinely robust reasoning model should be able to function correctly when the problem statement is modified out-of-training-distribution. Prior work has shown that language models struggle on mathematical benchmarks when questions undergo simple perturbations -- modifications that still preserve the underlying reasoning patterns of the solutions. However, no work has explored hard perturbations, which fundamentally change the nature of the problem so that the original solution steps do not apply. To bridge the gap, we construct MATH-P-Simple and MATH-P-Hard via simple perturbation and hard perturbation, respectively. Each consists of 279 perturbed math problems derived from level-5 (hardest) problems in the MATH dataset (Hendrycks et al., 2021). We observe significant performance drops on MATH-P-Hard across various models, including o1-mini (-16.49%) and gemini-2.0-flash-thinking (-12.9%). We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills without assessing their applicability to modified contexts. This issue is amplified when using original problems for in-context learning. We call for research efforts to address this challenge, which is critical for developing more robust and reliable reasoning models.
Paperid:2089
Authors:Saeed Mahloujifar · Luca Melis · Kamalika Chaudhuri
Abstract:
Empirical auditing has emerged as a means of catching some of the flaws in the implementation of privacy-preserving algorithms. Existing auditing mechanisms, however, are either computationally inefficient -- requiring multiple runs of the machine learning algorithms -- or suboptimal in calculating the empirical privacy. In this work, we present a tight and efficient auditing procedure and analysis that can effectively assess the privacy of mechanisms. Our approach is efficient: similar to the recent work of Steinke, Nasr and Jagielski (2023), our auditing procedure leverages the randomness of examples in the input dataset and requires only a single run of the target mechanism. It is also more accurate: we provide a novel analysis that enables us to achieve tight empirical privacy estimates by using the hypothesized $f$-DP curve of the mechanism, which provides a more accurate measure of privacy than the traditional $(\epsilon, \delta)$ differential privacy parameters. We use our auditing procedure and analysis to obtain empirical privacy estimates, demonstrating that our auditing procedure delivers tighter privacy estimates.
Paperid:2090
Authors:Yuqi Fan · Zhiyong Cui · Zhenning Li · Yilong Ren · Haiyang Yu
Abstract:
Reliable planning is crucial for achieving autonomous driving. Rule-based planners are efficient but lack generalization, while learning-based planners excel in generalization yet have limitations in real-time performance and interpretability. In long-tail scenarios, these challenges make planning particularly difficult. To leverage the strengths of both rule-based and learning-based planners, we propose the Scenario-Aware Hybrid Planner (SAH-Drive) for closed-loop vehicle trajectory planning. Inspired by human driving behavior, SAH-Drive combines a lightweight rule-based planner and a comprehensive learning-based planner, utilizing a dual-timescale decision neuron to determine the final trajectory. To enhance the computational efficiency and robustness of the hybrid planner, we also employ a diffusion proposal number regulator and trajectory fusion techniques. The experimental results show that the proposed method significantly improves the generalization capability of the planning system, achieving state-of-the-art performance in interPlan, while maintaining computational efficiency without incurring substantial additional runtime.
Paperid:2091
Authors:Yuxuan Song · Juntong Shi · Jingjing Gong · Minkai Xu · Stefano Ermon · Hao Zhou · Wei-Ying Ma
Abstract:
Though typically represented by discrete node and edge attributes, graph topological information can be sufficiently captured by the graph spectrum in a continuous space. It is believed that incorporating the continuity of graph topological information into the generative process design could establish a superior paradigm for graph generative modeling. Motivated by this prior and by recent advancements in the generative paradigm, we propose Graph Bayesian Flow Networks (GraphBFN) in this paper, a principled generative framework that designs an alternative generative process emphasizing the dynamics of topological information. Unlike recent discrete diffusion-based methods, GraphBFN employs the continuous counts derived from sampling infinitely many times from a categorical distribution as latents to facilitate a smooth decomposition of topological information, demonstrating enhanced effectiveness. To realize this concept effectively, we further develop an advanced sampling strategy and new time-scheduling techniques to overcome practical barriers and boost performance. Through extensive experimental validation on both generic graph and molecular graph generation tasks, GraphBFN consistently achieves superior or competitive performance with significantly higher training and sampling efficiency.
Paperid:2092
Authors:Xing Xi · Xing Fu · Weiqiang Wang · Ronghua Luo
Abstract:
Open World Object Detection (OWOD) requires the detector to continuously identify and learn new categories. Existing methods rely on a large language model (LLM) to describe the visual attributes of known categories and use these attributes to mark potential objects. The performance of such methods is influenced by the accuracy of the LLM descriptions, and selecting appropriate attributes during incremental learning remains a challenge. In this paper, we propose a novel OWOD framework, termed OW-VAP, which operates independently of LLMs and requires only minimal object descriptions to detect unknown objects. Specifically, we propose a Visual Attribute Parser (VAP) that parses the attributes of visual regions and assesses object potential based on the similarity between these attributes and the object descriptions. To enable the VAP to recognize objects in unlabeled areas, we exploit potential objects within background regions. Finally, we propose Probabilistic Soft Label Assignment (PSLA) to prevent optimization conflicts from misidentifying background as foreground. Comparative results on the OWOD benchmark demonstrate that our approach surpasses existing state-of-the-art methods with a +13 improvement in U-Recall and a +8 increase in U-AP for unknown detection. Furthermore, OW-VAP approaches the unknown recall upper limit of the detector.
Paperid:2093
Authors:Nuno Gonçalves · Marcos V. Treviso · Andre Martins
Abstract:
The computational cost of softmax-based attention in transformers limits their applicability to long-context tasks. Adaptive sparsity, of which $\alpha$-entmax attention is an example, offers a flexible data-dependent alternative, but existing implementations are inefficient and do not leverage the sparsity to obtain runtime and memory gains. In this work, we propose AdaSplash, which combines the efficiency of GPU-optimized algorithms with the sparsity benefits of $\alpha$-entmax. We first introduce a hybrid Halley-bisection algorithm, resulting in a 7-fold reduction in the number of iterations needed to compute the $\alpha$-entmax transformation. Then, we implement custom Triton kernels to efficiently handle adaptive sparsity. Experiments with RoBERTa and ModernBERT for text classification and single-vector retrieval, along with GPT-2 for language modeling, show that our method achieves substantial improvements in runtime and memory efficiency compared to existing $\alpha$-entmax implementations. It approaches -- and in some cases surpasses -- the efficiency of highly optimized softmax implementations like FlashAttention-2, enabling long-context training while maintaining strong task performance.
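The $\alpha$-entmax transform the abstract refers to can be computed by root-finding on its normalizing threshold. Below is a minimal NumPy sketch using plain bisection (the paper's Halley-bisection hybrid converges in roughly 7x fewer iterations); the function name and iteration budget are our own choices.

```python
import numpy as np

def entmax_bisect(z, alpha=1.5, n_iter=50):
    """alpha-entmax via bisection on the threshold tau.

    Solves sum_i [ (alpha-1) * z_i - tau ]_+^{1/(alpha-1)} = 1. The
    left-hand side is monotone decreasing in tau, so bisection on the
    bracket [max(s) - 1, max(s)] (where s = (alpha-1) z) converges.
    """
    s = (alpha - 1.0) * np.asarray(z, dtype=float)
    lo, hi = s.max() - 1.0, s.max()      # f(lo) >= 1 and f(hi) = 0
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        p = np.clip(s - tau, 0.0, None) ** (1.0 / (alpha - 1.0))
        if p.sum() >= 1.0:
            lo = tau                     # threshold still too small
        else:
            hi = tau
    p = np.clip(s - lo, 0.0, None) ** (1.0 / (alpha - 1.0))
    return p / p.sum()                   # normalize away residual bisection error
```

Unlike softmax, entries whose scores fall below the threshold come out exactly zero, which is the sparsity AdaSplash's Triton kernels exploit; $\alpha = 2$ recovers sparsemax and $\alpha \to 1$ recovers softmax.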
Paperid:2094
Authors:Marko Medvedev · Kaifeng Lyu · Dingli Yu · Sanjeev Arora · Zhiyuan Li · Nati Srebro
Abstract:
Weak-to-Strong Generalization (Burns et al., 2023) is the phenomenon whereby a strong student, say GPT-4, learns a task from a weak teacher, say GPT-2, and ends up significantly outperforming the teacher. We show that this phenomenon does not require a strong learner like GPT-4. We consider student and teacher that are random feature models, described by two-layer networks with a random and fixed bottom layer and trained top layer. A ‘weak’ teacher, with a small number of units (i.e. random features), is trained on the population, and a ‘strong’ student, with a much larger number of units (i.e. random features), is trained only on labels generated by the weak teacher. We demonstrate, prove and understand how the student can outperform the teacher, even though trained only on data labeled by the teacher. We also explain how such weak-to-strong generalization is enabled by early stopping. Importantly, we also show the quantitative limits of weak-to-strong generalization in this model.
Paperid:2095
Authors:Alex Clinton · Yiding Chen · Jerry Zhu · Kirthevasan Kandasamy
Abstract:
We study a collaborative learning problem where $m$ agents aim to estimate a vector $\mu =(\mu_1,\ldots,\mu_d)\in \mathbb{R}^d$ by sampling from associated univariate normal distributions $\{\mathcal{N}(\mu_k, \sigma^2)\}_{k\in[d]}$. Agent $i$ incurs a cost $c_{i,k}$ to sample from $\mathcal{N}(\mu_k, \sigma^2)$. Instead of working independently, agents can exchange data, collecting cheaper samples and sharing them in return for costly data, thereby reducing both costs and estimation error. We design a mechanism to facilitate such collaboration, while addressing two key challenges: ensuring *individually rational (IR) and fair outcomes* so all agents benefit, and *preventing strategic behavior* (e.g. non-collection, data fabrication) to avoid socially undesirable outcomes. We design a mechanism and an associated Nash equilibrium (NE) which minimizes the social penalty -- the sum of agents' estimation errors and collection costs -- while being IR for all agents. We achieve a $\mathcal{O}(\sqrt{m})$-approximation to the minimum social penalty in the worst case and an $\mathcal{O}(1)$-approximation under favorable conditions. Additionally, we establish three hardness results: no nontrivial mechanism *(i)* guarantees a dominant strategy equilibrium where agents report truthfully, *(ii)* is IR for every strategy profile of other agents, or *(iii)* avoids a worst-case $\Omega(\sqrt{m})$ price of stability in any NE. Finally, by integrating concepts from axiomatic bargaining, we demonstrate that our mechanism supports fairer outcomes than one which minimizes the social penalty.
Paperid:2096
Authors:Taneesh Gupta · Rahul Madhavan · Xuchao Zhang · Chetan Bansal · Saravanakumar Rajmohan
Abstract:
Multi-preference optimization enriches language-model alignment beyond pairwise preferences by contrasting entire sets of helpful and undesired responses, enabling richer training signals for large language models. During self-play alignment, these models often produce numerous candidate answers per query, making it computationally infeasible to include all of them in the training objective. We propose Active Multi-Preference Optimization (AMPO), which combines on-policy generation, a multi-preference group-contrastive loss, and active subset selection. Specifically, we score and embed large candidate pools of responses, then pick a small but informative subset -- covering reward extremes and distinct semantic clusters -- for preference optimization. The resulting contrastive-training scheme identifies not only the best and worst answers but also subtle, underexplored modes crucial for robust alignment. Theoretically, we provide guarantees of expected reward maximization using our active selection method. Empirically, AMPO achieves state-of-the-art results on AlpacaEval with Llama 8B.
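The subset-selection idea described above (keep the reward extremes plus one representative per semantic cluster) can be sketched concretely. This is an illustrative stand-in, not AMPO's actual selection rule: the clustering here is a hand-rolled k-means over response embeddings, and all names (`select_subset`, `k_clusters`) are our own.

```python
import numpy as np

def _kmeans(X, k, iters=20, seed=0):
    """Tiny Lloyd's k-means; returns a cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def select_subset(rewards, embeddings, k_clusters=3):
    """Pick a small informative subset of candidate responses:
    the best- and worst-rewarded answers, plus the top-rewarded
    member of each semantic cluster."""
    rewards = np.asarray(rewards, dtype=float)
    X = np.asarray(embeddings, dtype=float)
    chosen = {int(np.argmax(rewards)), int(np.argmin(rewards))}  # reward extremes
    labels = _kmeans(X, k_clusters)
    for j in range(k_clusters):
        members = np.where(labels == j)[0]
        if len(members):
            chosen.add(int(members[np.argmax(rewards[members])]))  # best of cluster
    return sorted(chosen)
```

The selected indices would then feed the group-contrastive preference loss, keeping the training objective small while still covering both reward extremes and distinct answer modes.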
Paperid:2097
Authors:Saqib Javed · Hieu Le · Mathieu Salzmann
Abstract:
A key challenge in Domain Generalization (DG) is preventing overfitting to source domains, which can be mitigated by finding flatter minima in the loss landscape. In this work, we propose Quantization-aware Training for Domain Generalization (QT-DoG) and demonstrate that weight quantization effectively leads to flatter minima in the loss landscape, thereby enhancing domain generalization. Unlike traditional quantization methods focused on model compression, QT-DoG exploits quantization as an implicit regularizer by inducing noise in model weights, guiding the optimization process toward flatter minima that are less sensitive to perturbations and overfitting. We provide both an analytical perspective and empirical evidence demonstrating that quantization inherently encourages flatter minima, leading to better generalization across domains. Moreover, with the benefit of reducing the model size through quantization, we demonstrate that an ensemble of multiple quantized models yields higher accuracy than the state-of-the-art DG approaches with no computational or memory overheads.
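The noise-injection mechanism the abstract appeals to is easy to see in a plain uniform quantizer: rounding perturbs each weight by at most half a quantization step. The sketch below is a generic symmetric weight quantizer for illustration, not the paper's training pipeline, and the function name is our own.

```python
import numpy as np

def quantize_weights(w, bits=8):
    """Uniform symmetric weight quantization.

    Rounding to the nearest grid point perturbs each weight by at most
    scale/2 -- bounded parameter noise, which is the implicit
    regularization effect QT-DoG attributes to quantization.
    """
    w = np.asarray(w, dtype=float)
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8 bits
    scale = np.abs(w).max() / qmax
    if scale == 0.0:
        return w.copy()
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale
```

In quantization-aware training this operation is typically applied in the forward pass with a straight-through gradient estimator, so the optimizer keeps seeing the rounded (perturbed) weights and is pushed toward regions where such perturbations barely change the loss, i.e. flatter minima.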
Paperid:2098
Authors:Yonathan Efroni · Ben Kretzu · Daniel Jiang · Jalaj Bhandari · Zheqing Zhu · Karen Ullrich
Abstract:
To date, the multi-objective optimization literature has mainly focused on conflicting objectives, studying the Pareto front, or requiring users to balance trade-offs. Yet, in machine learning practice, there are many scenarios where such conflict does not take place. Recent findings from multi-task learning, reinforcement learning, and LLM training show that diverse related tasks can enhance performance across objectives simultaneously. Despite this evidence, this phenomenon has not been examined from an optimization perspective. This leads to a lack of generic gradient-based methods that can scale to scenarios with a large number of related objectives. To address this gap, we introduce the Aligned Multi-Objective Optimization framework, propose new algorithms for this setting, and provide theoretical guarantees of their superior performance compared to naive approaches.
Paperid:2099
Authors:Renaud Gaucher · Aymeric Dieuleveut · Hadrien Hendrikx
Abstract:
In decentralized machine learning, different devices communicate in a peer-to-peer manner to collaboratively learn from each other's data. Such approaches are vulnerable to misbehaving (or Byzantine) devices. We introduce $\mathrm{F}\text{-}\mathrm{RG}$, a general framework for building robust decentralized algorithms with guarantees arising from robust-sum-like aggregation rules $\mathrm{F}$. We then investigate the notion of *breakdown point*, and show an upper bound on the number of adversaries that decentralized algorithms can tolerate. We introduce a practical robust aggregation rule, coined $\mathrm{CS_{ours}}$, such that $\mathrm{CS_{ours}\text{-}RG}$ has a near-optimal breakdown. Other choices of aggregation rules lead to existing algorithms such as $\rm ClippedGossip$ or $\rm NNA$. We give experimental evidence to validate the effectiveness of $\mathrm{CS_{ours}\text{-}RG}$ and highlight the gap with $\mathrm{NNA}$, in particular against a novel attack tailored to decentralized communications.
Paperid:2100
Authors:Junwei Su · Chuan Wu
Abstract:
This paper studies the interplay between learning algorithms and graph structure for graph neural networks (GNNs). Existing theoretical studies on the learning dynamics of GNNs primarily focus on the convergence rates of learning algorithms under the interpolation regime (noise-free) and offer only a crude connection between these dynamics and the actual graph structure (e.g., maximum degree). This paper aims to bridge this gap by investigating the excess risk (generalization performance) of learning algorithms in GNNs within the generalization regime (with noise). Specifically, we extend the conventional settings from the learning theory literature to the context of GNNs and examine how graph structure influences the performance of learning algorithms such as stochastic gradient descent (SGD) and Ridge regression. Our study makes several key contributions toward understanding the interplay between graph structure and learning in GNNs. First, we derive the excess risk profiles of SGD and Ridge regression in GNNs and connect these profiles to the graph structure through spectral graph theory. With this established framework, we further explore how different graph structures (regular vs. power-law) impact the performance of these algorithms through comparative analysis. Additionally, we extend our analysis to multi-layer linear GNNs, revealing an increasing non-isotropic effect on the excess risk profile, thereby offering new insights into the over-smoothing issue in GNNs from the perspective of learning algorithms. Our empirical results align with our theoretical predictions, \emph{collectively showcasing a coupling relation among graph structure, GNNs and learning algorithms, and providing insights on GNN algorithm design and selection in practice.}
Paperid:2101
Authors:Gizem Yüce · Aditya Vardhan Varre · Nicolas Flammarion
Abstract:
In this article, we explore the loss landscape of next-token prediction with transformers. Specifically, we focus on learning in-context n-gram language models with cross-entropy loss using a simplified two-layer transformer. We design a series of transformers that represent $k$-grams (for $k \leq n$) for which the gradient of the population loss approaches zero in the limit of infinite sequence length. This construction reveals a key property of the loss landscape: \emph{$k$-grams are stationary points of the population cross-entropy loss}, offering theoretical insights for widely observed empirical phenomena such as stage-wise learning dynamics and emergent phase transitions. These insights are further supported by comprehensive numerical experiments that illustrate the dynamics of learning $n$-grams, characterized by jumps between stationary points.
Paperid:2102
Authors:Edward Yeo · Yuxuan Tong · Xinyao Niu · Graham Neubig · Xiang Yue
Abstract:
Scaling inference compute has become a key driver of advanced reasoning in large language models (LLMs). A proven approach for scaling inference compute is to generate long chains-of-thought (CoTs), enabling models to engage in structured reasoning strategies such as backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the underlying mechanics of long CoT reasoning, examining the factors that enable models to generate extended reasoning trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we identify three key findings: 1) while SFT is not strictly necessary, it significantly simplifies training and improves efficiency; 2) reasoning capabilities tend to emerge with increased training compute but are not guaranteed, making reward shaping essential for stabilizing CoT length growth; and 3) scaling verifiable reward signals is critical for RL, and we find that leveraging noisy, web-extracted solutions with filtering mechanisms shows promising potential, particularly in out-of-distribution (OOD) reasoning tasks such as STEM problem-solving. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs.
Paperid:2103
Authors:Zonghao Chen · Masha Naslidnyk · Francois-Xavier Briol
Abstract:
This paper considers the challenging computational task of estimating nested expectations. Existing algorithms, such as nested Monte Carlo or multilevel Monte Carlo, are known to be consistent but require a large number of samples at both the inner and outer levels to converge. Instead, we propose a novel estimator consisting of nested kernel quadrature estimators, and we prove that it has a faster convergence rate than all baseline methods when the integrands have sufficient smoothness. We then demonstrate empirically that our proposed method does indeed require the fewest samples to estimate nested expectations across a range of real-world application areas, from Bayesian optimisation to option pricing and health economics.
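The baseline the abstract compares against, plain nested Monte Carlo, is a two-level average and can be written in a few lines. The sketch below is that generic baseline (the paper's method would replace both averages with weighted kernel quadrature rules); all names are our own.

```python
import random

def nested_mc(f, g, sample_theta, sample_x, n_outer, n_inner, seed=0):
    """Plain nested Monte Carlo estimate of E_theta[ f( E_X[ g(X, theta) ] ) ].

    The inner expectation is itself replaced by an average, so both
    n_outer and n_inner must grow for consistency -- the sample-hungry
    behaviour that nested kernel quadrature is designed to avoid.
    """
    rng = random.Random(seed)
    outer_sum = 0.0
    for _ in range(n_outer):
        theta = sample_theta(rng)
        inner = sum(g(sample_x(theta, rng), theta) for _ in range(n_inner)) / n_inner
        outer_sum += f(inner)
    return outer_sum / n_outer

# Toy check: theta ~ N(0, 1), X | theta ~ N(theta, 1), g(x, t) = x, f = square,
# so the nested expectation is E[theta^2] = 1.
est = nested_mc(
    f=lambda m: m * m,
    g=lambda x, t: x,
    sample_theta=lambda rng: rng.gauss(0.0, 1.0),
    sample_x=lambda t, rng: rng.gauss(t, 1.0),
    n_outer=2000, n_inner=200,
)
```

Note the bias: since `f` is applied to a noisy inner average, the toy estimate is inflated by roughly `1/n_inner`, which is why both sample sizes must grow jointly for the plain estimator to converge.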
Paperid:2104
Authors:Bingchen Miao · Yang Wu · Minghe Gao · Qifan Yu · Wendong Bu · Wenqiao Zhang · liyunfei · Siliang Tang · Tat-Seng Chua · Juncheng Li
Abstract:
The development of Generalist Virtual Agents (GVAs) powered by Multimodal Large Language Models (MLLMs) has shown significant promise in autonomous task execution. However, current training paradigms face critical limitations, including reliance on outcome supervision and labor-intensive human annotations. To address these challenges, we propose Similar, a Step-wise Multi-dimensional Generalist Reward Model, which offers fine-grained signals for agent training and can select better actions for inference-time scaling. Specifically, we begin by systematically defining five dimensions for evaluating agent actions. Building on this framework, we design an MCTS-P algorithm to automatically collect and annotate step-wise, five-dimensional agent execution data. Using this data, we train Similar with the Triple-M strategy. Furthermore, we introduce the first benchmark in the virtual agent domain for step-wise, multi-dimensional reward model training and evaluation, named SRM. This benchmark consists of two components: SRM-Train, which serves as the training set for Similar, and SRM-Eval, a manually selected test set for evaluating the reward model. Experimental results demonstrate that Similar, through its step-wise, multi-dimensional assessment and synergistic gain, provides GVAs with effective intermediate signals during both training and inference-time scaling. The code is available at https://anonymous.4open.science/r/Similar-F0F3.
Paperid:2105
Authors:Congcong Zhu · Xiaoyan Xu · Jiayue Han · Jingrun Chen
Abstract:
Autoregressive foundation models for partial differential equations (PDEs) have shown great potential in handling time-dependent data. However, these models suffer from a shortcut problem deeply rooted in autoregressive prediction, causing error accumulation. The challenge becomes particularly evident for out-of-distribution data, as pretraining performance may approach that of random model initialization for downstream tasks with long-term dynamics. To deal with this problem, we propose physics-informed temporal alignment (PITA), a self-supervised learning framework inspired by inverse problem solving. Specifically, PITA aligns the physical dynamics discovered at different time steps on each given PDE trajectory by integrating physics-informed constraints into the self-supervision signal. The alignment is derived from observation data without relying on known physics priors, indicating strong generalization to out-of-distribution data. Extensive experiments show that PITA significantly enhances the accuracy and robustness of existing foundation models on diverse time-dependent PDE data.
Paperid:2106
Authors:Emanuele Troiani · Hugo Cui · Yatin Dandi · FLORENT KRZAKALA · Lenka Zdeborova
Abstract:
In this manuscript, we study the learning of deep attention neural networks, defined as the composition of multiple self-attention layers with tied and low-rank weights. We first establish a mapping of such models to sequence multi-index models, a generalization of the widely studied multi-index model to sequential covariates, for which we establish a number of general results. In the context of Bayesian-optimal learning, in the limit of large dimension $D$ and commensurately large number of samples $N$, we derive a sharp asymptotic characterization of the optimal performance, as well as of the performance of the best-known polynomial-time algorithm for this setting, namely approximate message passing, and characterize sharp thresholds on the minimal sample complexity required for better-than-random prediction performance. Our analysis uncovers, in particular, how the different layers are learned sequentially. Finally, we discuss how this sequential learning can also be observed in a realistic setup.
Paperid:2107
Authors:Kaiwen Zheng · Yongxin Chen · Huayu Chen · Guande He · Ming-Yu Liu · Jun Zhu · Qinsheng Zhang
Abstract:
While likelihood-based generative models, particularly diffusion and autoregressive models, have achieved remarkable fidelity in visual generation, the maximum likelihood estimation (MLE) objective inherently suffers from a mode-covering tendency that limits generation quality under limited model capacity. In this work, we propose Direct Discriminative Optimization (DDO) as a unified framework that bridges likelihood-based generative training and the GAN objective to bypass this fundamental constraint. Our key insight is to parameterize a discriminator implicitly using the likelihood ratio between a learnable target model and a fixed reference model, drawing parallels with the philosophy of Direct Preference Optimization (DPO). Unlike GANs, this parameterization eliminates the need for joint training of generator and discriminator networks, allowing for direct, efficient, and effective fine-tuning of a well-trained model to its full potential beyond the limits of MLE. DDO can be performed iteratively in a self-play manner for progressive model refinement, with each round requiring less than 1\% of pretraining epochs. Our experiments demonstrate the effectiveness of DDO by significantly advancing the previous SOTA diffusion model EDM, reducing FID scores from 1.79/1.58 to new records of 1.30/0.99 on the CIFAR-10/ImageNet-64 datasets, and by consistently improving both guidance-free and CFG-enhanced FIDs of visual autoregressive models on ImageNet 256$\times$256.
Paperid:2108
Authors:Jacqueline Maasch · Alihan Hüyük · Xinnuo Xu · Aditya Nori · Javier Gonzalez
Abstract:
Causal reasoning and compositional reasoning are two core aspirations in generative AI. Measuring the extent of these behaviors requires principled evaluation methods. We explore a unified perspective that considers both behaviors simultaneously, termed compositional causal reasoning (CCR): the ability to infer how causal measures compose and, equivalently, how causal quantities propagate through graphs. We instantiate a framework for the systematic evaluation of CCR for the average treatment effect and the probability of necessity and sufficiency. As proof of concept, we demonstrate the design of CCR tasks for language models in the Llama, Phi, and GPT families. On a math word problem, CCR errors increased with the complexity of causal paths.
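As a toy illustration of the compositional idea (not the paper's evaluation framework): in a linear structural equation model along a chain X → M → Y, the average treatment effect of X on Y is the product of the edge-level effects, which can be checked numerically. The coefficients below are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
a, b = 2.0, -0.5                  # hypothetical edge coefficients
X = rng.normal(size=n)
M = a * X + rng.normal(size=n)    # X -> M
Y = b * M + rng.normal(size=n)    # M -> Y

def slope(u, v):
    # OLS slope of v on u; an unbiased estimate of the causal effect
    # here, since each structural equation has independent noise.
    return np.cov(u, v)[0, 1] / np.var(u)

effect_xm = slope(X, M)   # ≈ a
effect_my = slope(M, Y)   # ≈ b
effect_xy = slope(X, Y)   # ≈ a * b: the composed (total) effect
```

A CCR task in this spirit asks a model to infer `effect_xy` from the two edge effects, rather than estimating it directly.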
Paperid:2109
Authors:Junwei Deng · Weijing Tang · Jiaqi Ma
Abstract:
The influence function, a technique rooted in robust statistics, has been adapted in modern machine learning for a novel application: data attribution, which quantifies how individual training data points affect a model's predictions. However, the common derivation of influence functions in the data attribution literature is limited to loss functions that decompose into a sum of individual data point losses, the most prominent examples being M-estimators. This restricts the application of influence functions to more complex learning objectives, which we refer to as non-decomposable losses, such as contrastive or ranking losses, where a unit loss term depends on multiple data points and cannot be decomposed further. In this work, we bridge this gap by revisiting the general formulation of the influence function from robust statistics, which extends beyond M-estimators. Based on this formulation, we propose a novel method, the Versatile Influence Function (VIF), that can be straightforwardly applied to machine learning models trained with any non-decomposable loss. In comparison to the classical approach in statistics, the proposed VIF is designed to fully leverage the power of auto-differentiation, thereby eliminating the need for case-specific derivations of each loss function. We demonstrate the effectiveness of VIF across three examples: Cox regression for survival analysis, node embedding for network analysis, and listwise learning-to-rank for information retrieval. In all cases, the influence estimated by VIF closely resembles the results obtained by brute-force leave-one-out retraining, while being up to 1000 times faster to compute. We believe VIF represents a significant advancement in data attribution, enabling efficient influence-function-based attribution across a wide range of machine learning paradigms, with broad potential for practical use cases.
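The classical first-order influence approximation that this line of work builds on can be illustrated on ridge regression, where leave-one-out retraining is cheap enough to check directly. This is a minimal sketch of the textbook M-estimator case, not the paper's VIF estimator:

```python
import numpy as np

def fit_ridge(X, y, lam):
    # Minimizer of 0.5 * ||X w - y||^2 + 0.5 * lam * ||w||^2.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def influence_loo(X, y, lam, i):
    # First-order influence-function approximation to leaving point i out:
    # w_{-i} ≈ w + H^{-1} grad_i, with H the objective's Hessian and
    # grad_i the gradient of point i's loss term at the full-data solution.
    w = fit_ridge(X, y, lam)
    H = X.T @ X + lam * np.eye(X.shape[1])
    grad_i = X[i] * (X[i] @ w - y[i])
    return w + np.linalg.solve(H, grad_i)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
w_full = fit_ridge(X, y, 1.0)
w_if = influence_loo(X, y, 1.0, i=0)
w_exact = fit_ridge(np.delete(X, 0, axis=0), np.delete(y, 0), 1.0)
```

For this decomposable squared loss the approximation differs from exact retraining only by a leverage factor; the point of VIF is to extend the same Hessian-inverse-times-gradient recipe, via auto-differentiation, to losses that do not decompose per data point.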
Paperid:2110
Authors:Filip Ekström Kelvinius · Oskar Andersson · Abhijith Parackal · Dong Qian · Rickard Armiento · Fredrik Lindsten
Abstract:
Crystalline materials often exhibit a high level of symmetry. However, most generative models do not account for symmetry, but rather model each atom without any constraints on its position or element. We propose a generative model, Wyckoff Diffusion (WyckoffDiff), which generates symmetry-based descriptions of crystals. This is enabled by considering a crystal structure representation that encodes all symmetry, and we design a novel neural network architecture which enables using this representation inside a discrete generative model framework. In addition to respecting symmetry by construction, the discrete nature of our model enables fast generation. We additionally present a new metric, Fréchet Wrenformer Distance, which captures the symmetry aspects of the materials generated, and we benchmark WyckoffDiff against recently proposed generative models for crystal generation.
Paperid:2111
Authors:Filippo Bigi · Marcel Langer · Michele Ceriotti
Abstract:
The use of machine learning to estimate the energy of a group of atoms, and the forces that drive them to more stable configurations, has revolutionized the fields of computational chemistry and materials discovery. In this domain, rigorous enforcement of symmetry and conservation laws has traditionally been considered essential. For this reason, interatomic forces are usually computed as the derivatives of the potential energy, ensuring energy conservation. Several recent works have questioned this physically constrained approach, suggesting that directly predicting the forces yields a better trade-off between accuracy and computational efficiency, and that energy conservation can be learned during training. This work investigates the applicability of such non-conservative models in microscopic simulations. We identify and demonstrate several fundamental issues, from ill-defined convergence of geometry optimization to instability in various types of molecular dynamics. Contrary to the case of rotational symmetry, energy conservation is hard to learn, monitor, and correct for. The best approach to exploit the acceleration afforded by direct force prediction might be to use it in tandem with a conservative model, reducing, rather than eliminating, the additional cost of backpropagation while avoiding the pathological behavior associated with non-conservative forces.
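The conservative construction discussed above, forces as the negative gradient of a potential, can be illustrated on a toy pair potential. The potential below is an arbitrary illustration (not from the paper), and finite differences stand in for the backpropagation an ML potential would use:

```python
import numpy as np

def potential(x):
    # Toy 1-D chain: neighboring atoms prefer unit spacing.
    d = np.diff(x)
    return np.sum((d - 1.0) ** 2)

def conservative_forces(x, eps=1e-6):
    # F_i = -dE/dx_i via central finite differences; in an ML potential,
    # this derivative would come from backpropagation instead.
    f = np.zeros_like(x)
    for i in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[i] += eps
        xm[i] -= eps
        f[i] = -(potential(xp) - potential(xm)) / (2 * eps)
    return f

x = np.array([0.0, 0.9, 2.2, 3.0])
forces = conservative_forces(x)
```

Because these forces are an exact gradient of a translation-invariant energy, they sum to zero; a directly predicted force field offers no such guarantee, which is one source of the pathologies the paper documents.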
Paperid:2112
Authors:Zheng Li · Zeyu Liu · Feng Xie · Hao Zhang · Chunchen LIU · zhi geng
Abstract:
We tackle the problem of identifying whether a variable is the cause of a specified target using observational data. State-of-the-art causal learning algorithms that handle latent variables typically rely on identifying the global causal structure, often represented as a partial ancestral graph (PAG), to infer causal relationships. Although effective, these approaches are often redundant and computationally expensive when the focus is limited to a specific causal relationship. In this work, we introduce novel local characterizations that are necessary and sufficient for various types of causal relationships between two variables, enabling us to bypass the need for global structure learning. Leveraging these local insights, we develop efficient and fully localized algorithms that accurately identify causal relationships from observational data. We theoretically demonstrate the soundness and completeness of our approach. Extensive experiments on benchmark networks and real-world datasets further validate the effectiveness and efficiency of our method.
Paperid:2113
Authors:Karish Grover · Geoff Gordon · Christos Faloutsos
Abstract:
Does the intrinsic curvature of complex networks hold the key to unveiling graph anomalies that conventional approaches overlook? Reconstruction-based graph anomaly detection (GAD) methods overlook such geometric outliers, focusing only on structural and attribute-level anomalies. To this end, we propose CurvGAD - a mixed-curvature graph autoencoder that introduces the notion of curvature-based geometric anomalies. CurvGAD introduces two parallel pipelines for enhanced anomaly interpretability: (1) Curvature-equivariant geometry reconstruction, which focuses exclusively on reconstructing the edge curvatures using a mixed-curvature, Riemannian encoder and Gaussian kernel-based decoder; and (2) Curvature-invariant structure and attribute reconstruction, which decouples structural and attribute anomalies from geometric irregularities by regularizing graph curvature under discrete Ollivier-Ricci flow, thereby isolating the non-geometric anomalies. By leveraging curvature, CurvGAD refines the existing anomaly classifications and identifies new curvature-driven anomalies. Extensive experimentation over 10 real-world datasets (both homophilic and heterophilic) demonstrates an improvement of up to 6.5% over state-of-the-art GAD methods.
Paperid:2114
Authors:Xixi Wu · Yifei Shen · Fangzhou Ge · Caihua Shan · Yizhu Jiao · Xiangguo Sun · Hong Cheng
Abstract:
Node classification is a fundamental task in graph analysis, with broad applications across various fields. Recent breakthroughs in Large Language Models (LLMs) have enabled LLM-based approaches for this task. Although many studies demonstrate the impressive performance of LLM-based methods, the lack of clear design guidelines may hinder their practical application. In this work, we aim to establish such guidelines through a fair and systematic comparison of these algorithms. As a first step, we developed LLMNodeBed, a comprehensive codebase and testbed for node classification using LLMs. It includes ten datasets, eight LLM-based algorithms, and three learning paradigms, and is designed for easy extension with new methods and datasets. Subsequently, we conducted extensive experiments, training and evaluating over 2,200 models, to determine the key settings (e.g., learning paradigms and homophily) and components (e.g., model size) that affect performance. Our findings uncover eight insights, e.g., (1) LLM-based methods can significantly outperform traditional methods in a semi-supervised setting, while the advantage is marginal in a supervised setting; (2) Graph Foundation Models can beat open-source LLMs but still fall short of strong LLMs like GPT-4o in a zero-shot setting. We hope that the release of LLMNodeBed, along with our insights, will facilitate reproducible research and inspire future studies in this field.
Paperid:2115
Authors:Rachel Cummings · Alessandro Epasto · Jieming Mao · Tamalika Mukherjee · Tingting Ou · Peilin Zhong
Abstract:
The \emph{turnstile} continual release model of differential privacy captures scenarios where a privacy-preserving real-time analysis is sought for a dataset evolving through additions and deletions. In typical applications of real-time data analysis, both the length of the stream $T$ and the size of the universe $|\mathcal{U}|$ from which data come can be extremely large. This motivates the study of private algorithms in the turnstile setting using space sublinear in both $T$ and $|\mathcal{U}|$. In this paper, we give the first sublinear-space differentially private algorithm for the fundamental problem of counting distinct elements in the turnstile streaming model. Our algorithm achieves, on arbitrary streams, $\tilde{O}_{\eta}(T^{1/3})$ space and additive error, and a $(1+\eta)$-relative approximation for all $\eta \in (0,1)$. Our result significantly improves upon the space requirements of the state-of-the-art algorithms for this problem, which are linear, approaching the known $\Omega(T^{1/4})$ additive error lower bound for arbitrary streams. Moreover, when a bound $W$ on the number of times an item appears in the stream is known, our algorithm provides $\tilde{O}_{\eta}(\sqrt{W})$ additive error, using $\tilde{O}_{\eta}(\sqrt{W})$ space. This additive error asymptotically matches that of prior work, which instead required linear space. Our results address an open question posed by Jain et al. about designing low-memory mechanisms for this problem. We complement these results with a space lower bound for this problem, which shows that any algorithm using similar techniques must use space $\tilde{\Omega}(T^{1/3})$.
Paperid:2116
Authors:Haojian Huang · Haodong Chen · Shengqiong Wu · Meng Luo · Jinlan Fu · Xinya Du · Hanwang Zhang · Hao Fei
Abstract:
Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from misalignment with human intuition and video hallucination issues. To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on benchmarks such as Video Hallucination, Video QA, and Captioning performance tasks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination.
Paperid:2117
Authors:Kaheon Kim · Rentian Yao · Changbo Zhu · Xiaohui Chen
Abstract:
The optimal transport barycenter (a.k.a. Wasserstein barycenter) is a fundamental notion of averaging that extends from the Euclidean space to the Wasserstein space of probability distributions. Computation of the *unregularized* barycenter for discretized probability distributions on point clouds is a challenging task when the domain dimension $d > 1$. Most practical algorithms for approximating the barycenter problem are based on entropic regularization. In this paper, we introduce a nearly linear time $O(m \log{m})$ and linear space complexity $O(m)$ primal-dual algorithm, the *Wasserstein-Descent $\dot{\mathbb{H}}^1$-Ascent* (WDHA) algorithm, for computing the *exact* barycenter when the input probability density functions are discretized on an $m$-point grid. The key success of the WDHA algorithm hinges on alternating between two different yet closely related Wasserstein and Sobolev optimization geometries for the primal barycenter and dual Kantorovich potential subproblems. Under reasonable assumptions, we establish the convergence rate and iteration complexity of WDHA to its stationary point when the step size is appropriately chosen. Superior computational efficacy, scalability, and accuracy over the existing Sinkhorn-type algorithms are demonstrated on high-resolution (e.g., $1024 \times 1024$ images) 2D synthetic and real data.
Paperid:2118
Authors:Xiaoshuai Hao · Lingyu Liu · Yunfeng Diao · Rong Yin · Pengwei Wang · Jing Zhang · Lingdong Kong · Shu Zhao
Abstract:
Robust high-definition (HD) map construction is vital for autonomous driving, yet existing methods often struggle with incomplete multi-view camera data. This paper presents SafeMap, a novel framework specifically designed to ensure accuracy even when certain camera views are missing. SafeMap integrates two key components: the Gaussian-based Perspective View Reconstruction (G-PVR) module and the Distillation-based Bird's-Eye-View (BEV) Correction (D-BEVC) module. G-PVR leverages prior knowledge of view importance to dynamically prioritize the most informative regions based on the relationships among available camera views. Furthermore, D-BEVC utilizes panoramic BEV features to correct the BEV representations derived from incomplete observations. Together, these components facilitate comprehensive data reconstruction and robust HD map generation. SafeMap is easy to implement and integrates seamlessly into existing systems, offering a plug-and-play solution for enhanced robustness. Experimental results demonstrate that SafeMap significantly outperforms previous methods in both complete and incomplete scenarios, highlighting its superior performance and resilience. The source code will be released.
Paperid:2119
Authors:Belle Liu · Jacob I Sacks · Matthew Golub
Abstract:
Advances in neural recording technology now enable simultaneous recording of neural population activity across multiple brain regions, promising new insights into brain-wide information processing. These experiments have prompted the development of data-driven "communication models" that infer the signals communicated between recorded brain regions. Accurately identifying such communication is difficult because these signals are not directly observed in multi-region recordings, and typically, many different models can explain the observed data through differing accounts of communication. This challenge is amplified in expressive models, which are often needed to capture the nonlinear and non-stationary dynamics of neural population activity. To address these issues, we propose MR-LFADS, an expressive communication model that improves system identification by inferring region-specific inputs and cross-region communication under appropriate constraints. We show that MR-LFADS outperforms existing approaches at identifying communication signals in dozens of simulations of multi-region networks trained to perform a diverse set of cognitive neuroscience tasks. By advancing the analysis of multi-region neural recordings, MR-LFADS could provide a powerful tool for uncovering the principles of brain-wide information processing.
Paperid:2120
Authors:Cristiana Diaconu · Matthew Ashman · Eric Langezaal · Adrian Weller · Richard E Turner
Abstract:
Effective modelling of large-scale spatio-temporal datasets is essential for many domains, yet existing approaches often impose rigid constraints on the input data, such as requiring them to lie on fixed-resolution grids. With the rise of foundation models, the ability to process diverse, heterogeneous data structures is becoming increasingly important. Neural processes (NPs), particularly transformer neural processes (TNPs), offer a promising framework for such tasks, but struggle to scale to large spatio-temporal datasets due to the lack of an efficient attention mechanism. To address this, we introduce gridded pseudo-token TNPs, which employ specialised encoders and decoders to handle unstructured data and utilise a processor comprising gridded pseudo-tokens with efficient attention mechanisms. Furthermore, we develop equivariant gridded TNPs for applications where exact or approximate translation equivariance is a useful inductive bias, improving accuracy and training efficiency. Our method consistently outperforms a range of strong baselines in various synthetic and real-world regression tasks involving large-scale data, while maintaining competitive computational efficiency. Experiments with weather data highlight the potential of gridded TNPs and serve as just one example of a domain where they can have a significant impact.
Paperid:2121
Authors:Xinyi Wan · Penghui Qi · Guangxing Huang · Jialin Li · Min Lin
Abstract:
Pipeline parallelism (PP) is widely used for training large language models (LLMs), yet its scalability is often constrained by high activation memory consumption, as the number of in-flight micro-batches grows with the degree of PP. In this paper, we focus on addressing this challenge by leveraging the under-explored memory offload strategy in PP. Through empirical study, we discover that in the majority of standard configurations, at least half, and potentially all, of the activations can be offloaded with negligible overhead. In cases where full offload is not possible, we introduce a novel selective offload strategy that decreases peak activation memory in a better-than-linear manner. Furthermore, we integrate memory offload with other techniques to jointly consider overall throughput and memory limitations. Our experiments prove that the per-device activation memory effectively decreases with the total number of stages, making PP a stronger alternative than TP, offering up to a 19\% acceleration with even lower memory consumption.
Paperid:2122
Authors:Samidha Verma · Arushi Goyal · Ananya Mathur · Ankit Anand · Sayan Ranu
Abstract:
Graph Edit Distance (GED) is a widely used metric for measuring similarity between two graphs. Computing the optimal GED is NP-hard, leading to the development of various neural and non-neural heuristics. While neural methods have achieved improved approximation quality compared to non-neural approaches, they face significant challenges: (1) They require large amounts of ground truth data, which is itself NP-hard to compute. (2) They operate as black boxes, offering limited interpretability. (3) They lack cross-domain generalization, necessitating expensive retraining for each new dataset. We address these limitations with GRAIL, introducing a paradigm shift in this domain. Instead of training a neural model to predict GED, GRAIL employs a novel combination of large language models (LLMs) and automated prompt tuning to generate a program that is used to compute GED. This shift from predicting GED to generating programs imparts various advantages, including end-to-end interpretability and an autonomous self-evolutionary learning mechanism without ground-truth supervision. Extensive experiments on seven datasets confirm that GRAIL not only surpasses state-of-the-art GED approximation methods in prediction quality but also achieves robust cross-domain generalization across diverse graph distributions.
Paperid:2123
Authors:Yujin Kim · Hyunsoo Kim · Hyunwoo Kim · Suhyun Kim
Abstract:
Open-source pre-trained models hold great potential for diverse applications, but their utility declines without access to the training data. Data-Free Image Synthesis (DFIS) aims to generate images that approximate the learned data distribution of a pre-trained model in a data-free manner. However, existing methods usually produce samples that deviate significantly from the training data distribution due to the lack of prior knowledge about natural images. To overcome this limitation, we propose DDIS, the first Diffusion-assisted Data-free Image Synthesis method, which leverages a text-to-image diffusion model as a powerful image prior, improving synthetic image quality. DDIS extracts knowledge about the learned distribution from the given model and uses it to guide the diffusion model, enabling the generation of images that accurately align with the training data distribution. To achieve this, we introduce Domain Alignment Guidance (DAG), which aligns the synthetic data domain with the training data distribution during the diffusion sampling process. Additionally, we optimize a single Class Alignment Token (CAT) embedding to effectively capture class-specific attributes in the training dataset. Experiments on PACS and ImageNet demonstrate that DDIS outperforms existing DFIS methods by generating samples that better reflect the training data distribution, achieving state-of-the-art performance in data-free applications.
Paperid:2124
Authors:Chenyin Gao · Peter Gilbert · Larry Han
Abstract:
Standard conformal prediction ensures marginal coverage but consistently undercovers underrepresented groups, limiting its reliability for fair uncertainty quantification. Group fairness requires prediction sets to achieve a user-specified coverage level within each protected group. While group-wise conformal inference meets this requirement, it often produces excessively wide prediction sets due to limited sample sizes in underrepresented groups, highlighting a fundamental tradeoff between fairness and efficiency. To bridge this gap, we introduce Surrogate-Assisted Group-Clustered Conformal Inference (SAGCCI), a framework that improves efficiency through two key innovations: (1) clustering protected groups with similar conformal score distributions to enhance precision while maintaining fairness, and (2) deriving an efficient influence function that optimally integrates surrogate outcomes to construct tighter prediction sets. Theoretically, SAGCCI guarantees approximate group-conditional coverage in a doubly robust manner under mild convergence conditions, enabling flexible nuisance model estimation. Empirically, through simulations and an analysis of the phase 3 Moderna COVE COVID-19 vaccine trial, we demonstrate that SAGCCI outperforms existing methods, producing narrower prediction sets while maintaining valid group-conditional coverage, effectively balancing fairness and efficiency in uncertainty quantification.
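SAGCCI's clustering and surrogate machinery go beyond an abstract-level sketch, but the group-wise conformal baseline it improves upon is simple to state: each protected group gets its own calibration threshold. A minimal sketch under standard split-conformal assumptions (the score values below are arbitrary illustrations):

```python
import numpy as np

def group_thresholds(scores, groups, alpha=0.1):
    # Per-group split-conformal threshold: the conservative empirical
    # quantile at rank ceil((n_g + 1) * (1 - alpha)) of the calibration
    # nonconformity scores, guaranteeing >= 1 - alpha coverage per group.
    thresholds = {}
    for g in np.unique(groups):
        s = np.sort(scores[groups == g])
        n = len(s)
        k = int(np.ceil((n + 1) * (1 - alpha)))
        thresholds[g] = s[k - 1] if k <= n else np.inf
    return thresholds

scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0,  # group 0
                   0.15, 0.25, 0.35])                                  # group 1
groups = np.array([0] * 10 + [1] * 3)
th = group_thresholds(scores, groups, alpha=0.1)
```

With only 3 calibration points in group 1, the required rank exceeds the sample size and the threshold becomes infinite (an uninformative prediction set), which is precisely the small-group pathology that clustering similar groups is meant to alleviate.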
Paperid:2125
Authors:Chuang Liu · Hongyan Xu · Yichao Cao · Xiu Su · Zhe Qu · Tianfa Li · Shan An · Haogang Zhu
Abstract:
Medical imaging faces significant challenges in single-domain generalization (SDG) due to the diversity of imaging devices and the variability among data collection centers. To address these challenges, we propose \textbf{TinyMIG}, a framework designed to transfer generalization capabilities from vision foundation models to medical imaging SDG. TinyMIG aims to enable lightweight specialized models to mimic the strong generalization capabilities of foundation models in terms of both global feature distribution and local fine-grained details during training. Specifically, for global feature distribution, we propose a Global Distribution Consistency Learning strategy that mimics the prior distributions of the foundation model layer by layer. For local fine-grained details, we further design a Localized Representation Alignment method, which promotes semantic alignment and generalization distillation between the specialized model and the foundation model. These mechanisms collectively enable the specialized model to achieve robust performance in diverse medical imaging scenarios. Extensive experiments on large-scale benchmarks demonstrate that TinyMIG, with extremely low computational cost, significantly outperforms state-of-the-art models, showcasing its superior SDG capabilities. All the code and model weights will be publicly available.
Paperid:2126
Authors:Mohsen Dehghankar · Mahdi Erfanian · Abolfazl Asudeh
Abstract:
Despite their tremendous success and versatility, Deep Neural Networks (DNNs) such as Large Language Models (LLMs) suffer from inference inefficiency and rely on advanced computational infrastructure. To address these challenges and make these models more accessible and cost-effective, in this paper we propose algorithms to improve the inference time and memory efficiency of DNNs with binary and ternary weight matrices. Focusing particularly on matrix multiplication as the bottleneck operation of inference, we observe that, once trained, the weight matrices of a model no longer change. This allows us to preprocess these matrices and create indices that reduce the storage requirements by a logarithmic factor while enabling our efficient inference algorithms. Specifically, for an $n\times n$ weight matrix, our efficient algorithm guarantees a time complexity of $O(\frac{n^2}{\log n})$, a logarithmic-factor improvement over standard vector-matrix multiplication. Beyond the theoretical analysis, we conduct extensive experiments to evaluate the practical efficiency of our algorithms. Our results confirm the superiority of our approach with respect to both time and memory, as we observed a reduction in multiplication time of up to 29x and in memory usage of up to 6x. When applied to LLMs, our experiments show up to a 5.24x speedup in inference time.
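The abstract does not spell out its indexing scheme, but the flavor of such log-factor savings can be conveyed with a classic "mailman"-style trick for binary matrices (an illustrative sketch, not the paper's algorithm): within each block of $t$ columns, rows take at most $2^t$ distinct 0/1 patterns, so each pattern's contribution to the product can be computed once and shared by lookup.

```python
import numpy as np

def binary_matvec(W, x, t):
    # Multiply a 0/1 matrix W by a real vector x in blocks of t columns.
    # Per block: compute the dot product of every possible t-bit row
    # pattern with the matching slice of x (cost ~ 2^t * t), then assign
    # each row its result by table lookup (cost m). With t ~ log2(n),
    # this totals O(n^2 / log n) instead of O(n^2).
    m, n = W.shape
    y = np.zeros(m)
    for start in range(0, n, t):
        block = W[:, start:start + t]
        xs = x[start:start + t]
        k = block.shape[1]
        keys = block @ (1 << np.arange(k))            # row pattern -> integer
        patterns = (np.arange(2 ** k)[:, None] >> np.arange(k)) & 1
        y += (patterns @ xs)[keys]                    # shared partial sums
    return y

rng = np.random.default_rng(0)
W = rng.integers(0, 2, size=(6, 8))
x = rng.normal(size=8)
y = binary_matvec(W, x, t=3)
```

The integer `keys` array is the kind of precomputed index the abstract alludes to: it depends only on the frozen weights, so it can be built once after training and reused for every inference call.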
Paperid:2127
Authors:Prashanth Vijayaraghavan · Luyao Shi · Ehsan Degan · Vandana Mukherjee · Xin Zhang
Abstract:
Analog circuit topology synthesis is integral to Electronic Design Automation (EDA), enabling the automated creation of circuit structures tailored to specific design requirements. However, the vast design search space and strict constraint adherence make efficient synthesis challenging. Leveraging the versatility of Large Language Models (LLMs), we propose AUTOCIRCUIT-RL, a novel reinforcement learning (RL)-based framework for automated analog circuit synthesis. The framework operates in two phases: instruction tuning, where an LLM learns to generate circuit topologies from structured prompts encoding design constraints, and RL refinement, where reward models iteratively optimize circuit topologies by evaluating and adjusting designs to ensure adherence to constraints like validity, efficiency, and expected output voltage. Empirical results show that AUTOCIRCUIT-RL generates ~12% more valid circuits and improves efficiency by ~14% compared to the best baselines, while reducing duplicate generation rates by ~38%. It achieves over 60% success in synthesizing valid circuits with limited training data, demonstrating strong generalization. These findings highlight the framework's effectiveness in scaling to complex circuits while maintaining efficiency and constraint adherence, marking a significant advancement in AI-driven circuit design.
Paperid:2128
Authors:Jihao Andreas Lin · Sebastian Ament · Maximilian Balandat · David Eriksson · Jose Miguel Hernandez-Lobato · Eytan Bakshy
Abstract:
Applying Gaussian processes (GPs) to very large datasets remains a challenge due to limited computational scalability. Matrix structures such as the Kronecker product can accelerate operations significantly, but their application commonly entails approximations or unrealistic assumptions. For example, a kernel matrix associated with a product kernel can be expressed as an exact Kronecker product if the data is gridded. However, this exact structure is destroyed when observations are missing from the grid, a frequent occurrence in real-world applications such as time series. To lift this limitation, we propose a latent Kronecker structure, which expresses the covariance matrix of observed values as the projection of a latent Kronecker product. In combination with iterative linear system solvers, our method permits inference of the exact GP model while requiring substantially fewer computational resources than standard iterative methods. We demonstrate that our method outperforms state-of-the-art sparse and variational GPs on real-world datasets with up to five million examples, including robotics, AutoML, and climate applications.
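The "projection of a latent Kronecker product" can be sketched as a matrix-vector product that never forms the Kronecker product explicitly: scatter the observed values onto the full grid, apply the Kronecker factors via reshaping, and gather the observed entries back. This is a simplified NumPy illustration (the boolean `mask` standing in for the projection P is our own device, not the paper's implementation).

```python
import numpy as np

def kron_matvec(A, B, v):
    # (A kron B) @ v without forming the Kronecker product explicitly:
    # with row-major vec, (A kron B) vec(V) = vec(A V B^T)
    V = v.reshape(A.shape[0], B.shape[0])
    return (A @ V @ B.T).reshape(-1)

def projected_kron_matvec(A, B, mask, w):
    # covariance of observed entries acts as P (A kron B) P^T,
    # where P selects the grid points with mask == True
    v = np.zeros(mask.size)
    v[mask] = w                        # P^T w: scatter observations onto the grid
    return kron_matvec(A, B, v)[mask] # then gather the observed entries
```

Combined with an iterative solver such as conjugate gradients, such a matvec is all that is needed for exact GP inference on the observed entries.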
Paperid:2129
Authors:Alexander Bienstock · Ujjwal Kumar · Antigoni Polychroniadou
Abstract:
Federated Learning (FL) solutions with central Differential Privacy (DP) have seen large improvements in their utility in recent years arising from the matrix mechanism, while FL solutions with local (more private) DP have lagged behind. In this work, we introduce the distributed matrix mechanism to achieve the best of both worlds: the better privacy of local DP and the better utility of the matrix mechanism. We accomplish this using a novel cryptographic protocol that securely transfers sensitive values across rounds with constant communication overhead. This protocol accommodates the dynamic participation of users required by FL, including those that may drop out of the computation. We provide experiments which show that our mechanism indeed significantly improves the utility of FL models compared to previous local DP mechanisms, with little added overhead.
Paperid:2130
Authors:Rongzhe Wei · Mufei Li · Mohsen Ghassemi · Eleonora Kreacic · Yifan Li · Xiang Yue · Bo Li · Vamsi Potluru · Pan Li · Eli Chien
Abstract:
Large Language Models (LLMs) embed sensitive, human-generated data, prompting the need for unlearning methods. Although certified unlearning offers strong privacy guarantees, its restrictive assumptions make it unsuitable for LLMs, giving rise to various heuristic approaches typically assessed through empirical evaluations. These standard evaluations randomly select data for removal, apply unlearning techniques, and use membership inference attacks (MIAs) to compare unlearned models against models retrained without the removed data. However, to ensure robust privacy protections for every data point, it is essential to account for scenarios in which certain data subsets face elevated risks. Prior research suggests that outliers, particularly data tied to minority groups, often exhibit higher memorization propensity, which indicates they may be more difficult to unlearn. Building on these insights, we introduce a complementary, minority-aware evaluation framework to highlight blind spots in existing frameworks. We substantiate our findings with carefully designed experiments, using canaries with personally identifiable information (PII) to represent these minority subsets, and demonstrate that they suffer at least 20\% higher privacy leakage across various unlearning methods, MIAs, datasets, and LLM scales. Our proposed minority-aware evaluation framework marks an essential step toward more equitable and comprehensive assessments of LLM unlearning efficacy.
Paperid:2131
Authors:Zhuoying Li · Zhu Xu · Yuxin Peng · Yang Liu
Abstract:
Instruction-based image editing, which aims to modify an image faithfully according to an instruction while keeping irrelevant content unchanged, has made significant progress. However, there is still no comprehensive metric for assessing editing quality. Existing metrics either incur the high costs of human evaluation, which hinders large-scale assessment, or are adapted from other tasks and lose task-specific concerns, failing to comprehensively evaluate both the modification required by the instruction and the preservation of irrelevant regions, resulting in biased evaluation. To tackle this, we introduce a novel metric, Balancing Preservation Modification (BPM), tailored for instruction-based image editing, which explicitly disentangles the image into editing-relevant and irrelevant regions for separate consideration. We first identify and locate editing-relevant regions, followed by a two-tier process to assess editing quality: the Region-Aware Judge evaluates whether the position and size of the edited region align with the instruction, and the Semantic-Aware Judge further assesses instruction compliance within editing-relevant regions as well as content preservation within irrelevant regions, yielding comprehensive and interpretable quality assessment. Moreover, the editing-relevant region localization in BPM can be integrated into image editing approaches to improve editing quality, demonstrating its wide applicability. We verify the effectiveness of BPM on comprehensive instruction-editing data, and the results show that it yields the highest alignment with human evaluation compared to existing metrics, indicating its efficacy.
Paperid:2132
Authors:Huizhuo Yuan · Yifeng Liu · Shuang Wu · zhou Xun · Quanquan Gu
Abstract:
Training deep neural networks, and more recently large models, demands efficient and scalable optimizers. Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this task. Despite the development of numerous variance reduction algorithms in the past decade aimed at accelerating stochastic optimization in both convex and nonconvex settings, variance reduction has not found widespread success in training deep neural networks or large language models. Consequently, it has remained a less favored approach in modern AI. In this paper, to unleash the power of variance reduction for efficient training of large models, we propose a unified optimization framework, MARS (Make vAriance Reduction Shine), which reconciles preconditioned gradient methods with variance reduction via a scaled stochastic recursive momentum technique. Within our framework, we introduce three instances of MARS that leverage preconditioned gradient updates based on AdamW, Lion, and Shampoo, respectively. We also draw a connection between our algorithms and existing optimizers. Experimental results on training GPT-2 models indicate that MARS consistently outperforms AdamW by a large margin.
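The scaled stochastic recursive momentum idea can be sketched roughly as below: the current gradient is corrected by a scaled difference with the previous gradient before feeding the usual adaptive moments. This is our own toy rendering, not the paper's MARS-AdamW instance; in particular, MARS evaluates both gradients on the same sample, whereas the sketch simply reuses the stored gradient from the last step, and all constants are illustrative.

```python
import numpy as np

def mars_like_step(params, grad_fn, state, lr=0.1, beta1=0.9, beta2=0.99,
                   gamma=0.05, eps=1e-8):
    # one step of a variance-reduced AdamW-style update (simplified sketch)
    g = grad_fn(params)
    g_prev = state.get("g_prev", g)
    c = g + gamma * (beta1 / (1.0 - beta1)) * (g - g_prev)   # recursive correction
    m = beta1 * state.get("m", 0.0) + (1.0 - beta1) * c      # first moment
    v = beta2 * state.get("v", 0.0) + (1.0 - beta2) * c * c  # second moment
    new_params = params - lr * m / (np.sqrt(v) + eps)
    return new_params, {"g_prev": g, "m": m, "v": v}
```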
Paperid:2133
Authors:Tao Wang · Ruipeng Zhang · Sicun Gao
Abstract:
Modern policy gradient algorithms, such as TRPO and PPO, outperform vanilla policy gradient in many RL tasks. Questioning the common belief that enforcing approximate trust regions leads to steady policy improvement in practice, we show that the more critical factor is the enhanced value estimation accuracy from more value update steps in each iteration. To demonstrate, we show that by simply increasing the number of value update steps per iteration, vanilla policy gradient itself can achieve performance comparable to or better than PPO in all the standard continuous control benchmark environments. Importantly, this simple change to vanilla policy gradient is significantly more robust to hyperparameter choices, opening up the possibility that RL algorithms may still become more effective and easier to use.
Paperid:2134
Authors:Yao Shu · Qixin Zhang · Kun He · Zhongxiang Dai
Abstract:
Zeroth-order (ZO) optimization plays an essential role in scenarios where gradient information is inaccessible or unaffordable, such as black-box systems and resource-constrained environments. While existing adaptive methods such as ZO-AdaMM have shown promise, they are fundamentally limited by their underutilization of moment information during optimization, usually resulting in underperforming convergence. To overcome these limitations, this paper introduces Refined Adaptive Zeroth-Order Optimization (R-AdaZO). Specifically, we first show the untapped variance reduction effect of the first moment estimate on ZO gradient estimation, which improves the accuracy and stability of ZO updates. We then refine the second moment estimate based on these variance-reduced gradient estimates to better capture the geometry of the optimization landscape, enabling more effective scaling of ZO updates. We present rigorous theoretical analysis showing (a) the first analysis of the variance reduction of the first moment estimate in ZO optimization, (b) the improved second moment estimates with a more accurate approximation of their variance-free ideal, (c) the first variance-aware convergence framework for adaptive ZO methods, which may be of independent interest, and (d) the faster convergence of R-AdaZO compared to existing baselines like ZO-AdaMM. Our extensive experiments, including synthetic problems, black-box adversarial attacks, and memory-efficient fine-tuning of large language models (LLMs), further verify the superior convergence of R-AdaZO, indicating that R-AdaZO offers an improved solution for real-world ZO optimization challenges.
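The core refinement, building the second moment from the variance-reduced first moment rather than from the raw ZO estimate, can be sketched as follows. Function names, constants, and the two-point estimator are our own toy illustration, not the paper's algorithm.

```python
import numpy as np

def zo_grad(f, x, mu, rng):
    # two-point zeroth-order gradient estimate along a random direction u
    u = rng.standard_normal(x.shape)
    return (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u

def r_adazo_like_step(f, x, state, lr=0.05, b1=0.9, b2=0.99, mu=1e-3, eps=1e-8):
    g = zo_grad(f, x, mu, state["rng"])
    m = b1 * state.get("m", 0.0) + (1 - b1) * g
    # the refinement sketched here: the second moment tracks the
    # variance-reduced first moment m, not the raw noisy estimate g
    v = b2 * state.get("v", 0.0) + (1 - b2) * m * m
    x = x - lr * m / (np.sqrt(v) + eps)
    state.update(m=m, v=v)
    return x, state
```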
Paperid:2135
Authors:Jialin Zhao · Yingtao Zhang · Xinghang Li · Huaping Liu · Carlo Cannistraci
Abstract:
The growing demands on GPU memory posed by the increasing number of neural network parameters call for training approaches that are more memory-efficient. Previous memory-reduction training techniques, such as Low-Rank Adaptation (LoRA) and ReLoRA, face challenges, with LoRA being constrained by its low-rank structure, particularly during intensive tasks like pre-training, and ReLoRA suffering from saddle point issues. In this paper, we propose Sparse Spectral Training (SST) to optimize memory usage for pre-training. SST updates all singular values and selectively updates singular vectors through a multinomial sampling method weighted by the magnitude of the singular values. Furthermore, SST employs singular value decomposition to initialize and periodically reinitialize low-rank parameters, reducing distortion relative to full-rank training compared to other low-rank methods. Through comprehensive testing on both Euclidean and hyperbolic neural networks across various tasks, including natural language generation, machine translation, node classification, link prediction, and image classification, SST demonstrates its ability to outperform existing memory-reduction training methods and is comparable to full-rank training in various cases. On LLaMA-1.3B, with only 18.7\% of the parameters trainable compared to full-rank training (using a rank equivalent to 6\% of the embedding dimension), SST reduces the perplexity gap between other low-rank methods and full-rank training by 97.4\%. This result highlights SST as an effective parameter-efficient technique for model pre-training, offering a promising new paradigm for achieving scalable and memory-efficient neural network training.
Paperid:2136
Authors:Zihao Xu · Dawei xu · Zihan Li · Chuan Zhang
Abstract:
Generative image steganography (GIS) is an emerging technique that conceals secret messages when generating images. Compared with GAN- or Flow-based GIS schemes, diffusion-model-based solutions can provide high-quality and more diverse generation, thus receiving considerable attention recently. However, the state-of-the-art GIS schemes still have issues with extraction accuracy, robustness, efficiency, and practicality. To address these issues, this paper proposes a practical and robust message-driven GIS framework, called MDDM. MDDM uses a carefully designed encoding strategy to encode the binary message and uses it as the starting point for diffusion, allowing users to generate diverse artistic images through controllable conditional text input without requiring additional training. During the information extraction process, the receiver only needs to use the secretly shared Cardan Grille to accurately perform diffusion inversion and recover the binary data, without knowing the seed or prompt. Experimental results show the advantages of MDDM in accuracy, controllability, efficiency, practicality, and security. Through flexible steganography strategies, MDDM can always achieve almost 100\% accuracy. In addition, MDDM applies to both conditional and unconditional diffusion.
Paperid:2137
Authors:Alaa Anani · Tobias Lorenz · Mario Fritz · Bernt Schiele
Abstract:
Deep learning explainability is essential to understand model predictions. Post hoc attribution methods determine which image pixels most influence a classifier prediction, but are non-robust against imperceptible noise, undermining their trustworthiness. Certified attribution methods should prove that pixel-level importance values are robust; however, prior work only provides image-level bounds, which are too coarse. We introduce a certification approach that guarantees the pixel-level robustness of any black-box attribution method via Randomized Smoothing against $\ell_2$-bounded input noise. By sparsifying then smoothing attributions, we reformulate the setup into a segmentation certification problem. We propose novel qualitative and quantitative metrics to assess the certified robustness of three families of attribution methods. Visualizing pixel-level certificates nicely complements the visual nature of attribution methods by providing a reliable certified output for downstream tasks. For quantitative evaluation, we introduce two metrics: (i) the percentage of certified pixels, measuring robustness, and (ii) certified localization. Our extensive experiments on ImageNet show high-quality certified outputs across attribution methods, comparing their certified visuals, robustness, and localization performance.
Paperid:2138
Authors:Mingchen Zhuge · Changsheng Zhao · Dylan Ashley · Wenyi Wang · Dmitrii Khizbullin · Yunyang Xiong · Zechun Liu · Ernie Chang · Raghuraman Krishnamoorthi · Yuandong Tian · Yangyang Shi · Vikas Chandra · Jürgen Schmidhuber
Abstract:
Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes, ignoring the step-by-step nature of the thinking done by agentic systems, or require excessive manual labour. To address this, we introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is a natural extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process for more precise evaluations. We apply the Agent-as-a-Judge framework to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic AI code generation tasks. DevAI includes rich manual annotations, such as a total of 365 hierarchical solution requirements, which make it particularly suitable for an agentic evaluator. We benchmark three of the top code-generating agentic systems using Agent-as-a-Judge and find that our framework dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether, we believe that this work represents a concrete step towards enabling vastly more sophisticated agentic systems.
Paperid:2139
Authors:Ya-Chi Chu · Wenzhi Gao · Yinyu Ye · Madeleine Udell
Abstract:
This paper investigates the convergence properties of the hypergradient descent method ($\texttt{HDM}$), a 25-year-old heuristic originally proposed to select stepsizes for stochastic first-order methods adaptively. We provide the first rigorous convergence analysis of $\texttt{HDM}$ with a recently developed online learning framework. Notably, $\texttt{HDM}$ automatically identifies the optimal stepsize for the local optimization landscape and achieves local superlinear convergence. Our analysis explains the instabilities of $\texttt{HDM}$ reported in the literature and proposes efficient strategies to address them. We also develop two practical $\texttt{HDM}$ variants incorporating heavy-ball and Nesterov momentum. Experiments on deterministic convex problems show that $\texttt{HDM}$ with heavy-ball momentum ($\texttt{HDM-HB}$) exhibits robust performance that significantly outperforms other adaptive first-order methods. In particular, $\texttt{HDM-HB}$ often matches the performance of $\texttt{L-BFGS}$, an efficient and practical quasi-Newton method, using less memory and cheaper iterations.
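The underlying heuristic, adapting the stepsize with the inner product of consecutive gradients, can be sketched as below. Since $x_t = x_{t-1} - \alpha g_{t-1}$, the derivative of the loss with respect to $\alpha$ is $-g_t \cdot g_{t-1}$, so the stepsize itself is updated by gradient descent. This is a minimal illustration of plain hypergradient descent, without the paper's online-learning analysis or momentum variants; all parameter names and defaults are ours.

```python
import numpy as np

def hdm(grad_fn, x0, alpha0=0.01, beta=0.001, steps=300):
    # plain hypergradient descent on both the iterate x and its stepsize alpha
    x = np.asarray(x0, dtype=float)
    alpha, g_prev = alpha0, np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x)
        alpha += beta * float(g @ g_prev)  # stepsize update from the hypergradient
        x = x - alpha * g
        g_prev = g
    return x, alpha
```

On a quadratic such as f(x) = 0.5 * ||x||^2 (gradient x), alpha grows while consecutive gradients stay aligned and shrinks after overshoot, settling near a workable stepsize.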
Paperid:2140
Authors:Langzhang Liang · Fanchen Bu · Zixing Song · Zenglin Xu · Shirui Pan · Kijung Shin
Abstract:
The message-passing paradigm of Graph Neural Networks often struggles with exchanging information across distant nodes, typically due to structural bottlenecks in certain graph regions, a limitation known as over-squashing. To reduce such bottlenecks, graph rewiring, which modifies graph topology, has been widely used. However, existing graph rewiring techniques often overlook the need to preserve critical properties of the original graph, e.g., spectral properties. Moreover, many approaches rely on increasing edge count to improve connectivity, which introduces significant computational overhead and exacerbates the risk of over-smoothing. In this paper, we propose a novel graph-rewiring method that leverages spectral graph sparsification for mitigating over-squashing. Specifically, our method generates graphs with enhanced connectivity while maintaining sparsity and largely preserving the original graph spectrum, effectively balancing structural bottleneck reduction and graph property preservation. Experimental results validate the effectiveness of our approach, demonstrating its superiority over strong baseline methods in classification accuracy and retention of the Laplacian spectrum.
Paperid:2141
Authors:Haorui Wang · Jeff Guo · Lingkai Kong · Rampi Ramprasad · Philippe Schwaller · Yuanqi Du · Chao Zhang
Abstract:
Retrosynthesis, the process of breaking down a target molecule into simpler precursors through a series of valid reactions, stands at the core of organic chemistry and drug development. Although recent machine learning (ML) research has advanced single-step retrosynthetic modeling and subsequent route searches, these solutions remain restricted by the extensive combinatorial space of possible pathways. Concurrently, large language models (LLMs) have exhibited remarkable chemical knowledge, hinting at their potential to tackle complex decision-making tasks in chemistry. In this work, we explore whether LLMs can successfully navigate the highly constrained, multi-step retrosynthesis planning problem. We introduce an efficient scheme for encoding reaction pathways and present a new route-level search strategy, moving beyond the conventional step-by-step reactant prediction. Through comprehensive evaluations, we show that our LLM-augmented approach excels at retrosynthesis planning and extends naturally to the broader challenge of synthesizable molecular design.
Paperid:2142
Authors:Chengyi Cai · Zesheng Ye · Lei Feng · Jianzhong Qi · Feng Liu
Abstract:
Model reprogramming adapts pre-trained models to downstream tasks by modifying only the input and output spaces. Visual reprogramming (VR) is one instance for vision tasks that adds a trainable noise pattern (i.e., a visual prompt) to input images to facilitate downstream classification. Existing VR approaches for CLIP train a single visual prompt using all descriptions of different downstream classes. However, the limited learning capacity may result in (1) a failure to capture diverse aspects of the descriptions (e.g., shape, color, and texture), and (2) a possible bias toward less informative attributes that do not help distinguish between classes. In this paper, we introduce a decoupling-and-reweighting framework. Our decoupled visual prompts (DVP) are optimized using descriptions grouped by explicit causes (DVP-cse) or unsupervised clusters (DVP-cls). Then, we integrate the outputs of these visual prompts with a probabilistic reweighting matrix (PRM) that measures their contributions to each downstream class. Theoretically, DVP lowers the empirical risk bound. Experimentally, DVP outperforms baselines on average across 11 downstream datasets. Notably, the DVP-PRM integration enables insights into how individual visual prompts influence classification decisions, providing a probabilistic framework for understanding reprogramming.
Paperid:2143
Authors:Ching-Hua Lee · Chouchang Yang · Jaejin Cho · Yashas Malur Saidutta · Rakshith Sharma Srinivasa · Yilin Shen · Hongxia Jin
Abstract:
Denoising diffusion probabilistic models (DDPMs) can be utilized for recovering a clean signal from its degraded observation(s) by conditioning the model on the degraded signal. The degraded signals are themselves contaminated versions of the clean signals; due to this correlation, they may encompass certain useful information about the target clean data distribution. However, existing adoption of the standard Gaussian as the prior distribution in turn discards such information, resulting in suboptimal performance. In this paper, we propose to improve conditional DDPMs for signal restoration by leveraging a more informative prior that is jointly learned with the diffusion model. The proposed framework, called RestoreGrad, seamlessly integrates DDPMs into the variational autoencoder framework and exploits the correlation between the degraded and clean signals to encode a better diffusion prior. On speech and image restoration tasks, we show that RestoreGrad demonstrates faster convergence (5-10 times fewer training steps) to achieve better quality of restored signals over existing DDPM baselines, and improved robustness to using fewer sampling steps in inference time (2-2.5 times fewer), advocating the advantages of leveraging jointly learned prior for efficiency improvements in the diffusion process.
Paperid:2144
Authors:Chia-Han Yeh · Tse-Sheng Nan · Risto Vuorio · Wei Hung · Hung Wu · Shao-Hua Sun · Ping-Chun Hsieh
Abstract:
Policy learning under action constraints plays a central role in ensuring safe behaviors in various robot control and resource allocation applications. In this paper, we study a new problem setting termed Action-Constrained Imitation Learning (ACIL), where an action-constrained imitator aims to learn from a demonstrative expert with a larger action space. The fundamental challenge of ACIL lies in the unavoidable mismatch of occupancy measure between the expert and the imitator caused by the action constraints. We tackle this mismatch through trajectory alignment and propose DTWIL, which replaces the original expert demonstrations with a surrogate dataset that follows similar state trajectories while adhering to the action constraints. Specifically, we recast trajectory alignment as a planning problem and solve it via Model Predictive Control, which aligns the surrogate trajectories with the expert trajectories based on the Dynamic Time Warping (DTW) distance. Through extensive experiments, we demonstrate that learning from the dataset generated by DTWIL significantly enhances performance across multiple robot control tasks and outperforms various benchmark imitation learning algorithms in terms of sample efficiency.
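The Dynamic Time Warping distance used for trajectory alignment is the standard dynamic program shown below (a minimal sketch of the classic algorithm; DTWIL itself couples this distance with Model Predictive Control, which is not shown).

```python
import numpy as np

def dtw_distance(s, t):
    # classic O(len(s) * len(t)) dynamic program for the DTW distance
    n, m = len(s), len(t)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.atleast_1d(s[i - 1]) - np.atleast_1d(t[j - 1]))
            # best of: skip in s, skip in t, or advance in both
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Unlike a pointwise Euclidean distance, DTW tolerates trajectories of different lengths and speeds, which is what allows surrogate trajectories to track expert state sequences under tighter action constraints.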
Paperid:2145
Authors:Maheed Ahmed · Jayanth Bhargav · Mahsa Ghasemi
Abstract:
Representation learning plays a crucial role in reinforcement learning, especially in complex environments with high-dimensional and unstructured states. Effective representations can enhance the efficiency of learning algorithms by improving sample efficiency and generalization across tasks. This paper considers the Laplacian-based framework for representation learning, where the eigenvectors of the Laplacian matrix of the underlying transition graph are leveraged to encode meaningful features from raw sensory observations of the states. Despite the promising algorithmic advances in this framework, it remains an open question whether Laplacian-based representations can be learned online, with theoretical guarantees, alongside policy learning. To answer this question, we study online Laplacian-based representation learning, where the graph-based representation is updated simultaneously while the policy is updated by the reinforcement learning algorithm. We design an online optimization formulation by introducing the Asymmetric Graph Drawing Objective (AGDO) and provide a theoretical analysis of the convergence of running online projected gradient descent on AGDO under mild assumptions. Specifically, we show that if the policy learning algorithm induces a bounded drift on the policy, running online projected gradient descent on AGDO exhibits ergodic convergence. Our extensive simulation studies empirically validate the guarantees of convergence to the true Laplacian representation. Furthermore, we provide insights into the compatibility of different reinforcement learning algorithms with online representation learning.
Paperid:2146
Authors:Songlin Xu · Xinyu Zhang
Abstract:
Using deep neural networks as computational models to simulate cognitive processes can provide key insights into human behavioral dynamics. Challenges arise when environments are highly dynamic, obscuring stimulus-behavior relationships. However, the majority of current research focuses on simulating human cognitive behaviors under ideal conditions, neglecting the influence of environmental disturbances. We propose CogReact, which integrates drift-diffusion with deep reinforcement learning to simulate the granular effects of dynamic environmental stimuli on the human cognitive process. Quantitatively, it improves cognitive modeling by considering the temporal effect of environmental stimuli on the cognitive process and captures both subject-specific and stimuli-specific behavioral differences. Qualitatively, it captures general trends in the human cognitive process under stimuli better than baselines. Our approach is examined across diverse environmental influences on various cognitive tasks. Overall, it demonstrates a powerful, data-driven methodology to simulate, align with, and understand the vagaries of human cognitive responses in dynamic contexts.
Paperid:2147
Authors:Iurii Kemaev · Dan Andrei Calian · Luisa Zintgraf · Gregory Farquhar · Hado van Hasselt
Abstract:
Gradient-based bilevel optimisation is a powerful technique with applications in hyperparameter optimisation, task adaptation, algorithm discovery, meta-learning more broadly, and beyond. It often requires differentiating through the gradient-based optimisation process itself, leading to "gradient-of-a-gradient" calculations with computationally expensive second-order and mixed derivatives. While modern automatic differentiation libraries provide a convenient way to write programs for calculating these derivatives, they often cannot fully exploit the specific structure of these problems out of the box, leading to suboptimal performance. In this paper, we analyse such cases and propose Mixed-Flow Meta-Gradients, or MixFlow-MG, a practical algorithm that uses mixed-mode differentiation to construct more efficient and scalable computational graphs, yielding over 10x memory and up to 25\% wall-clock time improvements over standard implementations in modern meta-learning setups.
Paperid:2148
Authors:Zhiruo Wang · Jiayuan Mao · Daniel Fried · Graham Neubig
Abstract:
Despite the potential of language model-based agents to solve real-world tasks such as web navigation, current methods still struggle with long-horizon tasks with complex action trajectories. In contrast, humans can flexibly solve complex tasks by learning reusable task workflows from past experiences and using them to guide future actions. To build agents that can similarly benefit from this process, we introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows, and selectively providing workflows to the agent to guide subsequent generations. AWM flexibly applies to both offline and online scenarios, where agents induce workflows from training examples beforehand or from test queries on the fly. We experiment on two major web navigation benchmarks, Mind2Web and WebArena, that collectively cover 1000+ tasks from 200+ domains across travel, shopping, and social media, among others. AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena while reducing the number of steps taken to solve WebArena tasks successfully. Furthermore, online AWM robustly generalizes in cross-task, cross-website, and cross-domain evaluations, surpassing baselines by 8.9 to 14.0 absolute points as train-test task distribution gaps widen.
Paperid:2149
Authors:Yang Luo · Zangwei Zheng · Ziheng Qin · Zirui Zhu · Yong Liu · Yang You
Abstract:
Large-batch training has become a cornerstone in accelerating the training of deep neural networks, yet it poses challenges in optimization and generalization. Existing optimizers like AdamW present performance degradation during language models' large-batch training, due to the information bottleneck of attention layers caused by the sharp increase of the max attention logit. While the LAMB optimizer partially addresses this issue, some attention layers still face it. The reason is that $l_2$-norm-based trust ratios in LAMB are less effective in directly influencing the max value of query/key weights. Furthermore, the weight-wise trust ratio in LAMB is error-prone because it overlooks relationships among weight values within rows or columns. Building on these observations, we propose a novel optimizer, MERIT, which leverages the max-norm to calculate the trust ratio, constraining the max attention logit more effectively. Moreover, we construct element-wise trust ratios to provide more robust update scaling by focusing on local weight structures. Extensive experiments on large-batch training across various sizes of GPT-2 models demonstrate the superior performance of MERIT. Notably, during the training of GPT-2 Medium, MERIT enables a 6k batch size without any performance degradation compared to the standard batch size (480) with 48B training tokens. This work highlights the importance of considering the max attention logit and finer-granularity trust ratio calculation in large-batch training. It improves training stability and paves the way for larger batch usage, enabling faster development and iteration on large language models.
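The contrast between a max-norm trust ratio and LAMB's $l_2$ version can be sketched as below. This is an illustration under our own simplifications (function names, eps, and the row/column combination rule are ours), not the exact MERIT update.

```python
import numpy as np

def maxnorm_trust_ratio(weight, update, eps=1e-8):
    # layer-wise trust ratio using the max-norm in place of LAMB's l2 norm,
    # which bounds the largest weight change and hence the max attention logit
    return np.max(np.abs(weight)) / (np.max(np.abs(update)) + eps)

def merit_like_scale(weight, update, eps=1e-8):
    # element-wise ratios over rows and columns for more local update scaling
    row = np.max(np.abs(weight), axis=1, keepdims=True) / (
          np.max(np.abs(update), axis=1, keepdims=True) + eps)
    col = np.max(np.abs(weight), axis=0, keepdims=True) / (
          np.max(np.abs(update), axis=0, keepdims=True) + eps)
    return update * np.minimum(row, col)  # take the more conservative ratio
```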
Paperid:2150
Authors:Yingzhen Yang
Abstract:
We introduce a new tool, Transductive Local Complexity (TLC), to analyze the generalization performance of transductive learning methods and motivate new transductive learning algorithms. Our work extends the idea of the popular Local Rademacher Complexity (LRC) \citep{bartlett2005} to the transductive setting with considerable and novel changes compared to the analysis of typical LRC methods in the inductive setting. While LRC has been widely used as a powerful tool in the analysis of inductive models with sharp generalization bounds for classification and minimax rates for nonparametric regression, it remains an open problem whether a localized version of Rademacher complexity based tool can be designed and applied to transductive learning and gain sharp bound for transductive learning which is consistent with the inductive excess risk bound by (LRC) \citep{bartlett2005}. We give a confirmative answer to this open problem by TLC. Similar to the development of LRC \citep{Bartlett2003}, we build TLC by first establishing a novel and sharp concentration inequality for supremum of empirical processes for the gap between test and training loss in the setting of sampling uniformly without replacement. Then a peeling strategy and a new surrogate variance operator are use to derive the following excess risk bound in the transductive setting, which is consistent with that of the classical LRC based excess risk bound in the inductive setting. As an application of TLC, we use the new TLC tool to analyze the Transductive Kernel Learning (TKL) model, and derive sharper excess risk bound than that by the current stateof-the-art \citep{TolstikhinBK14-local-complexity-TRC} under the same assumptions. 
As a result of independent interest, the concentration inequality for the test-train process is used to derive a sharp concentration inequality for the general supremum of an empirical process involving random variables in the setting of sampling uniformly without replacement, with a comparison to existing concentration inequalities under the same setting.
Paperid:2151
Authors:Yushi Huang · Zining Wang · Ruihao Gong · Jing Liu · Xinjie Zhang · Jinyang Guo · Xianglong Liu · Jun Zhang
Abstract:
Diffusion Transformers (DiTs) excel in generative tasks but face practical deployment challenges due to high inference costs. Feature caching, which stores and retrieves redundant computations, offers the potential for acceleration. Existing learning-based caching, though adaptive, overlooks the impact of the prior timestep. It also suffers from misaligned objectives (*aligned predicted noise vs. high-quality images*) between training and inference. These two discrepancies compromise both performance and efficiency. To this end, we *harmonize* training and inference with a novel learning-based *caching* framework dubbed **HarmoniCa**. It first incorporates *Step-Wise Denoising Training* (SDT) to ensure the continuity of the denoising process, where prior steps can be leveraged. In addition, an *Image Error Proxy-Guided Objective* (IEPO) is applied to balance image quality against cache utilization through an efficient proxy that approximates the image error. Extensive experiments across $8$ models, $4$ samplers, and resolutions from $256\times256$ to $2K$ demonstrate the superior performance and speedup of our framework. For instance, it achieves over $40\%$ latency reduction (*i.e.*, $2.07\times$ theoretical speedup) and improved performance on PixArt-$\alpha$. Remarkably, our *image-free* approach reduces training time by $25\%$ compared with the previous method. *Our code will be released upon acceptance.*
Paperid:2152
Authors:Rong-Xi Tan · Ming Chen · Ke Xue · Yao Wang · Yaoyuan Wang · Fu Sheng · Chao Qian
Abstract:
The pursuit of universal black-box optimization (BBO) algorithms is a longstanding goal. However, unlike domains such as language or vision, where scaling structured data has driven generalization, progress in offline BBO remains hindered by the lack of unified representations for heterogeneous numerical spaces. Thus, existing offline BBO approaches are constrained to single-task and fixed-dimensional settings, failing to achieve cross-domain universal optimization. Recent advances in large language models (LLMs) offer a promising path forward: their embeddings capture latent relationships in a unifying way, making universal optimization across different data types possible. In this paper, we discuss multiple potential approaches, including an end-to-end learning framework in the form of next-token prediction, as well as prioritizing the learning of latent spaces with strong representational capabilities. To validate the effectiveness of these methods, we collect as many offline BBO tasks and data as possible from open-source academic works for training. Experiments demonstrate the universality and effectiveness of our proposed methods. Our findings suggest that unifying language model priors and learning a string embedding space can overcome traditional barriers in universal BBO, paving the way for general-purpose BBO algorithms. The code is provided in the supplemental materials.
Paperid:2153
Authors:Spandan Pyakurel · Qi Yu
Abstract:
Fine-Grained Open-set Detection (FGOD) poses a fundamental challenge due to the similarity between open-set classes and closed-set ones. Since real-world objects/entities tend to form a hierarchical structure, the fine-grained relationships among the closed-set classes as captured by the hierarchy could potentially improve FGOD performance. Intuitively, the hierarchical dependency among different classes allows the model to recognize their subtle differences, which in turn makes it better at differentiating similar open-set classes even when they share the same parent. However, simply performing open-set detection in a top-down fashion by building a local detector for each node may result in poor detection performance. Our theoretical analysis also reveals that maximizing the probability of the path leading to the ground-truth leaf node results in a sub-optimal training process. To systematically address this issue, we propose a novel state-based node representation, which constructs a state space based upon the entire hierarchical structure. We prove that the state-based representation is guaranteed to maximize the probability on the path leading to the ground-truth leaf node. Extensive experiments on multiple real-world hierarchical datasets clearly demonstrate the superior performance of the proposed method.
Paperid:2154
Authors:Stefano Viel · Luca Viano · Volkan Cevher
Abstract:
This paper introduces the SOAR framework for imitation learning. SOAR is an algorithmic template that learns a policy from expert demonstrations with a primal-dual style algorithm alternating cost and policy updates. Within the policy updates, the SOAR framework prescribes using an actor-critic method with multiple critics to estimate the critic uncertainty and thereby build an optimistic critic, which is fundamental to driving exploration. When instantiated in the tabular setting, we obtain a provable algorithm dubbed FRA with guarantees matching the best known results in $\epsilon$. Practically, the SOAR template is shown to consistently boost the performance of primal-dual IL algorithms that build on actor-critic routines for their policy updates. Thanks to SOAR, the number of episodes required to achieve the same performance is approximately halved.
Paperid:2155
Authors:Yang Li
Abstract:
Despite their remarkable capabilities, large language models often struggle with tasks requiring complex reasoning and planning. While existing approaches like Chain-of-Thought prompting and tree search techniques show promise, they are limited by their reliance on predefined heuristics and computationally expensive exploration strategies. We propose Policy-Guided Tree Search (PGTS), a framework that combines reinforcement learning with structured tree exploration to efficiently navigate reasoning paths. Our key innovation is a learned policy that dynamically decides between expanding, branching, backtracking, or terminating exploration, eliminating the need for manual heuristics or exhaustive search. Experiments across mathematical reasoning, logical deduction, and planning benchmarks demonstrate that PGTS achieves superior reasoning performance while significantly reducing computational costs compared to existing methods. These results establish PGTS as a scalable and effective solution for tackling complex reasoning tasks with LLMs.
Paperid:2156
Authors:Karina Zainullina · Aleksandr Golubev · Maria Trofimova · Sergei Polezhaev · Ibragim Badertdinov · Daria Litvintseva · Simon Karasik · Filipp Fisin · Sergei Skvortsov · Maksim Nekrashevich · Anton Shevtsov · Boris Yangel
Abstract:
Large language models (LLMs) have recently achieved remarkable results in complex multi-step tasks, such as mathematical reasoning and agentic software engineering. However, they often struggle to maintain consistent performance across multiple solution attempts. One effective approach to narrow the gap between average-case and best-case performance is guided test-time search, which explores multiple solution paths to identify the most promising one. Unfortunately, effective search techniques (e.g. MCTS) are often unsuitable for *non-serializable* RL environments, such as Docker containers, where intermediate environment states cannot be easily saved and restored. We investigate two complementary search strategies applicable to such environments: $1$-step lookahead and trajectory selection, both guided by a learned action-value function estimator. On the SWE-bench Verified benchmark, a key testbed for agentic software engineering, we find these methods to double the average success rate of a Qwen-72B-based model, achieving $40.8$%, the new state-of-the-art for open-weights models. Additionally, we show that these techniques are transferable to more advanced closed models, yielding similar improvements with GPT-4o.
Paperid:2157
Authors:Han Zhong · Yutong Yin · Shenao Zhang · Xiaojun Xu · Yuanxin Liu · Yifei Zuo · Zhihan Liu · Boyi Liu · Sirui Zheng · Hongyi Guo · Liwei Wang · Mingyi Hong · Zhaoran Wang
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, yet generating reliable reasoning processes remains a significant challenge. We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model incorporating latent thinking processes and evaluation signals. Our framework addresses two critical questions: (1) how to automatically generate high-quality reasoning processes during inference, and (2) how to integrate these processes into post-training. We propose the \emph{Bootstrapping Reinforced Thinking Process} (BRiTE) algorithm and demonstrate its theoretical convergence at a rate of $1/T$, where $T$ is the number of iterations. The algorithm operates in two steps. First, it generates high-quality rationales by approximating the desired posterior distribution using a reinforcement learning approach with a novel reward shaping mechanism. Second, it fine-tunes the base LLM by maximizing the joint probability of rationale generation with respect to LLM parameters. Empirical evaluation on GSM8K and MATH benchmarks demonstrates that our approach consistently improves performance across different model sizes without requiring human-annotated thinking processes, outperforming standard chain-of-thought prompting while enhancing existing post-training methods.
Paperid:2158
Authors:Clarissa Lauditi · Blake Bordelon · Cengiz Pehlevan
Abstract:
Previous influential work showed that infinite-width limits of neural networks in the lazy training regime are described by kernel machines. Here, we show that neural networks trained in the rich infinite-width regime in two different settings are also described by kernel machines, but with data-dependent kernels. For both cases, we provide explicit expressions for the kernel predictors and prescriptions to numerically calculate them. To derive the first predictor, we study the large-width limit of feature-learning Bayesian networks, showing how feature learning leads to task-relevant adaptation of layer kernels and preactivation densities. The saddle point equations governing this limit result in a min-max optimization problem that defines the kernel predictor. To derive the second predictor, we study gradient flow training of randomly initialized networks trained with weight decay in the infinite-width limit using dynamical mean field theory (DMFT). The fixed point equations of the arising DMFT define the task-adapted internal representations and the kernel predictor. We compare our kernel predictors to kernels derived from the lazy regime and demonstrate that our adaptive kernels achieve lower test loss on benchmark datasets.
Paperid:2159
Authors:Jeffrey A. Chan-Santiago · praveen tirupattur · Gaurav Kumar Nayak · Gaowen Liu · Mubarak Shah
Abstract:
Dataset distillation has emerged as an effective strategy, significantly reducing training costs and facilitating more efficient model deployment. Recent advances have leveraged generative models to distill datasets by capturing the underlying data distribution. Unfortunately, existing methods require model fine-tuning with distillation losses to encourage diversity and representativeness. However, these methods do not guarantee sample diversity, limiting their performance. We propose a mode-guided diffusion model that leverages a pre-trained diffusion model without the need to fine-tune with distillation losses. Our approach addresses dataset diversity in three stages: Mode Discovery to identify distinct data modes, Mode Guidance to enhance intra-class diversity, and Stop Guidance to mitigate artifacts in synthetic samples that affect performance. We evaluate our approach on ImageNette, ImageIDC, ImageNet-100, and ImageNet-1K, achieving accuracy improvements of 4.4%, 2.9%, 1.6%, and 1.6%, respectively, over state-of-the-art methods. Our method eliminates the need for fine-tuning diffusion models with distillation losses, significantly reducing computational costs.
Paperid:2160
Authors:Wenqin Liu · Haoze Hou · Erdun Gao · Biwei Huang · Qiuhong Ke · Howard Bondell · Mingming Gong
Abstract:
Score-based generative models are essential in various machine learning applications, with strong capabilities in generation quality. High-order derivatives (scores) of the data density offer deep insights into data distributions; building on the proven effectiveness of first-order scores for modeling and generating synthetic data, they unlock new possibilities for applications. However, learning them often requires complete data, which is frequently unavailable in fields such as healthcare and finance due to privacy concerns and high costs. To address this, we introduce MissScore, a novel framework for estimating high-order scores with missing data. We derive objective functions for estimating high-order scores under different missing data mechanisms and propose a new algorithm to handle missing data effectively. Our empirical results demonstrate that MissScore efficiently and accurately approximates high-order scores with missing data while improving sampling speed and data quality, as validated through several downstream tasks.
Paperid:2161
Authors:Thalles Silva · Helio Pedrini · Adín Ramírez Rivera
Abstract:
We present Self-Organizing Visual Prototypes (SOP), a new training technique for unsupervised visual feature learning. Unlike existing prototypical self-supervised learning (SSL) methods that rely on a single prototype to encode all the relevant features of a hidden cluster in the data, we propose the SOP strategy. In this strategy, a prototype is represented by many semantically similar representations, or support embeddings (SEs), each containing a complementary set of features that together fully characterize their region in space and maximize training performance. We reaffirm the feasibility of non-parametric SSL by introducing novel non-parametric adaptations of two loss functions implementing the SOP strategy. Notably, we introduce the SOP Masked Image Modeling (SOP-MIM) task, where masked representations are reconstructed from the perspective of multiple non-parametric local SEs. We comprehensively assess SOP's learned representations on benchmarks such as retrieval, linear evaluation, fine-tuning, and object detection. SOP achieves the best performance on many retrieval benchmarks and shows increased performance gains as more complex encoders are utilized.
Paperid:2162
Authors:Justin Chen · Vincent Cohen-Addad · Alessandro Epasto · Morteza Zadimoghaddam
Abstract:
In the differentially private partition selection problem (a.k.a. set union, key discovery), users hold subsets of items from an unbounded universe. The goal is to output as many items as possible from the union of the users' sets while maintaining user-level differential privacy. Solutions to this problem are a core building block for many privacy-preserving ML applications, including vocabulary extraction in a private corpus, computing statistics over categorical data, and learning embeddings over user-provided items. We propose an algorithm for this problem, MaxAdaptiveDegree (MAD), which adaptively reroutes weight from items with weight far above the threshold needed for privacy to items with smaller weight, thereby increasing the probability that less frequent items are output. Our algorithm can be efficiently implemented in massively parallel computation systems, allowing scalability to very large datasets. We prove that our algorithm stochastically dominates the standard parallel algorithm for this problem. We also develop a two-round version of our algorithm, MAD2R, where results of the computation in the first round are used to bias the weighting in the second round to maximize the number of items output. In experiments, our algorithms provide the best results across the board among parallel algorithms and scale to datasets with hundreds of billions of items, up to three orders of magnitude larger than those analyzed in prior works.
Paperid:2163
Authors:Zhen Sun · Yunhang Shen · Jie Li · Xing Sun · Pingyang Dai · Liujuan Cao · Rongrong Ji
Abstract:
Inspired by the success of Vision-Language Models (VLMs), an increasing number of researchers have been dedicated to improving VLMs, achieving encouraging results. Current VLMs are typically process-oriented, indirectly enhancing the quality of extracted image features by optimizing the structures of the Vision Encoder and Connector components. However, they lack direct methods to improve feature quality, especially when images contain rich and detailed semantic information, where performance can be suboptimal. To address this, we propose DS-VLM (Diffusion Supervision Vision Language Model), a result-oriented, plug-and-play approach. We leverage a diffusion model to directly supervise the features extracted by the two components: the diffusion model conditions on the supervised features to reconstruct the original image. Through the reconstruction loss, the extracted features are iteratively optimized to preserve the detailed semantic and structural information of the image to the greatest extent. Extensive experiments with various vision encoders and different scales of LLMs validate the effectiveness of our method.
Paperid:2164
Authors:Antonio Ocello · Daniil Tiapkin · Lorenzo Mancini · Mathieu Lauriere · Eric Moulines
Abstract:
We introduce Mean Field Trust Region Policy Optimization (MF-TRPO), a novel algorithm designed to compute approximate Nash equilibria for ergodic Mean Field Games (MFGs) in finite state-action spaces. Building on the well-established performance of TRPO in the reinforcement learning (RL) setting, we extend its methodology to the MFG framework, leveraging its stability and robustness in policy optimization. Under standard assumptions in the MFG literature, we provide a rigorous analysis of MF-TRPO, establishing theoretical guarantees on its convergence. Our results cover both the exact formulation of the algorithm and its sample-based counterpart, where we derive high-probability guarantees and finite sample complexity. This work advances MFG optimization by bridging RL techniques with mean-field decision-making, offering a theoretically grounded approach to solving complex multi-agent problems.
Paperid:2165
Authors:Minsung Hwang · Jaejun Lee · Joyce Whang
Abstract:
Inductive knowledge graph completion aims to predict missing triplets in an incomplete knowledge graph that differs from the one observed during training. While subgraph reasoning models have demonstrated empirical success in this task, their theoretical properties, such as stability and generalization capability, remain unexplored. In this work, we present the first theoretical analysis of the relationship between the stability and the generalization capability of subgraph reasoning models. Specifically, we define stability as the degree of consistency in a subgraph reasoning model's outputs in response to differences in input subgraphs and introduce the Relational Tree Mover's Distance as a metric to quantify the differences between the subgraphs. We then show that the generalization capability of subgraph reasoning models, defined as the discrepancy between the performance on training data and test data, is proportional to their stability. Furthermore, we empirically analyze the impact of stability on generalization capability using real-world datasets, validating our theoretical findings.
Paperid:2166
Authors:Zhaoyu Zhang · Yang Hua · Guanxiong Sun · Hui Wang · Seán McLoone
Abstract:
Diffusion-based generative models (diffusion models) often require a large amount of data to train a score-based model that learns the score function of the data distribution through denoising score matching. However, collecting and cleaning such data can be expensive, time-consuming, and even infeasible. In this paper, we present a novel theoretical insight for diffusion models: two factors, i.e., the denoiser function hypothesis space and the number of training samples, affect the denoising score matching error over all training samples. This insight makes clear that the total denoising score matching error is hard to minimize within the denoiser function hypothesis space of existing methods when training diffusion models with limited data. To address this, we propose a new diffusion model for limited data called Limited Data Diffusion (LD-Diffusion), which consists of two main components: a compressing model and a novel mixed augmentation with fixed probability (MAFP) strategy. Specifically, the compressing model constrains the complexity of the denoiser function hypothesis space, and MAFP effectively increases the training samples by providing more informative guidance in the compressed hypothesis space than existing data augmentation methods. Extensive experiments on several datasets demonstrate that LD-Diffusion improves the training of diffusion models with limited data and achieves better performance than other diffusion models.
Paperid:2167
Authors:Mikołaj Małkiński · Szymon Pawlonka · Jacek Mańdziuk
Abstract:
Abstract visual reasoning (AVR) involves discovering shared concepts across images through analogy, akin to solving IQ test problems. Bongard Problems (BPs) remain a key challenge in AVR, requiring both visual reasoning and verbal description. We investigate whether multimodal large language models (MLLMs) can solve BPs by formulating a set of diverse MLLM-suited solution strategies and testing 4 proprietary and 4 open-access models on 3 BP datasets featuring synthetic (classic BPs) and real-world (Bongard-HOI and Bongard-OpenWorld) images. Despite some successes on real-world datasets, MLLMs struggle with synthetic BPs. To explore this gap, we introduce Bongard-RWR, a dataset representing synthetic BP concepts using real-world images. Our findings suggest that the weak MLLM performance on classic BPs is not due to domain specificity but rather stems from their general AVR limitations.
Paperid:2168
Authors:Changyi He · Yifu Ding · Jinyang Guo · Ruihao Gong · Haotong Qin · Xianglong Liu
Abstract:
Although knowledge distillation (KD) is an effective approach to improve the performance of a smaller LLM (i.e., the student model) by transferring knowledge from a large LLM (i.e., the teacher model), it still suffers from high training cost. Existing LLM distillation methods ignore the difficulty differences among samples, making the distillation of easy samples unnecessary. This leads to high distillation cost. In this paper, we propose a difficulty-aware knowledge distillation (DA-KD) framework for efficient knowledge distillation, in which we dynamically adjust the distillation dataset based on the difficulty of samples. We further observe that existing KD losses cannot perform well when most samples in the distillation dataset are difficult, because of unstable optimization and the neglect of hard samples. Therefore, we also propose a new KD loss called bidirectional discrepancy loss (BDL) for effective KD. Extensive experiments demonstrate that our DA-KD framework is effective and efficient. Without bells and whistles, DA-KD can outperform existing state-of-the-art KD methods by 2\% with half the training cost and even surpass the teacher model with 4.7$\times$ compression.
Paperid:2169
Authors:Geonho Hwang · Yeachan Park · Wonyeol Lee · Sejun Park
Abstract:
Existing works on the expressive power of neural networks typically assume real network parameters and exact mathematical operations during the evaluation of networks. However, neural networks executed by a computer can only have parameters in a small subset of reals and perform inexact mathematical operations with roundoff errors and overflow. In this work, we study the expressive power of floating-point neural networks, i.e., networks under floating-point arithmetic. We first observe that for floating-point neural networks to represent all functions from floating-point vectors to floating-point numbers, it is necessary to distinguish different inputs: the first layer of a network should be able to generate different outputs for different inputs. We also prove that such distinguishability is sufficient along with mild conditions on activation functions and the target function class. Our result shows that for practical activation functions, neural networks can represent almost all floating-point functions.
Paperid:2170
Authors:Menghua Wu · Umesh Padia · Sean Murphy · Regina Barzilay · Tommi Jaakkola
Abstract:
Identifying variables responsible for changes to a biological system enables applications in drug target discovery and cell engineering. Given a pair of observational and interventional datasets, the goal is to isolate the subset of observed variables that were the targets of the intervention. Directly applying causal discovery algorithms is challenging: the data may contain thousands of variables with as few as tens of samples per intervention, and biological systems do not adhere to classical causality assumptions. We propose a causality-inspired approach to address this practical setting. First, we infer noisy causal graphs from the observational and interventional data. Then, we learn to map the differences between these graphs, along with additional statistical features, to sets of variables that were intervened upon. Both modules are jointly trained in a supervised framework, on simulated and real data that reflect the nature of biological interventions. This approach consistently outperforms baselines for perturbation modeling on seven single-cell transcriptomics datasets. We also demonstrate significant improvements over current causal discovery methods for predicting soft and hard intervention targets across a variety of synthetic data.
Paperid:2171
Authors:Valentin De Bortoli · Alexandre Galashov · J Swaroop Guntupalli · Guangyao Zhou · Kevin Murphy · Arthur Gretton · Arnaud Doucet
Abstract:
Diffusion models generate high-quality synthetic data. They operate by defining a continuous-time forward process which gradually adds Gaussian noise to data until fully corrupted. The corresponding reverse process progressively ``denoises'' a Gaussian sample into a sample from the data distribution. However, generating high-quality outputs requires many discretization steps to obtain a faithful approximation of the reverse process. This is expensive and has motivated the development of many acceleration methods. We propose to accomplish sample generation by learning the posterior {\em distribution} of clean data samples given their noisy versions, instead of only the mean of this distribution. This allows us to sample from the probability transitions of the reverse process on a coarse time scale, significantly accelerating inference with minimal degradation of the quality of the output. This is accomplished by replacing the standard regression loss used to estimate conditional means with a scoring rule. We validate our method on image and robot trajectory generation, where we consistently outperform standard diffusion models at few discretization steps.
Paperid:2172
Authors:Zelai Xu · Wanjun Gu · Chao Yu · Yi Wu · Yu Wang
Abstract:
Large language model (LLM)-based agents have recently shown impressive progress in a variety of domains, including open-ended conversation and multi-step decision-making. However, applying these agents to social deduction games such as Werewolf, which requires both strategic decision-making and free-form language interaction, remains non-trivial. Traditional methods based on Counterfactual Regret Minimization (CFR) or reinforcement learning (RL) typically depend on a predefined action space, making them unsuitable for language games with unconstrained text action spaces. Meanwhile, pure LLM-based agents often suffer from intrinsic biases and require prohibitively large datasets for fine-tuning. We propose Latent Space Policy Optimization (LSPO), an iterative framework that addresses these challenges by first mapping free-form text to a discrete latent space, where methods like CFR and RL can learn strategic policies more effectively. We then translate the learned policy back into natural language dialogues, which are used to fine-tune an LLM via Direct Preference Optimization (DPO). By iteratively alternating between these stages, our LSPO agent progressively enhances both strategic reasoning and language communication. Experiment results on the Werewolf game show that our method improves the agent's performance in each iteration and outperforms existing Werewolf agents, underscoring its promise for free-form language decision-making.
Paperid:2173
Authors:Shuanghao Bai · Wanqi Zhou · Pengxiang Ding · Wei Zhao · Donglin Wang · Badong Chen
Abstract:
Behavior Cloning (BC) is a widely adopted visual imitation learning method in robot manipulation. Current BC approaches often enhance generalization by leveraging large datasets and incorporating additional visual and textual modalities to capture more diverse information. However, these methods overlook whether the learned representations contain redundant information and lack a solid theoretical foundation to guide the learning process. To address these limitations, we adopt an information-theoretic perspective and introduce mutual information to quantify and mitigate redundancy in latent representations. Building on this, we incorporate the Information Bottleneck (IB) principle into BC, which extends the idea of reducing redundancy by providing a structured framework for compressing irrelevant information while preserving task-relevant features. This work presents the first comprehensive study on redundancy in latent representations across various methods, backbones, and experimental settings, while extending the generalizability of the IB to BC. Extensive experiments and analyses on the CortexBench and LIBERO benchmarks demonstrate significant performance improvements with IB, underscoring the importance of reducing input data redundancy and highlighting its value for practical applications. The code can be found in the supplementary materials.
Paperid:2174
Authors:Joery de Vries · Jinke He · Yaniv Oren · Matthijs T. J. Spaan
Abstract:
Monte-Carlo tree search (MCTS) has driven many recent breakthroughs in deep reinforcement learning (RL). However, scaling MCTS to parallel compute has proven challenging in practice, which has motivated alternative planners like sequential Monte-Carlo (SMC). Many of these SMC methods adopt particle filters for smoothing through a reformulation of RL as a policy inference problem. Yet, persisting design choices of these particle filters often conflict with the aim of online planning in RL, which is to obtain a policy improvement at the start of planning. Drawing inspiration from MCTS, we tailor SMC planners specifically for RL by improving data generation within the planner through constrained action sampling and explicit terminal state handling, as well as improving policy and value target estimation. This leads to our Trust-Region Twisted SMC (TRT-SMC), which shows improved runtime and sample-efficiency over baseline MCTS and SMC methods in both discrete and continuous domains.
Paperid:2175
Authors:Hongyin Zhang · Zifeng Zhuang · Han Zhao · Pengxiang Ding · Hongchao Lu · Donglin Wang
Abstract:
Vision-Language-Action (VLA) models have shown great potential in general robotic decision-making tasks via imitation learning. However, the variable quality of training data often constrains the performance of these models. On the other hand, offline Reinforcement Learning (RL) excels at learning robust policy models from mixed-quality data. In this paper, we introduce Reinforced robot GPT (ReinboT), a novel end-to-end VLA model that integrates the RL principle of maximizing cumulative reward. ReinboT achieves a deeper understanding of the data quality distribution by predicting dense returns that capture the nuances of manipulation tasks. The dense return prediction capability enables the robot to generate more robust decision-making actions, oriented towards maximizing future benefits. Extensive experiments show that ReinboT achieves state-of-the-art performance on the CALVIN mixed-quality dataset and exhibits superior few-shot learning and out-of-distribution generalization capabilities in real-world tasks.
Paperid:2176
Authors:Jie Peng · Jenna Ballard · Mohan Zhang · Sukwon Yun · Jiayi Xin · Qi Long · Yanyong Zhang · Tianlong Chen
Abstract:
Medical multimodal learning requires an effective fusion capability of various heterogeneous modalities. One vital challenge is how to effectively fuse modalities when their data quality varies across different modalities and patients. For example, in the TCGA benchmark, the performance of the same modality can differ between types of cancer. Moreover, data collected at different times, locations, and with varying reagents can introduce inter-modal data quality differences ($i.e.$, $\textbf{Modality Batch Effect}$). In response, we propose ${\textbf{A}}$daptive ${\textbf{M}}$odality Token Re-Balan${\textbf{C}}$ing ($\texttt{AMC}$), a novel top-down dynamic multi-modal fusion approach. The core of $\texttt{AMC}$ is to quantify the significance of each modality (Top) and then fuse them according to the modality importance (Down). Specifically, we assess the quality of each input modality and then replace uninformative tokens with inter-modal tokens accordingly. The more important a modality is, the more informative tokens are retained from that modality. The self-attention will further integrate these mixed tokens to fuse multi-modal knowledge. Comprehensive experiments on both medical and general multi-modal datasets demonstrate the effectiveness and generalizability of $\texttt{AMC}$.
Paperid:2177
Authors:Christophe Roux · David Martinez-Rubio · Sebastian Pokutta
Abstract:
We introduce a Riemannian optimistic online learning algorithm for Hadamard manifolds based on inexact implicit updates. Unlike prior work, our method can handle in-manifold constraints, and matches the best known regret bounds in the Euclidean setting with no dependence on geometric constants, like the minimum curvature. Building on this, we develop algorithms for g-convex, g-concave smooth min-max problems on Hadamard manifolds. Notably, one method nearly matches the gradient oracle complexity of the lower bound for Euclidean problems, for the first time.
Paperid:2178
Authors:Amirhossein Kazemnejad · Milad Aghajohari · Eva Portelance · Alessandro Sordoni · Siva Reddy · Aaron Courville · Nicolas Le Roux
Abstract:
Large language models (LLMs) are increasingly applied to complex reasoning tasks that require executing several complex steps before receiving any reward. Properly assigning credit to these steps is essential for enhancing model performance. Proximal Policy Optimization (PPO), a common reinforcement learning (RL) algorithm used for LLM fine-tuning, employs value networks to tackle credit assignment. However, recent approaches achieve strong results without it, raising questions about the efficacy of value networks in practice. In this work, we systematically evaluate the efficacy of value networks and reveal their significant shortcomings in reasoning-heavy LLM tasks, showing that they often produce poor estimates of expected return and barely outperform a random baseline when comparing alternative steps. This motivates our key question: Can improved credit assignment enhance RL training for LLMs? To address this, we propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates. Our method consistently outperforms PPO and other baselines across MATH and GSM8K datasets in less wall-clock time (up to 3.0x). Crucially, it achieves higher test accuracy for a given training accuracy, capturing more generalization signal per sample. These results emphasize the importance of accurate credit assignment in RL training of LLMs.
Paperid:2179
Authors:Stephan Rabanser · Ali Shahin Shamsabadi · Olive Franzese · Xiao Wang · Adrian Weller · Nicolas Papernot
Abstract:
Cautious predictions—where a machine learning model abstains when uncertain—are crucial for limiting harmful errors in safety-critical applications. In this work, we identify a novel threat: a dishonest institution can exploit these mechanisms to discriminate or unjustly deny services under the guise of uncertainty. We demonstrate the practicality of this threat by introducing an uncertainty-inducing attack called Mirage, which deliberately reduces confidence in targeted input regions, thereby covertly disadvantaging specific individuals. At the same time, Mirage maintains high predictive performance across all data points. To counter this threat, we propose Confidential Guardian, a framework that analyzes calibration metrics on a reference dataset to detect artificially suppressed confidence. Additionally, it employs zero-knowledge proofs of verified inference to ensure that reported confidence scores genuinely originate from the deployed model. This prevents the provider from fabricating arbitrary model confidence values while protecting the model’s proprietary details. Our results confirm that Confidential Guardian effectively prevents the misuse of cautious predictions, providing verifiable assurances that abstention reflects genuine model uncertainty rather than malicious intent.
Paperid:2180
Authors:Philippe Hansen-Estruch · David Yan · Ching-Yao Chuang · Orr Zohar · Jialiang Wang · Tingbo Hou · Tao Xu · Sriram Vishwanath · Peter Vajda · Xinlei Chen
Abstract:
Visual tokenization via autoencoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. However, questions remain about how auto-encoder design impacts reconstruction and downstream generative performance. This work explores scaling in auto-encoders for reconstruction and generation by replacing the convolutional backbone with an enhanced Vision Transformer for Tokenization (ViTok). We find scaling the auto-encoder bottleneck correlates with reconstruction but exhibits a nuanced relationship with generation. Separately, encoder scaling yields no gains, while decoder scaling improves reconstruction with minimal impact on generation. As a result, we determine that scaling the current paradigm of auto-encoders is not effective for improving generation performance. Coupled with Diffusion Transformers, ViTok achieves competitive image reconstruction and generation performance on 256p and 512p ImageNet-1K. In videos, ViTok achieves SOTA reconstruction and generation performance on 16-frame 128p UCF-101.
Paperid:2181
Authors:Maksim Zhdanov · Max Welling · Jan-Willem van de Meent
Abstract:
Large-scale physical systems defined on irregular grids pose significant scalability challenges for deep learning methods, especially in the presence of long-range interactions and multi-scale coupling. Traditional approaches that compute all pairwise interactions, such as attention, become computationally prohibitive as they scale quadratically with the number of nodes. We present Erwin, a hierarchical transformer inspired by methods from computational many-body physics, which combines the efficiency of tree-based algorithms with the expressivity of attention mechanisms. Erwin employs ball tree partitioning to organize computation, which enables linear-time attention by processing nodes in parallel within local neighborhoods of fixed size. Through progressive coarsening and refinement of the ball tree structure, complemented by a novel cross-ball interaction mechanism, it captures both fine-grained local details and global features. We demonstrate Erwin's effectiveness across multiple domains, including cosmology, molecular dynamics, and particle fluid dynamics, where it consistently outperforms baseline methods both in accuracy and computational efficiency.
Paperid:2182
Authors:Muhammed Yusuf Kocyigit · Eleftheria Briakou · Daniel Deutsch · Jiaming Luo · Colin Cherry · Markus Freitag
Abstract:
Data contamination—the accidental consumption of evaluation examples within the pretraining data—can undermine the validity of evaluation benchmarks. In this paper, we present a rigorous analysis of the effects of contamination on language models at 1B and 8B scales on the machine translation task. Starting from a carefully decontaminated train-test split, we systematically introduce contamination at various stages, scales, and data formats to isolate its effect and measure its impact on performance metrics. Our experiments reveal that contamination with both source and target substantially inflates BLEU scores, and this inflation is 2.5 times larger (up to 30 BLEU points) for 8B compared to 1B models. In contrast, source-only and target-only contamination generally produce smaller, less consistent over-estimations. Finally, we study how the temporal distribution and frequency of contaminated samples influence performance over-estimation across languages with varying degrees of data resources.
Paperid:2183
Authors:Dmitrii Kharlapenko · Stepan Shabalin · Arthur Conmy · Neel Nanda
Abstract:
Sparse autoencoders (SAEs) are a popular tool for interpreting large language model activations, but their utility in addressing open questions in interpretability remains unclear. In this work, we demonstrate their effectiveness by using SAEs to deepen our understanding of the mechanism behind in-context learning (ICL). We identify abstract SAE features that (i) encode the model’s knowledge of which task to execute and (ii) whose latent vectors causally induce the task zero-shot. This aligns with prior work showing that ICL is mediated by task vectors. We further demonstrate that these task vectors are well approximated by a sparse sum of SAE latents, including these task-execution features. To explore the ICL mechanism, we adapt the sparse feature circuits methodology of Marks et al. (2024) to work for the much larger Gemma-1 2B model, with 30 times as many parameters, and to the more complex task of ICL. Through circuit finding, we discover task-detecting features with corresponding SAE latents that activate earlier in the prompt and detect when tasks have been performed. They are causally linked with task-execution features through the attention and MLP sublayers.
Paperid:2184
Authors:Hugh Dance · Pierre Glaser · Peter Orbanz · Ryan P. Adams
Abstract:
With the advent of automatic vectorization tools (e.g., \verb|vmap|), writing multi-chain MCMC algorithms is often now as simple as invoking those tools on single-chain code. Whilst convenient, for various MCMC algorithms this results in a synchronization problem---loosely speaking, at each iteration all chains running in parallel must wait until the last chain has finished drawing its sample. In this work, we show how to design single-chain MCMC algorithms in a way that avoids these synchronization overheads when vectorizing with \verb|vmap|, by using the framework of finite state machines (FSMs). Using a simplified model, we derive (i) an exact theoretical form of the obtainable speed-ups using our approach; and (ii) principled recommendations for optimal algorithm design. We implement several popular MCMC algorithms as FSMs, including Elliptical Slice Sampling, HMC-NUTS, and Delayed Rejection, empirically demonstrating speed-ups of up to an order of magnitude in experiments.
Paperid:2185
Authors:Andreas Kalavas · Evangelos Kipouridis · Nithin Varma
Abstract:
In the Correlation Clustering problem, we are given an undirected graph and are tasked with computing a clustering (partition of the nodes) that minimizes the number of edges across different clusters plus the number of non-edges within clusters. In the constrained version of this problem, the goal is to compute a clustering that satisfies additional hard constraints requiring certain pairs to be in the same cluster and certain pairs to be in different clusters. Constrained Correlation Clustering is APX-Hard, and the best known approximation factor is $3$ (van Zuylen *et al.* [SODA '07]). In this work, we present an algorithm that gives a better-than-$2$ approximation, thereby significantly improving the state-of-the-art.
Paperid:2186
Authors:Alexander Norcliffe · Changhee Lee · Fergus Imrie · M van der Schaar · Pietro Lió
Abstract:
Active Feature Acquisition is an instance-wise, sequential decision making problem. The aim is to dynamically select which feature to measure based on current observations, independently for each test instance. Common approaches either use Reinforcement Learning, which experiences training difficulties; or greedily maximize the conditional mutual information of the label and unobserved features, which makes myopic acquisitions. To address these shortcomings, we introduce a latent variable model, trained in a supervised manner. Acquisitions are made by reasoning about the features across many possible unobserved realizations in a stochastic latent space. Extensive evaluation on a large range of synthetic and real datasets demonstrates that our approach reliably outperforms a diverse set of baselines.
Paperid:2187
Authors:Yan Chen · Jerry Bai · Yiteng Zhang · Maria Dimakopoulou · Shi Dong · Qi Sun · Zhengyuan Zhou
Abstract:
Designing learning agents that explore efficiently in a complex environment has been widely recognized as a fundamental challenge in reinforcement learning. While a number of works have demonstrated the effectiveness of techniques based on randomized value functions on a single agent, it remains unclear, from a theoretical point of view, whether injecting randomization can help a society of agents {\it concurrently} explore an environment. The theoretical results established in this work tender an affirmative answer to this question. We adapt the concurrent learning framework to \textit{randomized least-squares value iteration} (RLSVI) with \textit{aggregated state representation}. We demonstrate polynomial worst-case regret bounds in both finite- and infinite-horizon environments. In both setups the per-agent regret decreases at an optimal rate of $\Theta\left(\frac{1}{\sqrt{N}}\right)$, highlighting the advantage of concurrent learning. Our algorithm exhibits significantly lower space complexity compared to \cite{russo2019worst} and \cite{agrawal2021improved}. We reduce the space complexity by a factor of $K$ while incurring only a $\sqrt{K}$ increase in the worst-case regret bound, compared to \citep{agrawal2021improved,russo2019worst}. Interestingly, our algorithm improves the worst-case regret bound of \cite{russo2019worst} by a factor of $H^{1/2}$, matching the improvement in \cite{agrawal2021improved}. However, this result is achieved through a fundamentally different algorithmic enhancement and proof technique. Additionally, we conduct numerical experiments to demonstrate our theoretical findings.
Paperid:2188
Authors:Guoxuan Chen · Han Shi · jiawei li · Yihang Gao · Xiaozhe Ren · Yimeng Chen · Xin Jiang · Zhenguo Li · Weiyang Liu · Chao Huang
Abstract:
Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. However, their substantial sizes pose considerable challenges, particularly in computational demands and inference speed, due to their quadratic complexity. In this work, we have identified a key pattern: certain seemingly meaningless special tokens (i.e., separators) contribute disproportionately to attention scores compared to semantically meaningful tokens. This observation suggests that information of the segments between these separator tokens can be effectively condensed into the separator tokens themselves without significant information loss. Guided by this insight, we introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and eliminating redundant tokens. Additionally, we implement efficient kernels for training acceleration. Experimental results across training-free, training-from-scratch, and post-training settings demonstrate SepLLM's effectiveness. Notably, using the Llama-3-8B backbone, SepLLM achieves over 50% reduction in KV cache on the GSM8K-CoT benchmark while maintaining comparable performance. Furthermore, in streaming settings, SepLLM effectively processes sequences of up to 4 million tokens or more while maintaining consistent language modeling capabilities.
Paperid:2189
Authors:Meilin Wang · Wei Huang · Mingming Gong · Zheng Zhang
Abstract:
Density ratio estimation (DRE) is a paramount task in machine learning, for its broad applications across multiple domains, such as covariate shift adaptation, causal inference, independence tests and beyond. Parametric methods for estimating the density ratio possibly lead to biased results if models are misspecified, while conventional nonparametric methods suffer from the curse of dimensionality when the dimension of data is large. To address these challenges, in this paper, we propose a novel approach for DRE based on the projection pursuit (PP) approximation. The proposed method leverages PP to mitigate the impact of high dimensionality while retaining the model flexibility needed for the accuracy of DRE. We establish the consistency and the convergence rate for the proposed estimator. Experimental results demonstrate that our proposed method outperforms existing alternatives in various applications.
Paperid:2190
Authors:Dan-Xuan Liu · Chao Qian
Abstract:
The subset selection problem with a monotone and submodular objective function under a linear cost constraint has wide applications, such as maximum coverage, influence maximization, and feature selection, just to name a few. Various greedy algorithms have been proposed with good performance both theoretically and empirically. Recently, evolutionary algorithms (EAs), inspired by Darwin's evolution theory, have emerged as a prominent methodology, offering both empirical advantages and theoretical guarantees. Among these, the multi-objective EA, POMC, has demonstrated the best empirical performance to date, achieving an approximation guarantee of $(1/2)(1-1/e)$. However, there remains a gap in the approximation bounds of EAs compared to greedy algorithms, and their full theoretical potential is yet to be realized. In this paper, we re-analyze the approximation performance of POMC theoretically, and derive an improved guarantee of $1/2$, which thus provides theoretical justification for its encouraging empirical performance. Furthermore, we propose a novel multi-objective EA, EPOL, which not only achieves the best-known practical approximation guarantee of $0.6174$, but also delivers superior empirical performance in applications of maximum coverage and influence maximization. We hope this work can not only help better solve the subset selection problem, but also enhance our theoretical understanding of EAs.
Paperid:2191
Authors:Qitian Wu · Chenxiao Yang · Kaipeng Zeng · Michael Bronstein
Abstract:
The capability of generalization is a cornerstone for the success of modern learning systems. For non-Euclidean data that particularly involves topological structures, one important aspect neglected by prior studies is how learning-based models generalize under topological shifts. This paper makes steps towards addressing the topological shifts through the lens of a principled machine learning model dubbed as Advective Diffusion Transformer. The model is derived from advective diffusion equations which describe a class of continuous message passing process with observed and latent topological structures. To justify the model designs, we show that the proposed model has provable capacity for preserving desired robustness against topological shifts, which in contrast cannot be guaranteed by commonly used local diffusion models. Empirically, the model demonstrates superiority in various downstream predictive tasks across information networks, molecular screening and protein interactions.
Paperid:2192
Authors:Xingyu Chen · Jiahao Xu · Tian Liang · Zhiwei He · Jianhui Pang · Dian Yu · Linfeng Song · Qiuzhi Liu · Mengfei Zhou · Zhuosheng Zhang · Rui Wang · Zhaopeng Tu · Haitao Mi · Dong Yu
Abstract:
The remarkable performance of models like the OpenAI o1 can be attributed to their ability to emulate human-like long-time thinking during inference. These models employ extended chain-of-thought (CoT) processes, exploring multiple strategies to enhance problem-solving capabilities. However, a critical question remains: how to intelligently and efficiently scale computational resources during testing? This paper presents the first comprehensive study on the prevalent issue of overthinking in these models, where excessive computational resources are allocated for simple problems with minimal benefit. We introduce novel efficiency metrics from both outcome and process perspectives to evaluate the rational use of computational resources by o1-like models. Using a self-training paradigm, we propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy. Experimental results show that our approach successfully reduces computational overhead while preserving model performance across a range of test sets with varying difficulty levels, such as GSM8K, MATH500, GPQA, and AIME.
Paperid:2193
Authors:Doron Haviv · Aram-Alexandre Pooladian · Dana Pe'er · Brandon Amos
Abstract:
Generative modeling typically concerns transporting a single source distribution to a target distribution via simple probability flows. However, in fields like computer graphics and single-cell genomics, samples themselves can be viewed as distributions, where standard flow matching ignores their inherent geometry. We propose Wasserstein flow matching (WFM), which lifts flow matching onto families of distributions using the Wasserstein geometry. Notably, WFM is the first algorithm capable of generating distributions in high dimensions, whether represented analytically (as Gaussians) or empirically (as point-clouds). Our theoretical analysis establishes that Wasserstein geodesics constitute proper conditional flows over the space of distributions, making for a valid FM objective. Our algorithm leverages optimal transport theory and the attention mechanism, demonstrating versatility across computational regimes: exploiting closed-form optimal transport paths for Gaussian families, while using entropic estimates on point-clouds for general distributions. WFM successfully generates both 2D \& 3D shapes and high-dimensional cellular microenvironments from spatial transcriptomics data. Code is available at WassersteinFlowMatching.
Paperid:2194
Authors:Mason O. Smith · Wenlong Zhang
Abstract:
Humans are often modeled as rational actors by interactive agents when they are in fact frequently observed to make biased decisions. This erroneous assumption may cause an agent's model of the human to fail, especially when interaction occurs in bias-inducing settings that prompt risky decisions. To address this, this paper formulates a risk-sensitive multi-agent coordination problem and presents the novel Risk-Sensitive Theory of Mind (RS-ToM) framework that allows an autonomous agent to reason about and adapt to a partner of unknown risk-sensitivity. In simulated studies, we show that an agent with a RS-ToM is able to coordinate with such a partner better than one that assumes they are rational. Thus, we observe significant improvements to team performance, coordination fluency, compliance with partner risk-preferences, and predictability. The presented results suggest that a RS-ToM will be able to model and plan with partners that exhibit these risk-sensitive biases in the real world.
Paperid:2195
Authors:Jun Wu · jiangtao wen · Yuxing Han
Abstract:
The rapid advancement of large language models (LLMs) has driven extensive research into parameter compression after training has been completed, yet compression during the training phase remains largely unexplored. In this work, we introduce Rate-Constrained Training (RCT), a novel training-time compression approach based on rate-distortion optimization (RDO). RCT enables a flexible trade-off between model accuracy and complexity, significantly reducing parameter redundancy while preserving performance. Experiments across various architectures and tasks demonstrate that RCT can reduce memory usage by 60\% - 90\% without accuracy loss and provides significant compression gain compared to compression after training. Moreover, RCT proves to be highly versatile: it enhances generalization with small Lagrange multipliers, improves model robustness to pruning (maintaining accuracy even at 80\% pruning rates), and enables network simplification for accelerated inference on edge devices.
Paperid:2196
Authors:Kevin Xiao · Noah Marshall · Atish Agarwala · Elliot Paquette
Abstract:
In recent years, SignSGD has garnered interest as both a practical optimizer as well as a simple model to understand adaptive optimizers like Adam. Though there is a general consensus that SignSGD acts to precondition optimization and reshapes noise, quantitatively understanding these effects in theoretically solvable settings remains difficult. We present an analysis of SignSGD in a high dimensional limit, and derive a limiting SDE and ODE to describe the risk. Using this framework we quantify four effects of SignSGD: effective learning rate, noise compression, diagonal preconditioning, and gradient noise reshaping. Our analysis is consistent with experimental observations but moves beyond that by quantifying the dependence of these effects on the data and noise distributions. We conclude with a conjecture on how these results might be extended to Adam.
Paperid:2197
Authors:Shelvia Wongso · Rohan Ghosh · Mehul Motani
Abstract:
Estimating the confidence of deep neural network predictions is crucial for ensuring safe deployment in high-stakes applications. Softmax probabilities, though commonly used, are often poorly calibrated, and existing calibration methods have been shown to be harmful for failure prediction tasks. In this paper, we propose to use information-theoretic measures to estimate the confidence of predictions from trained networks in a post-hoc manner, without needing to modify their architecture or training process. In particular, we compare three pointwise information (PI) measures: pointwise mutual information (PMI), pointwise $\mathcal{V}$-information (PVI), and the recently proposed pointwise sliced mutual information (PSI). We show in this paper that these PI measures naturally relate to confidence estimation. We first study the invariance properties of these PI measures with respect to a broad range of transformations. We then study the sensitivity of the PI measures to geometric attributes such as margin and intrinsic dimensionality, as well as their convergence rates. We finally conduct extensive experiments on benchmark computer vision models and datasets and compare the effectiveness of these measures as tools for confidence estimation. A notable finding is that PVI is better than PMI and PSI for failure prediction and confidence calibration, outperforming all existing baselines for post-hoc confidence estimation. This is consistent with our theoretical findings, which suggest that PVI is the most well-balanced measure in terms of its invariance properties and sensitivity to geometric feature properties such as sample-wise margin.
Paperid:2198
Authors:Jiahe Du · Kaixiong Zhou · Xinyu Hong · Zhaozhuo Xu · Jinbo Xu · Xiao Huang
Abstract:
Generating novel enzymes for target molecules in zero-shot scenarios is a fundamental challenge in biomaterial synthesis and chemical production. Without known enzymes for a target molecule, training generative models becomes difficult due to the lack of direct supervision. To address this, we propose a retrieval-augmented generation method that uses existing enzyme-substrate data to guide enzyme design. Our method retrieves enzymes with substrates that share structural similarities with the target molecule, leveraging functional similarities in catalytic activity. Since none of the retrieved enzymes directly catalyze the target molecule, we use a conditioned discrete diffusion model to generate new enzymes based on the retrieved examples. An enzyme-substrate relationship classifier guides the generation process to ensure optimal protein sequence distributions. We evaluate our model on enzyme design tasks with diverse real-world substrates and show that it outperforms existing protein generation methods in catalytic capability, foldability, and docking accuracy. Additionally, we define the zero-shot substrate-specified enzyme generation task and introduce a dataset with evaluation benchmarks.
Paperid:2199
Authors:Zihan Chen · Song Wang · Zhen Tan · Jundong Li · Cong Shen
Abstract:
In-Context Learning (ICL) empowers Large Language Models (LLMs) to tackle diverse tasks by incorporating multiple input-output examples, known as demonstrations, into the input of LLMs. More recently, advancements in the expanded context windows of LLMs have led to many-shot ICL, which uses hundreds of demonstrations and outperforms few-shot ICL, which relies on fewer examples. However, this approach is often hindered by the high cost of obtaining large amounts of labeled data. To address this challenge, we propose Many-Shot Adaptive Pseudo-LabEling, namely MAPLE, a novel influence-based many-shot ICL framework that utilizes pseudo-labeled samples to compensate for the lack of label information. We first identify a subset of impactful unlabeled samples and perform pseudo-labeling on them by querying LLMs. These pseudo-labeled samples are then adaptively selected and tailored to each test query as input to improve the performance of many-shot ICL without significant labeling costs. Extensive experiments on real-world datasets demonstrate the effectiveness of our framework, showcasing its ability to enhance LLM adaptability and performance with limited labeled data. Our code is provided at https://anonymous.4open.science/r/MAPLE_ICL-5E3D.
Paperid:2200
Authors:Zechen Bai · Hai Ci · Mike Zheng Shou
Abstract:
Synthetic videos are now widely used to complement the scarcity and limited diversity of real-world videos. Current synthetic datasets primarily replicate real-world scenarios, leaving impossible, counterfactual and anti-reality video concepts underexplored. This work aims to answer two questions: 1) Are today's video understanding models good enough for understanding impossible videos? 2) Can today's video generation models effectively follow prompts to create impossible video content? To this end, we introduce IPV-Bench, a novel benchmark designed to evaluate and foster progress in video understanding and generation. IPV-Bench is underpinned by a comprehensive taxonomy, encompassing 4 domains and 14 categories. It features diverse scenes that defy physical, biological, geographical, or social laws. Based on the taxonomy, a prompt suite is constructed to evaluate video generation models, challenging their prompt-following and creativity capabilities. In addition, a video benchmark is curated to assess Video-LLMs on their ability to understand impossible videos, which particularly requires reasoning on temporal dynamics and world knowledge. Comprehensive evaluations reveal limitations and insights for future directions of video models, paving the way for next-generation video models.
Paperid:2201
Authors:Xin Yu · Zelin He · Ying Sun · Lingzhou Xue · Runze Li
Abstract:
Personalized federated learning offers a flexible framework for aggregating information across distributed clients with heterogeneous data. This work considers a personalized federated learning setting that simultaneously learns global and local models. While purely local training has no communication costs, collaborative learning among the clients can leverage shared knowledge to improve statistical accuracy, presenting an accuracy-communication trade-off in personalized federated learning. However, a theoretical analysis of how personalization quantitatively influences sample and algorithmic efficiency and their inherent trade-off is largely unexplored. In this paper, we address this gap and offer theoretical insights for choosing the personalization degree. The theoretical result is validated on both synthetic and real-world datasets and its generalizability is verified in a non-convex setting.
Paperid:2202
Authors:Tom Jacobs · Chao Zhou · Rebekka Burkholz
Abstract:
Implicit bias plays an important role in explaining how overparameterized models generalize well. Explicit regularization like weight decay is often employed in addition to prevent overfitting. While both concepts have been studied separately, in practice, they often act in tandem. Understanding their interplay is key to controlling the shape and strength of implicit bias, as it can be modified by explicit regularization. To this end, we incorporate explicit regularization into the mirror flow framework and analyze its lasting effects on the geometry of the training dynamics, covering three distinct effects: positional bias, type of bias, and range shrinking. Our analytical approach encompasses a broad class of problems, including sparse coding, matrix sensing, single-layer attention, and LoRA, for which we demonstrate the utility of our insights. To exploit the lasting effect of regularization and highlight the potential benefit of dynamic weight decay schedules, we propose to switch off weight decay during training, which can improve generalization, as we demonstrate in experiments.
Paperid:2203
Authors:Nikita Durasov · Assaf Shocher · Doruk Oner · Gal Chechik · Alexei Efros · Pascal Fua
Abstract:
Deep learning models often struggle when deployed in real-world settings due to distribution shifts between training and test data. While existing approaches like domain adaptation and test-time training (TTT) offer partial solutions, they typically require additional data or domain-specific auxiliary tasks. We present Idempotent Test-Time Training (IT3), a novel approach that enables on-the-fly adaptation to distribution shifts using only the current test instance, without any auxiliary task design. Our key insight is that enforcing idempotence---where repeated applications of a function yield the same result---can effectively replace domain-specific auxiliary tasks used in previous TTT methods. We theoretically connect idempotence to prediction confidence and demonstrate that minimizing the distance between successive applications of our model during inference leads to improved out-of-distribution performance. Extensive experiments across diverse domains (including image classification, aerodynamics prediction, and aerial segmentation) and architectures (MLPs, CNNs, GNNs) show that IT3 consistently outperforms existing approaches while being simpler and more widely applicable. Our results suggest that idempotence provides a universal principle for test-time adaptation that generalizes across domains and architectures.
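The idempotence idea in this abstract can be illustrated with a toy numerical sketch: for an exactly idempotent map (a linear projector), the gap between one and two applications is zero, while a non-idempotent map shows a nonzero gap. The function `idempotence_gap` and both toy maps are hypothetical illustrations of the principle, not the paper's model or training procedure.

```python
import numpy as np

def idempotence_gap(f, x):
    """Distance between one and two applications of f.
    A small gap suggests f behaves idempotently at x (the IT3 signal)."""
    y1 = f(x)
    y2 = f(y1)
    return float(np.linalg.norm(y2 - y1))

# A projection onto a coordinate axis is exactly idempotent: P @ P == P.
P = np.array([[1.0, 0.0],
              [0.0, 0.0]])
proj = lambda v: P @ v

x = np.array([3.0, 4.0])
gap_idempotent = idempotence_gap(proj, x)     # zero for a true projector

# A translation is not idempotent, so repeated application drifts.
shift = lambda v: v + 1.0
gap_drifting = idempotence_gap(shift, x)      # nonzero gap
```

At test time, the paper minimizes this kind of gap over model parameters for the current instance; the sketch only evaluates it.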
Paperid:2204
Authors:John Stewart Fabila-Carrasco · He Sun
Abstract:
Given two weighted graphs $G = (V, E, w_G)$ and $H = (V, F, w_H)$ defined on the same vertex set, the constrained clustering problem seeks to find a subset $S \subset V$ that minimises the cut ratio between $w_G(S, V \setminus S)$ and $w_H(S, V \setminus S)$. In this work, we establish a Cheeger-type inequality that relates the solution of the constrained clustering problem to the spectral properties of $G$ and $H$. To reduce computational complexity, we utilise the signed Laplacian of $H$, streamlining calculations while maintaining accuracy. By solving a generalised eigenvalue problem, our proposed algorithm achieves notable performance improvements, particularly in challenging scenarios where traditional spectral clustering methods struggle. We demonstrate its practical effectiveness through experiments on both synthetic and real-world datasets.
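A minimal numerical sketch of the spectral relaxation described above, assuming the generalised eigenproblem $L_G v = \lambda L_H v$ over the two graph Laplacians and rounding the best non-trivial eigenvector by sign. The ridge regularisation and the sign-rounding step are my assumptions for the demo, not details from the paper (which uses the signed Laplacian of $H$).

```python
import numpy as np

def laplacian(W):
    """Unnormalized graph Laplacian L = D - W for a symmetric weight matrix."""
    return np.diag(W.sum(axis=1)) - W

def constrained_spectral_cut(W_G, W_H, reg=1e-6):
    """Sketch of spectral constrained clustering: relax the cut ratio
    cut_G(S) / cut_H(S) to the generalised eigenproblem L_G v = lambda L_H v
    (a small ridge keeps L_H invertible), then round the second-smallest
    eigenvector by sign into a +/-1 partition."""
    L_G = laplacian(W_G)
    L_H = laplacian(W_H) + reg * np.eye(len(W_H))
    vals, vecs = np.linalg.eig(np.linalg.solve(L_H, L_G))
    order = np.argsort(vals.real)
    v = vecs[:, order[1]].real  # skip the trivial all-ones eigenvector
    return np.sign(v)

# Toy instance: vertices {0,1} and {2,3} are tightly connected in G,
# while the constraint graph H is complete with unit weights.
W_G = np.array([[0.00, 1.00, 0.01, 0.00],
                [1.00, 0.00, 0.00, 0.01],
                [0.01, 0.00, 0.00, 1.00],
                [0.00, 0.01, 1.00, 0.00]])
W_H = np.ones((4, 4)) - np.eye(4)
labels = constrained_spectral_cut(W_G, W_H)
```

On this instance the cut {0,1} vs {2,3} has the smallest ratio, and the rounded eigenvector recovers it.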
Paperid:2205
Authors:Pierre Ablin · Angelos Katharopoulos · Skyler Seto · David Grangier
Abstract:
Machine learning models are routinely trained on a mixture of different data domains. Different domain weights yield very different downstream performances. We propose the Soup-of-Experts, a novel architecture that can instantiate a model at test time for any domain weights with minimal computational cost and without re-training the model. Our architecture consists of a bank of expert parameters, which are linearly combined to instantiate one model. We learn the linear combination coefficients as a function of the input domain weights. To train this architecture, we sample random domain weights, instantiate the corresponding model, and backprop through one batch of data sampled with these domain weights. We demonstrate how our approach quickly obtains small specialized models on several language modeling tasks. Soup-of-Experts are particularly appealing when one needs to ship many different specialist models quickly under a model size constraint.
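The instantiation step described above, linearly combining a bank of expert parameters using coefficients derived from domain weights, can be sketched in a few lines. Here `coeff_map` is a fixed matrix standing in for the learned coefficient function, and the expert bank is random; both are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bank of expert parameter vectors: 3 experts, 4 parameters each.
experts = rng.normal(size=(3, 4))

def instantiate(domain_weights, coeff_map):
    """Soup-of-Experts style instantiation sketch: map input domain weights
    to linear-combination coefficients, then merge the expert bank into a
    single parameter vector. `coeff_map` stands in for the learned map."""
    coeffs = coeff_map @ np.asarray(domain_weights)
    return coeffs @ experts

coeff_map = np.eye(3)  # identity for the demo; the paper learns this map
model_params = instantiate([0.5, 0.3, 0.2], coeff_map)
```

With the identity map, the merged model is simply the domain-weighted average of the experts, which makes the test-time cost of producing a specialist model one small matrix product.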
Paperid:2206
Authors:Jinghan Li · Zhicheng Sun · Yadong Mu
Abstract:
In the endeavor to make autonomous robots take actions, task planning is a major challenge that requires translating high-level task descriptions to long-horizon action sequences. Despite recent advances in language model agents, they remain prone to planning errors and limited in their ability to plan ahead. To address these limitations in robotic planning, we advocate a self-refining scheme that iteratively refines a draft plan until an equilibrium is reached. Remarkably, this process can be optimized end-to-end from an analytical perspective without the need to curate additional verifiers or reward models, allowing us to train self-refining planners in a simple supervised learning fashion. Meanwhile, a nested equilibrium sequence modeling procedure is devised for efficient closed-loop planning that incorporates useful feedback from the environment (or an internal world model). Our method is evaluated on the VirtualHome-Env benchmark, showing advanced performance with improved scaling w.r.t. inference-time computation. Code is available at https://github.com/anonymous-icml-2025/equilibrium-planner.
Paperid:2207
Authors:Yaofo Chen · Zeng You · Shuhai Zhang · Haokun Li · Yirui Li · Yaowei Wang · Mingkui Tan
Abstract:
Transformer-based Large Language Models (LLMs) have exhibited remarkable success in extensive tasks primarily attributed to the self-attention mechanism, which requires a token to consider all preceding tokens as its context to compute attention. However, when the context length L becomes very large (e.g., 128K), the amount of potentially redundant information in the context tends to increase. The redundant context not only hampers the modeling representation performance but also incurs unnecessary computational and storage overhead. In this paper, we propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-context modeling, comprising two complementary modules: 1) a globality-aware pooling module that groups input tokens and dynamically compresses each group into one core token based on their significance. In this way, our method automatically focuses on and strengthens core context while diminishing redundancy during the learning process, leading to effective long-term dependency modeling. 2) A locality-preserving module that incorporates neighboring tokens to preserve local context for detailed representation. Notably, our CCA-Attention is able to replace the self-attention module in existing LLMs with minimal fine-tuning cost. Extensive experimental results show the superiority of our method in both long-context modeling and computational efficiency over state-of-the-art methods.
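The globality-aware pooling module can be sketched numerically: split the sequence into groups and compress each group into one core token via a significance-weighted average. The significance scores and group size are given here as placeholders; in CCA-Attention they are learned and the cores feed into attention, which this sketch omits.

```python
import numpy as np

def core_tokens(tokens, group_size, scores):
    """Globality-aware pooling sketch: compress each group of tokens into
    one core token via a softmax-weighted average of its members.
    `scores` are per-token significance values (assumed given here)."""
    n, _ = tokens.shape
    cores = []
    for start in range(0, n, group_size):
        t = tokens[start:start + group_size]
        s = scores[start:start + group_size]
        w = np.exp(s - s.max())
        w /= w.sum()
        cores.append(w @ t)  # weighted average over the group
    return np.stack(cores)

tokens = np.arange(12, dtype=float).reshape(6, 2)  # 6 tokens, dim 2
scores = np.zeros(6)                               # uniform significance
cores = core_tokens(tokens, group_size=3, scores=scores)
```

With uniform scores each core token is just the group mean; learned scores would instead emphasize the most informative token in each group, shrinking a length-L context to roughly L/group_size cores.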
Paperid:2208
Authors:Mengdi Zhang · Goh Kiat · Peixin Zhang · Jun Sun · Lin Rose · Hongyu Zhang
Abstract:
Despite the success of Large Language Models (LLMs) across various fields, their potential to generate untruthful and harmful responses poses significant risks, particularly in critical applications. This highlights the urgent need for systematic methods to detect and prevent such misbehavior. While existing approaches target specific issues such as harmful responses, this work introduces LLMScan, an innovative LLM monitoring technique based on causality analysis, offering a comprehensive solution. LLMScan systematically monitors the inner workings of an LLM through the lens of causal inference, operating on the premise that the LLM's `brain' behaves differently when generating harmful or untruthful responses. By analyzing the causal contributions of the LLM's input tokens and transformer layers, LLMScan effectively detects misbehavior. Extensive experiments across various tasks and models reveal clear distinctions in the causal distributions between normal behavior and misbehavior, enabling the development of accurate, lightweight detectors for a variety of misbehavior detection tasks.
Paperid:2209
Authors:Terje Mildner · Oliver Hamelijnck · Paris Giampouras · Theodoros Damoulas
Abstract:
We introduce FedGVI, a probabilistic Federated Learning (FL) framework that is provably robust to both prior and likelihood misspecification. FedGVI addresses limitations in both frequentist and Bayesian FL by providing unbiased predictions under model misspecification, with calibrated uncertainty quantification. Our approach generalises previous FL approaches, specifically Partitioned Variational Inference (Ashman et al., 2022), by allowing robust and conjugate updates, decreasing computational complexity at the clients. We offer theoretical analysis in terms of fixed-point convergence, optimality of the cavity distribution, and provable robustness. Additionally, we empirically demonstrate the effectiveness of FedGVI in terms of improved robustness and predictive performance on multiple synthetic and real-world classification datasets.
Paperid:2210
Authors:Fangwen Wu · Lechao Cheng · Shengeng Tang · Xiaofeng Zhu · Chaowei Fang · Dingwen Zhang · Meng Wang
Abstract:
Class-incremental learning (CIL) seeks to enable a model to sequentially learn new classes while retaining knowledge of previously learned ones. Balancing flexibility and stability remains a significant challenge, particularly when the task ID is unknown. To address this, our study reveals that the gap in feature distribution between novel and existing tasks is primarily driven by differences in mean and covariance moments. Building on this insight, we propose a novel semantic drift calibration method that incorporates mean shift compensation and covariance calibration. Specifically, we calculate each class's mean by averaging its sample embeddings and estimate task shifts using weighted embedding changes based on their proximity to the previous mean, effectively capturing mean shifts for all learned classes with each new task. We also apply a Mahalanobis distance loss for covariance calibration, aligning class-specific embedding covariances between old and current networks to mitigate the covariance shift. Additionally, we integrate a feature-level self-distillation approach to enhance generalization. Comprehensive experiments on commonly used datasets demonstrate the effectiveness of our approach.
Paperid:2211
Authors:Arnas Uselis · Andrea Dittadi · Seong Joon Oh
Abstract:
Compositional generalization is crucial for human intelligence, yet it remains unclear whether scaling data and model size can solve this challenge, particularly for vision models. We study how data scaling affects compositional generalization in a simplified visual setting. Our key finding is that while models can achieve compositional generalization, this ability critically depends on data diversity. Models develop compositional structure in their latent space only when trained with diverse data, otherwise failing to learn compositional representations despite achieving discrimination. We show that high data diversity leads to linear concept representations, which we demonstrate enables efficient compositional learning. Analyzing large-scale pretrained models through this framework reveals mixed results, suggesting compositional generalization remains challenging.
Paperid:2212
Authors:Vitaly Feldman · Audra McMillan · Guy Rothblum · Kunal Talwar
Abstract:
Pan-privacy was proposed by Dwork et al. (2010) as an approach to designing a private analytics system that retains its privacy properties in the face of intrusions that expose the system's internal state. Motivated by federated telemetry applications, we study {\em local pan-privacy}, where privacy should be retained under repeated unannounced intrusions {\em on the local state}. We consider the problem of monitoring the count of an event in a federated system, where event occurrences on a local device should be hidden even from an intruder on that device. We show that under reasonable constraints, the goal of providing information-theoretic differential privacy under intrusion is incompatible with collecting telemetry information. We then show that this problem can be solved in a scalable way using standard cryptographic primitives.
Paperid:2213
Authors:Zhongming Yu · Hejia Zhang · Yujie Zhao · Hanxian Huang · Matrix Yao · Ke Ding · Jishen Zhao
Abstract:
Recent developments in Large Language Model (LLM) agents are revolutionizing Autonomous Software Engineering (ASE), enabling automated coding, problem fixes, and feature improvements. However, localization -- precisely identifying software problems by navigating to relevant code sections -- remains a significant challenge. Current approaches often yield suboptimal results due to a lack of effective integration between LLM agents and precise code search mechanisms. This paper introduces OrcaLoca, an LLM agent framework that improves accuracy for software issue localization by integrating priority-based scheduling for LLM-guided actions, action decomposition with relevance scoring, and distance-aware context pruning. Experimental results demonstrate that OrcaLoca becomes the new open-source state-of-the-art (SOTA) in function match rate (65.33%) on SWE-bench Lite. It also improves the final resolved rate of an open-source framework by 6.33 percentage points through its patch generation integration.
Paperid:2214
Authors:Tyler Ingebrand · Adam Thorpe · Ufuk Topcu
Abstract:
A central challenge in transfer learning is designing algorithms that can quickly adapt and generalize to new tasks without retraining. Yet, the conditions under which algorithms can effectively transfer to new tasks are poorly characterized. We introduce a geometric characterization of transfer in Hilbert spaces and define three types of inductive transfer: interpolation within the convex hull, extrapolation to the linear span, and extrapolation outside the span. We propose a method grounded in the theory of function encoders to achieve all three types of transfer. Specifically, we introduce a novel training scheme for function encoders using least-squares optimization, prove a universal approximation theorem for function encoders, and provide a comprehensive comparison with existing approaches such as transformers and meta-learning on four diverse benchmarks. Our experiments demonstrate that the function encoder outperforms state-of-the-art methods on four benchmark tasks and on all three types of transfer.
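The least-squares step at the heart of a function encoder can be sketched with a fixed polynomial basis: a new function is represented by the coefficients that best reproduce it from basis evaluations. The basis here is hand-picked for the demo; the method learns its basis functions, which this sketch does not show.

```python
import numpy as np

def encode(basis_values, target_values):
    """Least-squares coefficients representing a new function in a given
    basis -- the function-encoder idea, sketched with a fixed basis."""
    coeffs, *_ = np.linalg.lstsq(basis_values, target_values, rcond=None)
    return coeffs

# Hypothetical basis {1, x, x^2} evaluated on sample points.
x = np.linspace(-1.0, 1.0, 50)
basis = np.stack([np.ones_like(x), x, x**2], axis=1)

# A target function lying in the span: "interpolation/extrapolation to the
# linear span" in the abstract's terminology, so recovery is exact.
target = 2.0 - 3.0 * x + 0.5 * x**2
coeffs = encode(basis, target)
```

Functions inside the span are recovered exactly from their coefficients; functions outside the span (the third transfer type) would leave a nonzero residual, which is what the geometric characterization classifies.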
Paperid:2215
Authors:Benjamin Plaut · Hanlin Zhu · Stuart Russell
Abstract:
Most learning algorithms with formal regret guarantees assume that all mistakes are recoverable and essentially rely on trying all possible behaviors. This approach is problematic when some mistakes are catastrophic, i.e., irreparable. We propose an online learning problem where the goal is to minimize the chance of catastrophe. Specifically, we assume that the payoff in each round represents the chance of avoiding catastrophe that round and try to maximize the product of payoffs (the overall chance of avoiding catastrophe) while allowing a limited number of queries to a mentor. We first show that in general, any algorithm either constantly queries the mentor or is nearly guaranteed to cause catastrophe. However, in settings where the mentor policy class is learnable in the standard online model, we provide an algorithm whose regret and rate of querying the mentor both approach 0 as the time horizon grows. Conceptually, if a policy class is learnable in the absence of catastrophic risk, it is learnable in the presence of catastrophic risk if the agent can ask for help.
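The product-of-payoffs objective is what makes this setting unforgiving, and a tiny numeric example makes that concrete: one near-zero round destroys the overall survival probability no matter how good the other rounds are. The payoff values below are invented for illustration.

```python
import numpy as np

# Each round's payoff is the chance of avoiding catastrophe that round;
# the overall chance of survival is the PRODUCT of the payoffs, not a sum,
# so a single bad round is fatal to the objective.
payoffs_safe = np.array([0.99, 0.98, 0.99, 0.97])
payoffs_risky = np.array([0.99, 0.98, 0.05, 0.97])  # one catastrophic round

survival_safe = payoffs_safe.prod()    # high: every round was safe
survival_risky = payoffs_risky.prod()  # collapses due to round 3
```

This is why additive regret bounds (which tolerate a few large mistakes) are the wrong yardstick here, and why the paper instead bounds the multiplicative objective while budgeting mentor queries.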
Paperid:2216
Authors:Fabian Spaeh · Atsushi Miyauchi
Abstract:
Maximizing a single submodular set function subject to a cardinality constraint is a well-studied and central topic in combinatorial optimization. However, finding a set that maximizes multiple functions at the same time is much less understood, even though it is a formulation which naturally occurs in robust maximization or problems with fairness considerations such as fair influence maximization or fair allocation. In this work, we consider the problem of maximizing the minimum over many submodular functions subject to a cardinality constraint, which is known as multi-objective submodular maximization. All known polynomial-time approximation algorithms either obtain a weak approximation guarantee or rely on the evaluation of the multilinear extension. The latter is expensive to evaluate and renders such algorithms impractical. We bridge this gap and introduce the first scalable and practical algorithm that obtains the best-known approximation guarantee. We furthermore introduce a novel application, fair centrality maximization, and show how it can be addressed via multi-objective submodular maximization. In our experimental evaluation, we show that our algorithm outperforms known algorithms in terms of objective value and running time.
Paperid:2217
Authors:Chengmei Niu · Zhenyu Liao · Zenan Ling · Michael Mahoney
Abstract:
A substantial body of work in machine learning (ML) and randomized numerical linear algebra (RandNLA) has exploited various sorts of random sketching methodologies, including random sampling and random projection, with much of the analysis using Johnson-Lindenstrauss and subspace embedding techniques. Recent studies have identified the issue of inversion bias -- the phenomenon that inverses of random sketches are not unbiased, despite the unbiasedness of the sketches themselves. This bias presents challenges for the use of random sketches in various ML pipelines, such as fast stochastic optimization, scalable statistical estimators, and distributed optimization. In the context of random projection, the inversion bias can be easily corrected for dense Gaussian projections (which are, however, too expensive for many applications). Recent work has shown how the inversion bias can be corrected for sparse sub-gaussian projections. In this paper, we show how the inversion bias can be corrected for random sampling methods, both uniform and non-uniform leverage-based, as well as for structured random projections, including those based on the Hadamard transform. Using these results, we establish problem-independent local convergence rates for sub-sampled Newton methods.
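The inversion-bias phenomenon the abstract refers to already shows up in one dimension, by Jensen's inequality: a noisy estimate can be unbiased while its reciprocal is not. The one-dimensional setup below is a deliberately simplified stand-in for matrix sketches, not the paper's estimators or corrections.

```python
import numpy as np

rng = np.random.default_rng(0)

# A "sketch" s of the positive quantity a = 2 that is unbiased, E[s] = a,
# built from uniform noise. Its inverse is biased upward: E[1/s] > 1/a by
# Jensen's inequality. The same effect afflicts inverses of unbiased
# random sketches of matrices.
a = 2.0
s = a + rng.uniform(-1.0, 1.0, size=200_000)

mean_s = s.mean()            # close to a = 2
mean_inv = (1.0 / s).mean()  # noticeably above 1/a = 0.5
```

Here E[1/(2+U)] with U ~ Unif(-1, 1) equals ln(3)/2 ≈ 0.549, so the inverse overshoots 1/a = 0.5 by roughly 10%; the paper's contribution is correcting the analogous bias for sampling and structured-projection sketches.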
Paperid:2218
Authors:Francesco Cagnetta · Hyunmo Kang · Matthieu Wyart
Abstract:
Recent theories suggest that Neural Scaling Laws arise whenever the task is linearly decomposed into units that are power-law distributed. Alternatively, scaling laws also emerge when data exhibit a hierarchically compositional structure, as is thought to occur in language and images. To unify these views, we consider classification and next-token prediction tasks based on probabilistic context-free grammars—probabilistic models that generate data via a hierarchy of production rules. For classification, we show that having power-law distributed production rules results in a power-law learning curve with an exponent depending on the rules’ distribution and a large multiplicative constant that depends on the hierarchical structure. By contrast, for next-token prediction, the distribution of production rules controls the fine details of the learning curve, but not the exponent describing the large-scale behaviour.
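The "power-law-distributed units" picture can be made concrete with a bag-of-units toy model: if unit frequencies follow a power law, the expected error (the total frequency of units unseen in n samples) decays steadily with n. This simplification drops the grammatical hierarchy that the paper actually analyses; the exponent 1.5 and the unit count are arbitrary choices for the demo.

```python
import numpy as np

# K task units with power-law frequencies p_i ~ i^(-1.5), normalised.
K = 1000
p = 1.0 / np.arange(1, K + 1) ** 1.5
p /= p.sum()

def expected_error(n):
    """Expected probability mass of units still unseen after n i.i.d.
    training samples -- a toy proxy for test error in the units picture."""
    return float(np.sum(p * (1.0 - p) ** n))

errs = [expected_error(n) for n in (10, 100, 1000)]
```

Plotting `expected_error` against n on log-log axes would show the approximately straight line characteristic of a neural scaling law, with a slope tied to the frequency exponent.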
Paperid:2219
Authors:Guibin Zhang · Yanwei Yue · Xiangguo Sun · Guancheng Wan · Miao Yu · Junfeng Fang · Kun Wang · Tianlong Chen · Dawei Cheng
Abstract:
Recent advancements in large language model (LLM)-based agents have demonstrated that collective intelligence can significantly surpass the capabilities of individual agents, primarily due to well-crafted inter-agent communication topologies. Despite the diverse and high-performing designs available, practitioners often face confusion when selecting the most effective pipeline for their specific task: \textit{Which topology is the best choice for my task, avoiding unnecessary communication token overhead while ensuring high-quality solutions?} In response to this dilemma, we introduce G-Designer, an adaptive, efficient, and robust solution for multi-agent deployment, which dynamically designs task-aware, customized communication topologies. Specifically, G-Designer models the multi-agent system as a multi-agent network, leveraging a variational graph auto-encoder to encode both the nodes (agents) and a task-specific virtual node, and decodes a task-adaptive and high-performing communication topology. Extensive experiments on six benchmarks showcase that G-Designer is: \textbf{(1) high-performing}, achieving superior results on MMLU with accuracy at $84.50\%$ and on HumanEval with pass@1 at $89.90\%$; \textbf{(2) task-adaptive}, architecting communication protocols tailored to task difficulty, reducing token consumption by up to $95.33\%$ on HumanEval; and \textbf{(3) adversarially robust}, defending against agent adversarial attacks with merely a $0.3\%$ accuracy drop.
Paperid:2220
Authors:Tomoharu Iwata · Shinsaku Sakaue
Abstract:
We propose a data-driven method for reducing the dimensionality of linear programming problems (LPs) by generating instance-specific projection matrices using a neural network-based model. Once the model is trained using multiple LPs by maximizing the expected objective value, we can efficiently find high-quality feasible solutions of newly given LPs. Our method can shorten the computational time of any LP solver due to its solver-agnostic nature; it can provide feasible solutions by relying on a projection that reduces the number of variables; and it can handle LPs of different sizes using neural networks with permutation equivariance and invariance. We also provide a theoretical analysis of the generalization bound for learning a neural network to generate projection matrices that reduce the size of LPs. Our experimental results demonstrate that our method can obtain solutions with higher quality than the existing methods, while its computational time is significantly shorter than solving the original LPs.
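The reduction step behind this approach can be sketched without any solver: substituting x = P y with a nonnegative projection matrix P turns an LP in x into a smaller LP in y, and any feasible y lifts back to a feasible x with the same objective value. The data and the hand-built P below are illustrative; the paper learns instance-specific P with a neural network and hands the reduced LP to an off-the-shelf solver.

```python
import numpy as np

# Original LP: min c^T x  s.t.  A x <= b,  x >= 0  (4 variables).
A = np.array([[1.0, 2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0, 1.0]])
b = np.array([4.0, 6.0])
c = np.array([-1.0, -2.0, -1.0, -1.0])

# Nonnegative projection tying variables together: 4 variables -> 2.
P = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])

# Reduced LP data: min (P^T c)^T y  s.t.  (A P) y <= b,  y >= 0.
A_red, c_red = A @ P, P.T @ c

y = np.array([1.0, 1.0])  # a feasible point of the reduced LP
x = P @ y                 # lift back to the original variables

feasible = bool(np.all(A @ x <= b + 1e-9) and np.all(x >= 0))
objective_match = bool(np.isclose(c @ x, c_red @ y))
```

Because P is nonnegative, y >= 0 implies x = P y >= 0, and A x = (A P) y <= b carries over directly, which is why solving the small LP yields a genuine feasible solution of the original one.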
Paperid:2221
Authors:Aman Gupta · Shao Tang · Qingquan Song · Sirou Zhu · Jiwoo Hong · Ankan Saha · Viral Gupta · Noah Lee · Eunki Kim · Siyu Zhu · Parag Agrawal · Natesh Pillai · Sathiya Keerthi
Abstract:
Reinforcement Learning with Human Feedback (RLHF) and its variants have made huge strides toward the effective alignment of large language models (LLMs) to follow instructions and reflect human values. More recently, Direct Alignment Algorithms (DAAs) have emerged in which the reward modeling stage of RLHF is skipped by characterizing the reward directly as a function of the policy being learned. Examples include Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO). These methods often suffer from likelihood displacement, a phenomenon by which the probabilities of preferred responses are often reduced undesirably. In this paper, we argue that, for DAAs, the reward (function) shape matters. We introduce \textbf{AlphaPO}, a new DAA method that leverages an $\alpha$-parameter to help change the shape of the reward function beyond the standard log reward. AlphaPO helps maintain fine-grained control over likelihood displacement and over-optimization. Compared to SimPO, one of the best-performing DAAs, AlphaPO leads to about 7\% to 10\% relative improvement in alignment performance for the instruct versions of Mistral-7B and Llama3-8B. The analysis and results presented highlight the importance of the reward shape, and how one can systematically change it to affect training dynamics, as well as improve alignment performance.
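One natural way a single parameter can bend a reward curve beyond the standard log reward is the Box-Cox style family (x^α - 1)/α, which recovers log x in the limit α → 0. This is a sketch of the general idea of an α-parameterised reward shape, and is not claimed to be AlphaPO's exact parameterisation.

```python
import numpy as np

def alpha_reward(likelihood_ratio, alpha):
    """Box-Cox style alpha-family of reward shapes: (x^a - 1)/a, which
    tends to log(x) as a -> 0. Illustrates how one parameter reshapes the
    reward curve; not necessarily the paper's exact formula."""
    x = np.asarray(likelihood_ratio, dtype=float)
    if abs(alpha) < 1e-12:
        return np.log(x)       # the standard log reward as the limit
    return (x**alpha - 1.0) / alpha

x = np.array([0.5, 1.0, 2.0])
r_log = alpha_reward(x, 0.0)   # standard log reward
r_bent = alpha_reward(x, 0.5)  # same zero-crossing, different curvature
```

Every member of the family vanishes at a likelihood ratio of 1 and is increasing, but the curvature away from 1 (and hence the gradient pressure on preferred vs. dispreferred responses) changes with α, which is the kind of knob the abstract argues matters.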
Paperid:2222
Authors:Ilias Diakonikolas · Giannis Iakovidis · Daniel Kane · Thanasis Pittas
Abstract:
We study the algorithmic problem of robust mean estimation of an identity covariance Gaussian in the presence of mean-shift contamination. In this contamination model, we are given a set of points in $\mathbb{R}^d$ generated i.i.d. via the following process. For a parameter $\alpha<1/2$, the $i$-th sample $x_i$ is obtained as follows: with probability $1-\alpha$, $x_i$ is drawn from $\mathcal{N}(\mu, I)$, where $\mu \in \mathbb{R}^d$ is the target mean; and with probability $\alpha$, $x_i$ is drawn from $\mathcal{N}(z_i, I)$, where $z_i$ is unknown and potentially arbitrary. Prior work characterized the information-theoretic limits of this task. Specifically, it was shown that -- in contrast to Huber contamination -- in the presence of mean-shift contamination consistent estimation is possible. On the other hand, all known robust estimators in the mean-shift model have running times exponential in the dimension. Here we give the first computationally efficient algorithm for high-dimensional robust mean estimation with mean-shift contamination that can tolerate a constant fraction of outliers. In particular, our algorithm has near-optimal sample complexity, runs in sample-polynomial time, and approximates the target mean to any desired accuracy. Conceptually, our result contributes to a growing body of work that studies inference with respect to natural noise models lying in between fully adversarial and random settings.
Paperid:2223
Authors:Lennart Schneider · Martin Wistuba · Aaron Klein · Jacek Golebiowski · Giovanni Zappella · Felice Antonio Merra
Abstract:
Optimal prompt selection is crucial for maximizing large language model (LLM) performance on downstream tasks, especially in black-box settings where models are only accessible via APIs. Black-box prompt selection is challenging due to potentially large, combinatorial search spaces, lack of gradient information, and the high evaluation cost of prompts on a validation set. We propose HbBoPs, a novel method that combines a structure-aware deep kernel Gaussian Process with Hyperband as a multi-fidelity scheduler to efficiently select prompts. HbBoPs uses embeddings of instructions and few-shot exemplars, treating them as modular components within prompts. This enhances the surrogate model's ability to predict which prompt to evaluate next in a sample-efficient manner. Hyperband improves query-efficiency by adaptively allocating resources across different fidelity levels, reducing the number of validation instances required for evaluating prompts. Extensive experiments across ten diverse benchmarks and three LLMs demonstrate that HbBoPs outperforms state-of-the-art methods in both performance and efficiency.
Paperid:2224
Authors:Wen Lai · Alexander Fraser · Ivan Titov
Abstract:
Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, are commonly used to adapt LLMs. However, the effectiveness of standard PEFT methods is limited in low-resource scenarios with only a few hundred examples. Recent advances in interpretability research have inspired the emergence of activation editing techniques, which modify the activations of specific model components. These methods, due to their extremely small parameter counts, show promise for small datasets. However, their performance is highly dependent on identifying the correct modules to edit and often lacks stability across different datasets. In this paper, we propose Joint Localization and Activation Editing (JoLA), a method that jointly learns (1) which heads in the Transformer to edit, (2) whether the intervention should be additive, multiplicative, or both, and (3) the intervention parameters themselves -- the vectors applied as additive offsets or multiplicative scalings to the head output. Through evaluations on three benchmarks spanning commonsense reasoning, natural language understanding, and natural language generation, we demonstrate that JoLA consistently outperforms existing methods. The code for the method is released at https://anonymous.4open.science/r/jola-42C5
Paperid:2225
Authors:Elias Kempf · Simon Schrodi · Max Argus · Thomas Brox
Abstract:
The remarkable generalization performance of contrastive vision-language models like CLIP is often attributed to the diversity of their training distributions. However, key questions remain unanswered: Can CLIP generalize to an entirely unseen domain when trained on a diverse mixture of domains (domain generalization)? Can it generalize to unseen classes within partially seen domains (compositional generalization)? What factors affect such generalization? To answer these questions, we trained CLIP models on systematically constructed training distributions with controlled domain diversity and object class exposure. Our experiments show that domain diversity is essential for both domain and compositional generalization, yet compositional generalization can be surprisingly weaker than domain generalization when the training distribution contains a suboptimal subset of the test domain. Through data-centric and mechanistic analyses, we find that successful generalization requires learning of shared representations already in intermediate layers and shared circuitry.
Paperid:2226
Authors:Jintao Tong · Yixiong Zou · Guangyao Chen · Yuhua Li · Ruixuan Li
Abstract:
Cross-Domain Few-Shot Segmentation (CD-FSS) aims to transfer knowledge from a large-scale source-domain dataset to unseen target-domain datasets with limited annotated samples. Current methods typically compare the distance between training and testing samples for mask prediction. However, we find that an entanglement problem exists in this widely adopted method, which tends to bind source-domain patterns together and make each of them hard to transfer. In this paper, we aim to address this problem for the CD-FSS task. We first find a natural decomposition of the ViT structure, based on which we delve into the entanglement problem for an interpretation. We find the decomposed ViT components are crossly compared between images in distance calculation, where the rational comparisons are entangled with the meaningless ones by their equal importance, leading to the entanglement problem. Based on this interpretation, we further propose to address the entanglement problem by learning to weigh all comparisons of ViT components, which learns disentangled features and re-composes them for the CD-FSS task, benefiting both generalization and finetuning. Experiments show that our model outperforms the state-of-the-art CD-FSS method by 1.92% and 1.88% in average accuracy under 1-shot and 5-shot settings, respectively.
Paperid:2227
Authors:Shanshan Luo · Yu yixuan · Chunchen LIU · Feng Xie · zhi geng
Abstract:
Previous studies have extensively addressed the attribution problem for binary outcome variables. However, in many practical scenarios, the outcome variable is continuous, and simply binarizing it may result in information loss or biased conclusions. To address this issue, we propose a series of posterior causal estimands for retrospectively evaluating multiple correlated causes from a continuous outcome. These estimands include posterior intervention effects, posterior total causal effects, and posterior natural direct effects. Under assumptions of sequential ignorability, monotonicity, and perfect positive rank, we show that the posterior causal estimands of interest are identifiable and present the corresponding identification equations. We also provide a simple but effective estimation procedure and establish asymptotic properties of the proposed estimators. An artificial hypertension example and a real developmental toxicity dataset are employed to illustrate our method.
Paperid:2228
Authors:Vicente Balmaseda · Bokun Wang · Lin · Tianbao Yang
Abstract:
In self-supervised contrastive learning, negative pairs are typically constructed using an anchor image and a sample drawn from the entire dataset, excluding the anchor. However, this approach can result in the creation of negative pairs with similar semantics, referred to as "false negatives", leading to their embeddings being falsely pushed apart. To address this issue, we introduce GloFND, an optimization-based approach that automatically learns on the fly the threshold for each anchor to identify its false negatives during training. In contrast to previous methods for false negative discovery, our approach globally detects false negatives across the entire dataset rather than locally within the mini-batch. Moreover, its per-iteration computation cost remains independent of the dataset size. Experimental results on image and image-text data demonstrate the effectiveness of the proposed method.
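For illustration, the filtering step described in this abstract can be sketched as follows. This is a minimal sketch, not the paper's method: GloFND learns each anchor's threshold on the fly via optimization, whereas here `threshold` is a fixed hypothetical value and similarity is plain cosine.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def filter_false_negatives(anchor, candidates, threshold):
    """Split candidate negatives into kept negatives and suspected
    false negatives, using a per-anchor similarity threshold."""
    kept, false_neg = [], []
    for c in candidates:
        (false_neg if cosine(anchor, c) >= threshold else kept).append(c)
    return kept, false_neg

# toy embeddings: the first candidate is semantically close to the anchor
anchor = [1.0, 0.0]
cands = [[0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
kept, fn = filter_false_negatives(anchor, cands, threshold=0.9)
```

Candidates above the anchor's threshold would be excluded from (or down-weighted in) the contrastive loss rather than pushed apart.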
Paperid:2229
Authors:Yujun Kim · Jaeyoung Cha · Chulhee Yun
Abstract:
Recent theoretical results demonstrate that the convergence rates of permutation-based SGD (e.g., random reshuffling SGD) are faster than uniform-sampling SGD; however, these studies focus mainly on the large epoch regime, where the number of epochs $K$ exceeds the condition number $\kappa$. In contrast, little is known when $K$ is smaller than $\kappa$, and it is still a challenging open question whether permutation-based SGD can converge faster in this small epoch regime (Safran and Shamir, 2021). As a step toward understanding this gap, we study the naive deterministic variant, Incremental Gradient Descent (IGD), on smooth and strongly convex functions. Our lower bounds reveal that for the small epoch regime, IGD can exhibit surprisingly slow convergence even when all component functions are strongly convex. Furthermore, when some component functions are allowed to be nonconvex, we prove that the optimality gap of IGD can be significantly worse throughout the small epoch regime. Our analyses reveal that the convergence properties of permutation-based SGD in the small epoch regime may vary drastically depending on the assumptions on component functions. Lastly, we supplement the paper with tight upper and lower bounds for IGD in the large epoch regime.
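For reference, the IGD update studied here applies each component gradient once per epoch, in a fixed deterministic order. The quadratic components and step size below are illustrative, not from the paper; the example also exhibits the cyclic bias of a fixed ordering, since the fixed point of one epoch is not the minimizer of the average objective.

```python
def igd(grads, x0, lr, epochs):
    """Incremental Gradient Descent: each epoch sweeps the component
    gradients once, in a fixed deterministic order (no random sampling)."""
    x = x0
    for _ in range(epochs):
        for g in grads:
            x = x - lr * g(x)
    return x

# two strongly convex components f_i(x) = 0.5 * (x - c_i)^2 with c = +1, -1;
# the minimizer of the average objective is 0, yet IGD's one-epoch fixed
# point solves x = 0.81 x - 0.01, i.e. x* = -0.01 / 0.19
grads = [lambda x: x - 1.0, lambda x: x + 1.0]
x_final = igd(grads, x0=5.0, lr=0.1, epochs=50)
```

The bias vanishes as the step size shrinks, but it illustrates why deterministic orderings need an analysis separate from uniform sampling.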
Paperid:2230
Authors:Shuaicheng Niu · Guohao Chen · Peilin Zhao · Tianyi Wang · Pengcheng Wu · Zhiqi Shen
Abstract:
In this paper, we seek to develop a versatile test-time adaptation (TTA) objective for a variety of tasks — classification and regression across image-, object-, and pixel-level predictions. We achieve this through a self-bootstrapping scheme that optimizes prediction consistency between the test image (as target) and its deteriorated view. The key challenge lies in devising effective augmentations/deteriorations that: i) preserve the image’s geometric information, e.g., object sizes and locations, which is crucial for TTA on object/pixel-level tasks, and ii) provide sufficient learning signals for TTA. To this end, we analyze how common distribution shifts affect the image's information power across spatial frequencies in the Fourier domain, and reveal that low-frequency components carry high power, so masking these components supplies more learning signals, while masking high-frequency components does not. In light of this, we randomly mask the low-frequency amplitude of an image in its Fourier domain for augmentation. Meanwhile, we also augment the image with noise injection to compensate for missing learning signals at high frequencies, by enhancing the information power there. Experiments show that, either independently or as a plug-and-play module, our method achieves superior results across classification, segmentation, and 3D monocular detection tasks with both transformer and CNN models. Code will be released.
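A 1D sketch of the augmentation described above (the paper operates on 2D image spectra): mask low-frequency amplitudes in the Fourier domain, then inject noise to supply signal at high frequencies. The naive O(n^2) DFT keeps the sketch self-contained; `cutoff` and `noise_std` are illustrative parameters, not the paper's.

```python
import cmath, math, random

def dft(x):
    """Naive discrete Fourier transform of a real-valued sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse DFT; returns the real part of each reconstructed sample."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def mask_low_freq(x, cutoff, noise_std=0.0, seed=0):
    """Zero out low-frequency amplitudes (|freq| < cutoff, DC included),
    then optionally inject Gaussian noise in the signal domain."""
    n = len(x)
    X = dft(x)
    for k in range(n):
        freq = min(k, n - k)          # frequencies wrap around the spectrum
        if freq < cutoff:
            X[k] = 0.0
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, noise_std) for v in idft(X)]

x = [math.sin(2 * math.pi * t / 8) for t in range(8)]
y = mask_low_freq(x, cutoff=2, noise_std=0.3)
```

With `cutoff=2`, a pure frequency-1 sine is removed entirely; the noise term re-populates high frequencies, mirroring the paper's two complementary augmentations.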
Paperid:2231
Authors:Gwen Yidou-Weng · Benjie Wang · Guy Van den Broeck
Abstract:
As large language models (LLMs) advance, there is an increasing need to control their outputs to align with human values (e.g., detoxification) or desired attributes (e.g., personalization, topic). However, autoregressive models focus on next-token predictions and struggle with global properties that require looking ahead. Existing solutions either tune or post-train LMs for each new attribute—expensive and inflexible—or approximate the Expected Attribute Probability (EAP) of future sequences by sampling or training, which is slow and unreliable for rare attributes. We introduce TRACE (Tractable Probabilistic Reasoning for Adaptable Controllable Generation), a novel framework that efficiently predicts EAP and adapts to new attributes through tractable generation and lightweight control. TRACE distills a Hidden Markov Model (HMM) from a language model and pairs it with a small classifier to score EAP, reweighting next-token probabilities for globally compliant continuations. Empirically, TRACE achieves state-of-the-art results in detoxification with negligible decoding overhead, adapts to 76 low-resource personalized LLMs within seconds, and seamlessly extends to composite attributes. Our findings underscore TRACE’s strong performance, efficiency, flexibility, and compositionality for ensuring global attribute satisfaction.
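The control step can be illustrated as a simple reweighting. This sketch assumes per-token EAP scores are already available (in TRACE they come from the distilled HMM paired with a small classifier); the token names and numbers are invented for illustration.

```python
def reweight_next_token(probs, eap):
    """Reweight next-token probabilities by each continuation's expected
    attribute probability: p'(x) proportional to p(x) * EAP(x)."""
    scored = {tok: p * eap[tok] for tok, p in probs.items()}
    z = sum(scored.values())
    return {tok: s / z for tok, s in scored.items()}

probs = {"kind": 0.4, "rude": 0.6}   # hypothetical LM next-token distribution
eap = {"kind": 0.9, "rude": 0.1}     # hypothetical attribute scores
new = reweight_next_token(probs, eap)
```

Tokens whose continuations are likely to satisfy the global attribute are boosted, steering generation without retraining the LM.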
Paperid:2232
Authors:Yuta Tarumi · Keisuke Fukuda · Shin-ichi Maeda
Abstract:
Data assimilation for nonlinear state space models (SSMs) is inherently challenging due to non-Gaussian posteriors. We propose Deep Bayesian Filtering (DBF), a novel approach to data assimilation in nonlinear SSMs. DBF introduces latent variables $h_t$ in addition to physical variables $z_t$, ensuring Gaussian posteriors by (i) constraining state transitions in the latent space to be linear and (ii) learning a Gaussian inverse observation operator $r(h_t|o_t)$. This structured posterior design enables analytical recursive computation, avoiding the accumulation of Monte Carlo sampling errors over time steps. DBF optimizes these operators and other latent SSM parameters by maximizing the evidence lower bound. Experiments demonstrate that DBF outperforms existing methods in scenarios with highly non-Gaussian posteriors.
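The analytic recursion that DBF's linear-latent, Gaussian-observation design enables can be illustrated in the simplest scalar linear-Gaussian case, where it reduces to the Kalman filter; the transition and noise parameters below are illustrative.

```python
def kalman_step(mean, var, a, q, r, obs):
    """One linear-Gaussian recursion: predict through h' = a*h + noise(q),
    then update with observation obs = h' + noise(r). Closed form, no sampling."""
    # predict
    m_pred = a * mean
    v_pred = a * a * var + q
    # update (Kalman gain)
    k = v_pred / (v_pred + r)
    m_post = m_pred + k * (obs - m_pred)
    v_post = (1.0 - k) * v_pred
    return m_post, v_post

# filter three noisy observations of a roughly constant latent state
m, v = 0.0, 1.0
for obs in [1.2, 0.9, 1.1]:
    m, v = kalman_step(m, v, a=1.0, q=0.01, r=0.25, obs=obs)
```

Because every step is a closed-form Gaussian update, no Monte Carlo error accumulates over time, which is the property the abstract highlights.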
Paperid:2233
Authors:Nuojin Cheng · Leonard Papenmeier · Stephen Becker · Luigi Nardi
Abstract:
Bayesian optimization is a widely used method for optimizing expensive black-box functions, with Expected Improvement (EI) being one of the most commonly used acquisition functions. In contrast, information-theoretic acquisition functions aim to reduce uncertainty about the function’s optimum and are often considered fundamentally distinct from EI. In this work, we challenge this prevailing perspective by introducing a unified theoretical framework, Variational Entropy Search, which reveals that EI and information-theoretic acquisition functions are more closely related than previously recognized. We demonstrate that EI can be interpreted as a variational inference approximation of the popular information-theoretic acquisition function Max-value Entropy Search (MES). Building on this insight, we propose VES-Gamma, a novel acquisition function that balances the strengths of EI and MES. Extensive empirical evaluations across both low- and high-dimensional synthetic and real-world benchmarks demonstrate that VES-Gamma is competitive with state-of-the-art acquisition functions and in many cases outperforms EI and MES.
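For reference, the closed-form EI under a Gaussian posterior, which the paper reinterprets as a variational approximation of MES; a minimal stdlib implementation (maximization convention, optional exploration offset `xi`):

```python
import math

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best, xi=0.0):
    """Closed-form EI for a Gaussian posterior N(mu, sigma^2):
    EI = (mu - best - xi) * Phi(z) + sigma * phi(z), z = (mu - best - xi)/sigma."""
    if sigma <= 0.0:
        return max(mu - best - xi, 0.0)
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * norm_cdf(z) + sigma * norm_pdf(z)
```

At `mu == best` with unit variance, EI reduces to the normal density at zero; EI grows with the posterior mean, which is the exploitation term the framework relates to MES.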
Paperid:2234
Authors:Mingyu Xu · Bingning Wang · Weipeng Chen
Abstract:
Current large language models (LLMs) predominantly rely on decode-only transformer architectures, which exhibit exceptional in-context learning (ICL) capabilities. It is widely acknowledged that the cornerstone of their ICL ability lies in the induction heads mechanism, which necessitates at least two layers of attention. To more effectively harness the model's induction capabilities, we revisit the induction heads mechanism and provide theoretical proof that KV shifting attention reduces the model's dependency on the depth and width of the induction heads mechanism. Our experimental results confirm that KV shifting attention enhances the learning of induction heads and improves language modeling performance. This leads to superior performance or accelerated convergence, spanning from toy models to pre-trained models with over 10 billion parameters.
Paperid:2235
Authors:Noam Levi · Yaron Oz
Abstract:
We study universal properties in real-world complex and synthetically generated datasets. Our approach is to analogize data to a physical system and employ tools from statistical physics and Random Matrix Theory (RMT) to reveal their underlying structure. Examining the local and global eigenvalue statistics of feature-feature covariance matrices, we find: (i) bulk eigenvalue power-law scaling vastly differs between uncorrelated Gaussian and real-world data, (ii) this power-law behavior is reproducible using Gaussian data with long-range correlations, (iii) all dataset types exhibit chaotic RMT universality, (iv) RMT statistics emerge at smaller dataset sizes than typical training sets, correlating with power-law convergence, (v) Shannon entropy correlates with RMT structure and requires fewer samples in strongly correlated datasets. These results suggest natural image Gram matrices can be approximated by Wishart random matrices with simple covariance structure, enabling rigorous analysis of neural network behavior.
Paperid:2236
Authors:Quan Zhou · Changhua Pei · Fei Sun · Jianhui LI · haiming zhang · Gaogang Xie · Dan Pei · Zhengwei Gao · HanJing
Abstract:
Time series anomaly detection (TSAD) underpins real-time monitoring in cloud services and web systems, allowing rapid identification of anomalies to prevent costly failures. Most TSAD methods driven by forecasting models tend to overfit by emphasizing minor fluctuations. Our analysis reveals that effective TSAD should focus on modeling “normal” behavior through smooth local patterns. To achieve this, we reformulate time series modeling as approximating the series with smooth univariate functions. The local smoothness of each univariate function ensures that the fitted time series remains resilient against local disturbances. However, a direct Kolmogorov-Arnold Network (KAN) implementation proves susceptible to these disturbances due to the inherently localized characteristics of B-spline functions. We thus propose KAN-AD, replacing B-splines with truncated Fourier expansions and introducing a novel lightweight learning mechanism that emphasizes global patterns while staying robust to local disturbances. On four popular TSAD benchmarks, KAN-AD achieves an average 15% improvement in detection accuracy (with peaks exceeding 27%) over state-of-the-art baselines. Remarkably, it requires fewer than 1,000 trainable parameters, resulting in a 50% faster inference speed compared to the original KAN, demonstrating the approach’s efficiency and practical viability.
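The move from localized B-splines to truncated Fourier expansions can be illustrated with a discrete projection fit on a uniform grid, where the Fourier basis is orthogonal. This is a sketch of the representation only, not the paper's learning mechanism; the series and `order` are toy choices.

```python
import math

def fit_truncated_fourier(series, order):
    """Fit a0 + sum_k (a_k cos(k*w*t) + b_k sin(k*w*t)) by discrete
    projection on a uniform grid of length n, with w = 2*pi/n."""
    n = len(series)
    w = 2.0 * math.pi / n
    a0 = sum(series) / n
    coeffs = []
    for k in range(1, order + 1):
        a = 2.0 / n * sum(series[t] * math.cos(w * k * t) for t in range(n))
        b = 2.0 / n * sum(series[t] * math.sin(w * k * t) for t in range(n))
        coeffs.append((a, b))
    def f(t):
        return a0 + sum(a * math.cos(w * (k + 1) * t) + b * math.sin(w * (k + 1) * t)
                        for k, (a, b) in enumerate(coeffs))
    return f

# a series that lies exactly in the truncated basis is recovered exactly
series = [math.sin(2 * math.pi * t / 64) + 0.5 for t in range(64)]
f = fit_truncated_fourier(series, order=3)
```

Each coefficient depends on the whole series, so a single-point spike perturbs every coefficient only slightly, in contrast to a B-spline whose local support absorbs the spike.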
Paperid:2237
Authors:Bernal Jimenez Gutierrez · Yiheng Shu · Weijian Qi · Sizhe Zhou · Yu Su
Abstract:
Our ability to continuously acquire, organize, and leverage knowledge is a key feature of human intelligence that AI systems must approximate to unlock their full potential. Given the challenges in continual learning with large language models (LLMs), retrieval-augmented generation (RAG) has become the dominant way to introduce new information. However, its reliance on vector retrieval hinders its ability to mimic the dynamic and interconnected nature of human long-term memory. Recent RAG approaches augment vector embeddings with various structures like knowledge graphs to address some of these gaps, namely sense-making and associativity. However, their performance on more basic factual memory tasks drops considerably below standard RAG. We address this unintended deterioration and propose ContinualRAG, a framework that outperforms standard RAG comprehensively on factual, sense-making, and associative memory tasks. ContinualRAG builds upon the Personalized PageRank algorithm used in HippoRAG and enhances it with deeper passage integration and more effective online use of an LLM. This combination pushes this RAG system closer to the effectiveness of human long-term memory, achieving a 7% improvement in associative memory tasks over the state-of-the-art embedding model while also exhibiting superior factual knowledge and sense-making memory capabilities. This work paves the way for non-parametric continual learning for LLMs.
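The Personalized PageRank primitive that this line of work builds on (via HippoRAG) can be sketched with power iteration over an adjacency list; the graph and restart vector below are toy values, not from the paper.

```python
def personalized_pagerank(adj, restart, damping=0.85, iters=100):
    """Power iteration for Personalized PageRank: with probability
    (1 - damping) the walker restarts at query-relevant nodes (`restart`)."""
    n = len(adj)
    p = restart[:]
    for _ in range(iters):
        new = [(1.0 - damping) * restart[i] for i in range(n)]
        for i in range(n):
            if adj[i]:                      # spread mass along out-edges
                share = damping * p[i] / len(adj[i])
                for j in adj[i]:
                    new[j] += share
            else:                           # dangling node: spread uniformly
                for j in range(n):
                    new[j] += damping * p[i] / n
        p = new
    return p

# 3-node cycle 0 -> 1 -> 2 -> 0, restarting at node 0 (the "query" node)
scores = personalized_pagerank(adj=[[1], [2], [0]], restart=[1.0, 0.0, 0.0])
```

Mass concentrates near the restart node and decays along the graph, which is how such systems rank passages associated with a query through multi-hop links.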
Paperid:2238
Authors:Siqi Zhang · Xing Huang · Feihu Huang
Abstract:
Bilevel optimization is widely applied in many machine learning tasks such as hyperparameter learning and meta learning. Recently, many algorithms have been proposed to solve these bilevel optimization problems. However, these algorithms rely on the smoothness of the bilevel objective functions. In fact, some machine learning tasks, such as language model training, do not satisfy this smoothness condition. More recently, some methods have begun to study generalized-smooth bilevel optimization. However, these methods only consider the (strongly) convex lower-level objective function; moreover, they handle a generalized-smooth upper-level objective but still require a standard smooth lower-level objective. To fill this gap, in this paper we study generalized-smooth bilevel optimization with a nonconvex lower-level objective function, where both the upper-level and lower-level objectives are generalized-smooth. We propose an efficient Hessian/Jacobian-free penalty normalized gradient (i.e., PNGBiO) method. Moreover, we provide a convergence analysis of our PNGBiO method and prove that it obtains a gradient complexity of $O(\epsilon^{-4})$ for finding an $\epsilon$-stationary solution. Experimental results demonstrate the efficiency of our proposed method.
Paperid:2239
Authors:Maximilian Beck · Korbinian Pöppel · Phillip Lippe · Richard Kurle · Patrick Blies · Günter Klambauer · Sebastian Böck · Sepp Hochreiter
Abstract:
Recent breakthroughs in solving reasoning, math and coding problems with Large Language Models (LLMs) have been enabled by investing substantial computation budgets at inference time. Therefore, inference speed is one of the most critical properties of LLM architectures, and there is a growing need for LLMs that are efficient and fast at inference. Recently, LLMs built on the xLSTM architecture have emerged as a powerful alternative to Transformers, offering linear compute scaling with sequence length and constant memory usage, both highly desirable properties for efficient inference. However, such xLSTM-based LLMs have yet to be scaled to larger models and assessed and compared with respect to inference speed and efficiency. In this work, we introduce xLSTM 7B, a 7-billion-parameter LLM that combines xLSTM’s architectural benefits with targeted optimizations for fast and efficient inference. Our experiments demonstrate that xLSTM 7B achieves performance on downstream tasks comparable to other similar-sized LLMs, while providing significantly faster inference speeds and greater efficiency compared to Llama- and Mamba-based LLMs. These results establish xLSTM 7B as the fastest and most efficient 7B LLM, offering a solution for tasks that require large amounts of test-time computation. Our work highlights xLSTM's potential as a foundational architecture for methods building on heavy use of LLM inference.
Paperid:2240
Authors:Yuxiang Chen · Haocheng Xi · Jun Zhu · Jianfei Chen
Abstract:
Pretraining Transformers in FP4 precision is becoming a promising approach to gain substantial speedup, but it comes with a considerable loss of accuracy. The Microscaling (MX) data format provides a fine-grained per-group quantization method to improve the representation ability of the FP4 format and is supported by the next-generation Blackwell GPU architecture. However, training with the MXFP4 data format still results in significant degradation, and there is a lack of systematic research on the reason. In this work, we propose a novel training method, TetraJet, for more accurate FP4 training. We comprehensively evaluate all of the quantizers involved in the training, and identify the weight oscillation problem in the forward pass as the main source of the degradation in MXFP4 training. Therefore, we introduce two novel methods, EMA Quantizer (Q-EMA) and Adaptive Ramping Optimizer (Q-Ramping), to resolve the oscillation problem. Extensive experiments on Vision Transformers demonstrate that TetraJet consistently outperforms the existing 4-bit training methods, and Q-EMA & Q-Ramping can provide additional enhancement by effectively reducing oscillation. We decreased the accuracy degradation by more than 50% compared to the baseline, and can even achieve competitive performance compared to full-precision training.
Paperid:2241
Authors:Hechuan Wen · Tong Chen · Mingming Gong · Li Kheng Chai · Shazia Sadiq · Hongzhi Yin
Abstract:
Although numerous complex algorithms for treatment effect estimation have been developed in recent years, their effectiveness remains limited when handling insufficiently labeled training sets due to the high cost of labeling the effect after treatment, e.g., expensive tumor imaging or biopsy procedures needed to evaluate treatment effects. Therefore, it becomes essential to actively incorporate more high-quality labeled data, all while adhering to a constrained labeling budget. To enable data-efficient treatment effect estimation, we formalize the problem through rigorous theoretical analysis within the active learning context, where the derived key measures -- the factual and counterfactual covering radius -- determine the risk upper bound. To reduce the bound, we propose a greedy radius reduction algorithm, which excels under an idealized, balanced data distribution. To generalize to more realistic data distributions, we further propose FCCM, which transforms the optimization objective into the Factual and Counterfactual Coverage Maximization to ensure effective radius reduction during data acquisition. Furthermore, benchmarking FCCM against other baselines demonstrates its superiority across both fully synthetic and semi-synthetic datasets. Anonymous code is at: https://anonymous.4open.science/r/PaperID_2979.
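The idealized greedy radius-reduction step can be sketched as farthest-point (k-center) selection: repeatedly label the pool point farthest from the current labeled set, which shrinks the covering radius. FCCM itself optimizes factual and counterfactual coverage jointly, which this toy example does not capture; the points and budget are illustrative.

```python
def greedy_radius_reduction(pool, labeled, budget, dist):
    """Greedy k-center selection: each round picks the unlabeled pool point
    whose distance to the labeled set (plus earlier picks) is largest."""
    labeled = list(labeled)
    chosen = []
    for _ in range(budget):
        far = max((p for p in pool if p not in chosen and p not in labeled),
                  key=lambda p: min(dist(p, q) for q in labeled + chosen))
        chosen.append(far)
    return chosen

pool = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (5.0, 5.0)]
dist = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
picks = greedy_radius_reduction(pool, labeled=[(0.0, 0.0)], budget=2, dist=dist)
```

Each pick halves the worst-case distance from any pool point to a labeled one, which is exactly the covering radius the risk bound depends on.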
Paperid:2242
Authors:In Huh · Changwook Jeong · Muhammad Alam
Abstract:
Out-Of-Domain (OOD) generalization is a significant challenge in learning dynamical systems, especially when they exhibit bifurcation, a sudden topological transition triggered by a model parameter crossing a critical threshold. A prevailing belief is that machine learning models, unless equipped with strong priors, struggle to generalize across bifurcations due to the abrupt changes in data characteristics. Contrary to this belief, we demonstrate that context-dependent Neural Ordinary Differential Equations (NODEs), trained solely on localized, pre-bifurcation, symmetric data yet without physics-based priors, can identify post-bifurcation, symmetry-breaking behaviors, even in a zero-shot manner. We attribute this capability to the model's implicit utilization of topological invariants, particularly the Poincaré index, and offer a formal explanation based on the Poincaré-Hopf theorem. We derive the conditions under which NODEs can recover—or erroneously hallucinate—broken symmetries without explicit training. Building on this insight, we showcase a topological regularizer inspired by the Poincaré-Hopf theorem, providing empirical justification with the Landau-Khalatnikov system.
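The topological invariant invoked here, the Poincaré index of an isolated equilibrium, can be computed numerically as the winding number of the vector field along a small loop; a stdlib sketch (the sampling resolution is an illustrative choice):

```python
import math

def poincare_index(field, cx, cy, radius=1.0, samples=720):
    """Winding number of a planar vector field along a circle around (cx, cy):
    accumulates the wrapped change in the field's direction angle."""
    total = 0.0
    prev = None
    for s in range(samples + 1):
        th = 2.0 * math.pi * s / samples
        vx, vy = field(cx + radius * math.cos(th), cy + radius * math.sin(th))
        ang = math.atan2(vy, vx)
        if prev is not None:
            d = ang - prev
            while d > math.pi:
                d -= 2.0 * math.pi
            while d < -math.pi:
                d += 2.0 * math.pi
            total += d
        prev = ang
    return round(total / (2.0 * math.pi))
```

Nodes, foci, and centers have index +1 while saddles have index -1; the Poincaré-Hopf theorem constrains how these indices can change, which is the structure the paper's regularizer exploits.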
Paperid:2243
Authors:Yong Liu · Guo Qin · Zhiyuan Shi · Zhi Chen · Caiyin Yang · Xiangdong Huang · Jianmin Wang · Mingsheng Long
Abstract:
We introduce Sundial, a family of native, flexible, and scalable time series foundation models. To predict the next patch's distribution, we propose a TimeFlow Loss based on flow-matching, which facilitates native pre-training of Transformers on time series without discrete tokenization. Conditioned on arbitrary-length time series, our model is pre-trained without specifying any prior distribution and can generate multiple probable predictions, achieving flexibility in representation learning beyond using parametric densities. Towards time series foundation models, we leverage minimal but crucial adaptations of Transformers and curate TimeBench with 1 trillion time points, comprising mostly real-world datasets and synthetic data. By mitigating mode collapse through TimeFlow Loss, we pre-train a family of Sundial models on TimeBench, which exhibit unprecedented model capacity and generalization performance on zero-shot forecasting. In addition to presenting good scaling behavior, Sundial achieves new state-of-the-art on both point forecasting and probabilistic forecasting benchmarks. We believe that Sundial's pioneering generative paradigm will facilitate a wide variety of forecasting scenarios.
Paperid:2244
Authors:Yuheng Jing · Kai Li · Bingyun Liu · Ziwen Zhang · Haobo Fu · Qiang Fu · Junliang Xing · Jian Cheng
Abstract:
Offline Opponent Modeling (OOM) aims to learn an adaptive autonomous agent policy that dynamically adapts to opponents using an offline dataset from multi-agent games. Previous work assumes that the dataset is optimal. However, this assumption is difficult to satisfy in the real world. When the dataset is suboptimal, existing approaches struggle to work. To tackle this issue, we propose a simple and general algorithmic improvement framework, Truncated Q-driven Instant Policy Refinement (TIPR), to handle the suboptimality of OOM algorithms induced by datasets. The TIPR framework is plug-and-play in nature. Compared to original OOM algorithms, it requires only two extra steps: (1) Learn a horizon-truncated in-context action-value function, namely Truncated Q, using the offline dataset. The Truncated Q estimates the expected return within a fixed, truncated horizon and is conditioned on opponent information. (2) Use the learned Truncated Q to instantly decide whether to perform policy refinement and to generate policy after refinement during testing. Theoretically, we analyze the rationale of Truncated Q from the perspective of No Maximization Bias probability. Empirically, we conduct extensive comparison and ablation experiments in four representative competitive environments. TIPR effectively improves various OOM algorithms pretrained with suboptimal datasets.
Paperid:2245
Authors:Savelii Chezhegov · Klyukin Yaroslav · Andrei Semenov · Aleksandr Beznosikov · Alexander Gasnikov · Samuel Horváth · Martin Takac · Eduard Gorbunov
Abstract:
Methods with adaptive stepsizes, such as AdaGrad and Adam, are essential for training modern Deep Learning models, especially Large Language Models. Typically, the noise in the stochastic gradients is heavy-tailed for the latter. Gradient clipping provably helps to achieve good high-probability convergence under such noise. However, despite the similarity between AdaGrad/Adam and Clip-SGD, the current understanding of the high-probability convergence of AdaGrad/Adam-type methods is limited in this case. In this work, we prove that AdaGrad/Adam (and their delayed versions) can have provably bad high-probability convergence if the noise is heavy-tailed. We also show that gradient clipping fixes this issue, i.e., we derive new high-probability convergence bounds with polylogarithmic dependence on the confidence level for AdaGrad-Norm and Adam-Norm with clipping and with/without delay for smooth convex/non-convex stochastic optimization with heavy-tailed noise. Our empirical evaluations highlight the superiority of clipped versions of AdaGrad/Adam-Norm in handling heavy-tailed noise.
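A sketch of the clipped AdaGrad-Norm update discussed above, applied to a toy quadratic; the step-size and clipping constants are illustrative, and this is not the paper's exact algorithm or analysis.

```python
import math

def clip(g, max_norm):
    """Rescale the gradient so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in g))
    if norm <= max_norm or norm == 0.0:
        return g
    scale = max_norm / norm
    return [v * scale for v in g]

def adagrad_norm_clipped(grad, x0, eta=1.0, max_norm=1.0, steps=2000):
    """AdaGrad-Norm with clipping: a single scalar accumulator b^2 scales
    the step, and clipping bounds the effect of heavy-tailed gradient noise."""
    x, b2 = list(x0), 1e-8
    for _ in range(steps):
        g = clip(grad(x), max_norm)
        b2 += sum(v * v for v in g)
        step = eta / math.sqrt(b2)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

# toy objective f(x) = 0.5 * ||x||^2, whose gradient is x itself
x_star = adagrad_norm_clipped(lambda x: x, [5.0, -3.0])
```

Clipping caps how much any single (possibly heavy-tailed) gradient can inflate both the update and the accumulator, which is the mechanism behind the polylogarithmic high-probability bounds.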
Paperid:2246
Authors:Geon-Woo Kim · Junbo Li · Shashidhar Gandham · Omar Baldonado · Adithya Gangidi · Pavan Balaji · Zhangyang “Atlas” Wang · Aditya Akella
Abstract:
Training large language models (LLMs) increasingly relies on geographically distributed accelerators, causing prohibitive communication costs across regions and uneven utilization of heterogeneous hardware. We propose HALoS, a hierarchical asynchronous optimization framework that tackles these issues by introducing local parameter servers (LPSs) within each region and a global parameter server (GPS) that merges updates across regions. This hierarchical design minimizes expensive inter-region communication, reduces straggler effects, and leverages fast intra-region links. We provide a rigorous convergence analysis for HALoS under non-convex objectives, including theoretical guarantees on the role of hierarchical momentum in asynchronous training. Empirically, HALoS attains up to 7.5× faster convergence than synchronous baselines in geo-distributed LLM training and improves upon existing asynchronous methods by up to 2.1×. Crucially, HALoS preserves the model quality of fully synchronous SGD—matching or exceeding accuracy on standard language modeling and downstream benchmarks—while substantially lowering total training time. These results demonstrate that hierarchical, server-side update accumulation and global model merging are powerful tools for scalable, efficient training of new-era LLMs in heterogeneous, geo-distributed environments.
Paperid:2247
Authors:Abrar Majeedi · Viswanatha Reddy Gajjala · Satya Sai Srinath Namburi GNVV · Nada Elkordi · Yin Li
Abstract:
Real-world time series are often governed by complex and nonlinear underlying dynamics. Understanding these dynamics is therefore crucial for precise future prediction. While deep learning has achieved notable success in time series forecasting, many existing approaches do not explicitly model the dynamics. To bridge this gap, we introduce DeepEDM, a framework that seamlessly integrates nonlinear dynamical systems modeling with deep neural networks. Inspired by empirical dynamic modeling (EDM) and rooted in Takens' theorem, DeepEDM presents a novel neural network that leverages time-delayed embeddings within a latent space, and employs kernel regression to approximate the underlying dynamics, allowing for accurate prediction of future data points. To evaluate our method, we conduct comprehensive experiments on both synthetic and real-world data. Our results show that by incorporating dynamic modeling, DeepEDM significantly improves forecasting accuracy, outperforming state-of-the-art methods.
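The two ingredients named above, time-delay embedding and kernel regression, can be sketched in a few lines. This is a plain Nadaraya-Watson forecaster on a toy sine series, not the paper's neural architecture; `dim`, `tau`, and `bandwidth` are illustrative.

```python
import math

def delay_embed(series, dim, tau=1):
    """Takens-style time-delay embedding: time t maps to the vector
    (x_t, x_{t-tau}, ..., x_{t-(dim-1)*tau})."""
    start = (dim - 1) * tau
    return [[series[t - k * tau] for k in range(dim)]
            for t in range(start, len(series))]

def kernel_forecast(series, dim=3, tau=1, bandwidth=0.2):
    """Nadaraya-Watson regression in the embedded space: predict the next
    value as a Gaussian-kernel-weighted average of historical successors."""
    pts = delay_embed(series, dim, tau)
    query, history = pts[-1], pts[:-1]      # every history point has a successor
    start = (dim - 1) * tau
    num = den = 0.0
    for i, p in enumerate(history):
        d2 = sum((a - b) ** 2 for a, b in zip(p, query))
        w = math.exp(-d2 / (2.0 * bandwidth ** 2))
        num += w * series[start + i + 1]    # the value that followed point p
        den += w
    return num / den

series = [math.sin(0.3 * t) for t in range(200)]
pred = kernel_forecast(series)
```

States that were close in the delay-embedded space tend to evolve similarly, so averaging their successors approximates the underlying flow without an explicit model.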
Paperid:2248
Authors:Tuan Dam
Abstract:
Monte-Carlo Tree Search (MCTS) has demonstrated success in online planning for deterministic environments, yet significant challenges remain in adapting it to stochastic Markov Decision Processes (MDPs), particularly in continuous state-action spaces. Existing methods, such as HOOT, which combines MCTS with the Hierarchical Optimistic Optimization (HOO) bandit strategy, address continuous spaces but rely on a logarithmic exploration bonus that lacks theoretical guarantees in non-stationary, stochastic settings. Recent advancements, such as Poly-HOOT, introduced a polynomial bonus term to achieve convergence in deterministic MDPs, though a similar theory for stochastic MDPs remains undeveloped. In this paper, we propose a novel MCTS algorithm, Stochastic-Power-HOOT, designed for continuous, stochastic MDPs. Stochastic-Power-HOOT integrates a power mean as a value backup operator, alongside a polynomial exploration bonus to address the non-stationarity inherent in continuous action spaces. Our theoretical analysis establishes that Stochastic-Power-HOOT converges at a polynomial rate of $\mathcal{O}(n^{-1/2})$, where $n$ is the number of visited trajectories, thereby extending the non-asymptotic convergence guarantees of Poly-HOOT to stochastic environments. Experimental results on synthetic and stochastic tasks validate our theoretical findings, demonstrating the effectiveness of Stochastic-Power-HOOT in continuous, stochastic domains.
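The power-mean backup operator at the heart of such methods can be written down directly; visit counts act as weights, and the exponent p interpolates between the average backup (p = 1) and the max backup (p tending to infinity). A stdlib sketch assuming nonnegative values:

```python
def power_mean(values, counts, p):
    """Weighted power mean backup:
    ((sum_i c_i * v_i^p) / sum_i c_i)^(1/p), for nonnegative values v_i."""
    total = sum(counts)
    return (sum(c * v ** p for v, c in zip(values, counts)) / total) ** (1.0 / p)
```

Larger p weights high-value children more heavily, trading the stability of the average backup against the optimism of the max backup.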
Paperid:2249
Authors:Sharan Vaswani · Reza Babanezhad
Abstract:
Armijo line-search (Armijo-LS) is a standard method to set the step-size for gradient descent (GD). For smooth functions, Armijo-LS alleviates the need to know the global smoothness constant $L$ and adapts to the ``local'' smoothness, enabling GD to converge faster. However, existing theoretical analyses of GD with Armijo-LS ($\texttt{GD-LS}$) do not characterize this fast convergence. We show that if the objective function satisfies a certain non-uniform smoothness condition, $\texttt{GD-LS}$ converges provably faster than GD with a constant $1/L$ step-size (denoted as $\texttt{GD(1/L)}$). Our results imply that for convex losses corresponding to logistic regression and multi-class classification, $\texttt{GD-LS}$ can converge to the optimum at a linear rate, and hence, improve over the sublinear convergence of $\texttt{GD(1/L)}$. Furthermore, for non-convex losses satisfying gradient domination (for example, those corresponding to the softmax policy gradient in RL or generalized linear models with a logistic link function), $\texttt{GD-LS}$ can match the fast convergence of algorithms tailored for these specific settings. Finally, we prove that under the interpolation assumption, for convex losses, stochastic GD with a stochastic line-search can match the fast convergence of $\texttt{GD-LS}$.
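A minimal sketch of GD with Armijo backtracking on an ill-conditioned quadratic ($L = 100$): the accepted step adapts to local smoothness instead of being fixed at $1/L$. The constants (`eta_max`, `c`, `beta`) are illustrative.

```python
def armijo_gd(f, grad, x0, eta_max=10.0, c=0.5, beta=0.8, steps=1000):
    """Gradient descent with Armijo backtracking: start from a large trial
    step and shrink it geometrically until the sufficient-decrease condition
    f(x - eta*g) <= f(x) - c * eta * ||g||^2 holds."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        gnorm2 = sum(v * v for v in g)
        if gnorm2 == 0.0:
            break
        eta = eta_max
        while f([xi - eta * gi for xi, gi in zip(x, g)]) > f(x) - c * eta * gnorm2:
            eta *= beta
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x

# f(x) = 0.5 * x^T diag(1, 100) x : smoothness constant L = 100
f = lambda x: 0.5 * (x[0] ** 2 + 100.0 * x[1] ** 2)
g = lambda x: [x[0], 100.0 * x[1]]
sol = armijo_gd(f, g, [1.0, 1.0])
```

Once the ill-conditioned coordinate is small, the accepted steps grow well beyond $1/L$, which is the adaptivity to local smoothness the paper formalizes.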
Paperid:2250
Authors:Lorenzo Magnino · Yuchen Zhu · Mathieu Lauriere
Abstract:
Optimal stopping is a fundamental problem in optimization with applications in risk management, finance, robotics, and machine learning. We extend the standard framework to a multi-agent setting, named multi-agent optimal stopping (MAOS), where agents cooperate to make optimal stopping decisions in a finite-space, discrete-time environment. Since solving MAOS becomes computationally prohibitive as the number of agents grows large, we study the mean field optimal stopping (MFOS) problem, obtained as the number of agents tends to infinity. We establish that MFOS provides a good approximation to MAOS and prove a dynamic programming principle (DPP) based on mean-field control theory. We then propose two deep learning approaches: one that learns optimal stopping decisions by simulating full trajectories and another that leverages the DPP to compute the value function and to learn the optimal stopping rule using backward induction. Both methods train neural networks to approximate optimal stopping policies. We demonstrate the effectiveness and the scalability of our work through numerical experiments on 6 different problems in spatial dimension up to 300. To the best of our knowledge, this is the first work to formalize and computationally solve MFOS in discrete time and finite space, opening new directions for scalable multi-agent optimal stopping methods.
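The backward-induction idea behind the DPP can be illustrated in the simplest single-agent, finite-horizon special case (the MFOS setting additionally couples agents through the population distribution, which this sketch omits): work backward from the horizon, comparing the stop payoff with the expected continuation value.

```python
def stopping_values(offers, horizon):
    """Backward induction for optimal stopping with i.i.d. uniform offers:
    v[t] = E[max(X, v[t+1])], with v[horizon] = 0 (no offer remains). The
    optimal rule is 'accept offer x at time t iff x >= v[t+1]'."""
    v = [0.0] * (horizon + 1)
    for t in range(horizon - 1, -1, -1):
        v[t] = sum(max(x, v[t + 1]) for x in offers) / len(offers)
    return v

# offers drawn uniformly from {0, 1} over 3 rounds
v = stopping_values(offers=[0.0, 1.0], horizon=3)
```

The values grow as more rounds remain (here 0.5, 0.75, 0.875), quantifying the option value of waiting; the deep learning approaches in the paper approximate exactly this recursion at scale.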
Paperid:2251
Authors:Ze Cheng · Wang Xiaoqiang · Zhuoyu Li · Jianing Huang · Zhizhou Zhang · Zhongkai Hao · Hang Su
Abstract:
PDE-Constrained Optimization (PDECO) problems can be accelerated significantly by employing gradient-based methods with surrogate models like neural operators compared to traditional numerical solvers. However, this approach faces two key challenges: (1) \textbf{Data efficiency}: Efficiently sampling data and effectively training neural operators for the specific demands of optimization. (2) \textbf{Robustness}: Controlling the risk of optimization derailment due to inaccuracies in neural operator predictions and gradients. To address these challenges, we propose a novel framework: (1) \textbf{Data-driven training}: We leverage data from full steps of traditional optimization algorithms and employ a specialized training method for neural operators. (2) \textbf{Enhanced derivative learning}: We introduce a \textit{Virtual-Fourier} layer to enhance derivative learning within the neural operator, a crucial aspect for gradient-based optimization. (3) \textbf{Hybrid optimization}: We implement a hybrid approach that integrates neural operators with numerical solvers, providing robust regularization for the optimization process. Our extensive experimental results demonstrate the effectiveness of our model in accurately learning operators and their derivatives. Furthermore, our hybrid optimization approach exhibits robust convergence.
Paperid:2252
Authors:Yang Xu · Vaneet Aggarwal
Abstract:
We address the problem of quantum reinforcement learning (QRL) under model-free settings with quantum oracle access to the Markov Decision Process (MDP). This paper introduces a Quantum Natural Policy Gradient (QNPG) algorithm, which replaces the random sampling used in classical Natural Policy Gradient (NPG) estimators with a deterministic gradient estimation approach, enabling seamless integration into quantum systems. While this modification introduces a bounded bias in the estimator, the bias decays exponentially with increasing truncation levels. This paper demonstrates that the proposed QNPG algorithm achieves a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-1.5})$ for queries to the quantum oracle, significantly improving the classical lower bound of $\tilde{\mathcal{O}}(\epsilon^{-2})$ for queries to the MDP.
Paperid:2253
Authors:Ofir Nabati · Guy Tennenholtz · Chih-wei Hsu · Moonkyung Ryu · Deepak Ramachandran · Yinlam Chow · Xiang Li · Craig Boutilier
Abstract:
We address the problem of interactive text-to-image (T2I) generation, designing a reinforcement learning (RL) agent which iteratively improves a set of generated images for a user through a sequence of prompt expansions. Using human raters, we create a novel dataset of sequential preferences, which we leverage together with large-scale open-source (non-sequential) datasets. We construct user-preference and user-choice models using an EM strategy and identify varying user preference types. We then leverage a large multimodal language model (LMM) and a value-based RL approach to suggest an adaptive and diverse slate of prompt expansions to the user. Our Preference Adaptive and Sequential Text-to-image Agent (PASTA) extends T2I models with adaptive multi-turn capabilities, fostering collaborative co-creation and addressing uncertainty or underspecification in a user's intent. We evaluate PASTA using human raters, showing significant improvement compared to baseline methods. We also open-source our sequential rater dataset and simulated user-rater interactions to support future research in user-centric multi-turn T2I systems.
Paperid:2254
Authors:Brian Zhang · Ioannis Anagnostides · Emanuel Tewolde · Ratip Emin Berker · Gabriele Farina · Vincent Conitzer · Tuomas Sandholm
Abstract:
Variational inequalities (VIs) encompass many fundamental problems in diverse areas ranging from engineering to economics and machine learning. However, their considerable expressivity comes at the cost of computational intractability. In this paper, we introduce and analyze a natural relaxation---which we refer to as expected variational inequalities (EVIs)---where the goal is to find a distribution that satisfies the VI constraint in expectation. By adapting recent techniques from game theory, we show that, unlike VIs, EVIs can be solved in polynomial time under general (nonmonotone) operators. EVIs capture the seminal notion of correlated equilibria, but enjoy a greater reach beyond games. We also employ our framework to capture and generalize several existing disparate results, including from settings such as smooth games, and games with coupled constraints or nonconcave utilities.
Paperid:2255
Authors:Zhanke Zhou · Xiao Feng · Zhaocheng Zhu · Jiangchao Yao · Sanmi Koyejo · Bo Han
Abstract:
While numerous benchmarks evaluate the reasoning abilities of large language models (LLMs) across various domains, most focus on passive reasoning tasks—where models are provided with all the necessary information to solve a problem. In contrast, active reasoning tasks, which require models to interact with external systems to gather missing information, remain largely underexplored. To bridge this gap, we introduce AR-Bench, a benchmark specifically designed to systematically assess the active reasoning capabilities of LLMs. AR-Bench features three distinct tasks: detective cases, situation puzzles, and guessing numbers, which evaluate active reasoning across commonsense, logical, and symbolic reasoning scenarios. Empirical results reveal that modern LLMs struggle significantly with active reasoning, often failing to consistently acquire or utilize relevant information. This exposes a clear disparity between their passive and active reasoning abilities. Further analysis shows that even advanced techniques, including tree-of-thought and post-training methods, provide only marginal improvements and fail to achieve performance levels suitable for practical applications. These findings underscore the urgent need to improve LLMs' active reasoning capabilities to better align with real-world problem-solving requirements.
Paperid:2256
Authors:Binghui Peng · Aviad Rubinstein
Abstract:
We revisit the problem of selecting $k$-out-of-$n$ elements with the goal of optimizing an objective function, and ask whether it can be solved approximately with *sublinear query complexity*. - For objective functions that are monotone *submodular*, [Li, Feldman, Kazemi, Karbasi, NeurIPS'22; Kuhnle, AISTATS'21] gave an $\Omega(n/k)$ query lower bound for approximating to within any constant factor. We strengthen their lower bound to a nearly tight $\tilde{\Omega}(n)$. This lower bound holds even for estimating the value of the optimal subset. - When the objective function is *additive*, i.e. $f(S) = \sum_{i \in S} w_i$ for unknown $w_i\geq 0$, we prove that {\em finding} an approximately optimal subset still requires near-linear query complexity, but we can {\em estimate} the value of the optimal subset in $\tilde{O}(n/k)$ queries, and that this is tight up to polylog factors.
Paperid:2257
Authors:Jian Liu · Haohan Weng · Biwen Lei · Xianghui Yang · Zibo Zhao · Zhuo Chen · Chunchao Guo · Song Guo · Tao Han
Abstract:
The next-coordinate prediction paradigm has emerged as the de facto standard in current auto-regressive mesh generation methods. Despite their effectiveness, there is no efficient measurement for the various tokenizers that serialize meshes into sequences. In this paper, we introduce a new metric, Per-Token-Mesh-Entropy (PTME), to evaluate existing mesh tokenizers theoretically without any training. Building upon PTME, we propose a plug-and-play tokenization technique called coordinate merging. It further improves the compression ratios of existing tokenizers by rearranging and merging the most frequent patterns of coordinates. Through experiments on various tokenization methods like MeshXL, MeshAnything V2, and Edgerunner, we further validate the performance of our method. We hope that the proposed PTME and coordinate merging can enhance existing mesh tokenizers and guide the further development of native mesh generation.
Paperid:2258
Authors:Han Zhou · dr. Jordy Van Landeghem · Teodora Popordanoska · Matthew B Blaschko
Abstract:
The selective classifier (SC) has been proposed for rank-based uncertainty thresholding, which could have applications in safety-critical areas such as medical diagnostics, autonomous driving, and the justice system. The Area Under the Risk-Coverage Curve (AURC) has emerged as the foremost evaluation metric for assessing the performance of SC systems. In this work, we present a formal statistical formulation of the population AURC, presenting an equivalent expression that can be interpreted as a reweighted risk function. Through Monte Carlo methods, we derive empirical AURC plug-in estimators for finite-sample scenarios. The weight estimators associated with these plug-in estimators are shown to be consistent, with low bias and tightly bounded mean squared error (MSE). The plug-in estimators are proven to converge at a rate of $\mathcal{O}(\sqrt{\ln(n)/n})$, demonstrating statistical consistency. We empirically validate the effectiveness of our estimators through experiments across multiple datasets, model architectures, and confidence score functions (CSFs), demonstrating consistency and effectiveness in fine-tuning AURC performance.
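As a point of reference for the metric discussed in this abstract, the plain (unweighted) plug-in AURC can be computed by sorting predictions by confidence and averaging the cumulative risk over all coverage levels. The sketch below is that generic textbook estimator with made-up data; the function name and inputs are our illustrative assumptions, not the paper's reweighted estimator.

```python
import numpy as np

def empirical_aurc(confidence, loss):
    """Plain plug-in AURC: sort samples by confidence (descending),
    compute the selective risk at each coverage level 1/n, 2/n, ..., 1,
    and average. A generic sketch, not the paper's estimator."""
    order = np.argsort(-confidence)
    sorted_loss = loss[order]
    # Selective risk at coverage k/n = mean loss of the k most confident.
    cum_risk = np.cumsum(sorted_loss) / np.arange(1, len(loss) + 1)
    return cum_risk.mean()

# Hypothetical scores: the two most confident predictions are correct.
conf_scores = np.array([0.9, 0.8, 0.7, 0.6])
losses = np.array([0.0, 0.0, 1.0, 1.0])
print(empirical_aurc(conf_scores, losses))  # (0 + 0 + 1/3 + 1/2) / 4
```

A well-ranked confidence function pushes the losses to low-coverage regions, shrinking this area.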
Paperid:2259
Authors:Piyush Lalitkumar Tiwary · Kinjawl Bhattacharyya · Prathosh AP
Abstract:
Medical image segmentation models often struggle to generalize across different domains due to various reasons. Domain Generalization (DG) methods overcome this either through representation learning or data augmentation (DA). While representation learning methods seek domain-invariant features, they often rely on ad-hoc techniques and lack formal guarantees. DA methods, which enrich model representations through synthetic samples, have shown comparable or superior performance to representation learning approaches. We propose LangDAug, a novel Langevin Data Augmentation for multi-source domain generalization in 2D medical image segmentation. LangDAug leverages Energy-Based Models (EBMs) trained via contrastive divergence to traverse between source domains, generating intermediate samples through Langevin dynamics. Theoretical analysis shows that LangDAug induces a regularization effect, and for GLMs, it upper-bounds the Rademacher complexity by the intrinsic dimensionality of the data manifold. Through extensive experiments on Fundus segmentation and 2D MRI prostate segmentation benchmarks, we show that LangDAug outperforms state-of-the-art domain generalization methods and effectively complements existing domain-randomization approaches.
Paperid:2260
Authors:Jun-Qi Guo · Meng-Zhang Qian · Wei Gao · Zhi-Hua Zhou
Abstract:
Ensemble methods have attracted much attention in adversarial robustness learning during the past years, and diversity has always been one of the most crucial factors in the design of adversarial ensemble methods. This work focuses on the fundamental problems: how to define diversity for adversarial ensembles, and how to correlate it with algorithmic performance. We first show that it is an NP-hard problem to precisely calculate the diversity of two networks in adversarial ensemble learning, which makes it different from traditional diversity analysis. We present the first diversity decomposition in adversarial ensemble learning, following previous works with the first-order approximation. More precisely, the adversarial ensemble loss can be decomposed into the average of individual adversarial losses, prediction diversity, gradient diversity, and cross diversity. Hence, it is not sufficient to consider only gradient diversity in the characterization of diversity, as in previous adversarial ensemble methods. We present a similar diversity decomposition for classification with cross-entropy loss for neural networks. Based on the theoretical analysis, we develop a new ensemble method via orthogonal adversarial predictions to improve gradient and cross diversity simultaneously. We finally validate the effectiveness of our method empirically.
Paperid:2261
Authors:Chuheng Zhang · Wei Shen · Li Zhao · Xuyun Zhang · Xiaolong Xu · Wanchun Dou · Jiang Bian
Abstract:
While direct policy optimization methods exist, pioneering LLMs are fine-tuned with reinforcement learning from human feedback (RLHF) to generate better responses under the supervision of a reward model learned from preference data. One major challenge of RLHF is the inaccuracy of the intermediate reward model, especially in tasks that require complex reasoning for the reward model to score a response. We find that the reliability of the reward model varies across responses assigned different rewards. This motivates us to filter the samples whose rewards may be unreliable to improve the signal-to-noise ratio during policy learning, resulting in Policy Filtration for Proximal Policy Optimization (PF-PPO). To choose a proper policy filtering strategy, we use the coefficient of determination ($R^2$) between the rewards and actual scores on filtered samples as the metric to help us find promising strategies, since it measures how well the rewards filtered by PF-PPO indicate real performance. We provide extensive experiments to validate the effectiveness of PF-PPO in code generation and math reasoning tasks. In code generation, PF-PPO achieves state-of-the-art performance among 7-billion-parameter models on HumanEval (+7.9%), MBPP (+0.7%), and LeetCode Contest (+10.0%), a more challenging benchmark created by us. In math reasoning, PF-PPO yields performance increases using different reward models and benchmarks (Ape210K and CMATH).
Paperid:2262
Authors:Andrew Draganov · Sharvaree Vadgama · Sebastian Damrich · Jan Böhm · Lucas Maes · Dmitry Kobak · Erik Bekkers
Abstract:
Self-supervised learning (SSL) allows training data representations without a supervised signal and has become an important paradigm in machine learning. Most SSL methods employ the cosine similarity between embedding vectors and hence effectively embed data on a hypersphere. While this seemingly implies that embedding norms cannot play any role in SSL, a few recent works have suggested that embedding norms have properties related to network convergence and confidence. In this paper, we resolve this apparent contradiction and systematically establish the embedding norm's role in SSL training. Using theoretical analysis, simulations, and experiments, we show that embedding norms (i) govern SSL convergence rates and (ii) encode network confidence, with smaller norms corresponding to unexpected samples. Additionally, we show that manipulating embedding norms can have large effects on convergence speed. Our findings demonstrate that SSL embedding norms are integral to understanding and optimizing network behavior.
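The apparent contradiction this abstract resolves can be seen in a two-line check: the cosine similarity used by most SSL losses is invariant to rescaling an embedding, so the norm carries information the loss itself never constrains. A toy illustration (vectors and names are ours, not the paper's):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity, the quantity most SSL losses are built on."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([3.0, 4.0])
v = np.array([1.0, 0.0])

# Scale-invariance: the SSL loss cannot "see" the norm of u ...
assert np.isclose(cosine(u, v), cosine(10.0 * u, v))

# ... yet the norm is still a well-defined per-sample quantity,
# which the paper links to convergence speed and confidence.
print(np.linalg.norm(u))  # 5.0
```

This is why the norm can encode side information (e.g. sample familiarity) without ever being directly optimized.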
Paperid:2263
Authors:Han Shao · Shuo Xie · Kunhe Yang
Abstract:
Strategic classification addresses a learning problem where a decision-maker implements a classifier over agents who may manipulate their features in order to receive favorable predictions. In the standard model of online strategic classification, in each round, the decision-maker implements and publicly reveals a classifier, after which agents perfectly best respond based on this knowledge. However, in practice, whether to disclose the classifier is often debated---some decision-makers believe that hiding the classifier can prevent misclassification errors caused by manipulation. In this paper, we formally examine how limiting the agents' access to the current classifier affects the decision-maker's performance. Specifically, we consider an extended online strategic classification setting where agents lack direct knowledge about the current classifier and instead manipulate based on a weighted average of historically implemented classifiers. This model captures a wide range of agent behaviors with varying degrees of memory of past classifiers. Our main result shows that, contrary to the conventional wisdom, restricting agents' access to the current classifier can *significantly increase* the decision-maker's mistakes compared to the full-knowledge setting. In particular, we prove that the decision-maker incurs $\Omega(\max\{k_{\text{in}}, 1/\log(\frac{1}{\gamma})\})$ times more mistakes, where $k_{\text{in}}$ is the maximum in-degree of the manipulation graph (representing how many distinct feature vectors can be manipulated to appear as a single one), and $\gamma$ is the discount factor indicating agents' memory of past classifiers. Our results demonstrate how withholding access to the classifier can backfire and degrade the decision-maker's performance in online strategic classification.
Paperid:2264
Authors:Lan-Cuong Nguyen · Quan Nguyen-Tri · Bang Khanh · Dung D. Le · Long Tran-Thanh · Khoat Than
Abstract:
Few-shot image classification remains challenging due to the scarcity of labeled training examples. Augmenting them with synthetic data has emerged as a promising approach to alleviate this issue, but models trained on synthetic samples often face performance degradation due to the inherent gap between real and synthetic distributions. To address this limitation, we develop a theoretical framework that quantifies the impact of such distribution discrepancies on supervised learning, specifically in the context of image classification. Building upon this framework, we propose a novel theoretically grounded algorithm that integrates prototype learning to optimize both data partitioning and model training, effectively bridging the gap between real few-shot data and synthetic data. Extensive experimental results show that our approach outperforms state-of-the-art methods across multiple datasets.
Paperid:2265
Authors:Kanghee Park · Timothy Zhou · Loris D'Antoni
Abstract:
Large Language Models (LLMs) are often asked to generate structured outputs that obey precise syntactic rules, such as code snippets or formatted data. Grammar-constrained decoding (GCD) can guarantee that LLM outputs match such rules by masking out tokens that will provably lead to outputs that do not belong to a specified context-free grammar (CFG). To guarantee soundness, GCD algorithms have to compute how a given LLM subword tokenizer can ``align'' with the tokens used by a given context-free grammar and compute token masks based on this information. Doing so efficiently is challenging, and existing GCD algorithms require tens of minutes to preprocess common grammars. We present a new GCD algorithm, together with an implementation, that offers 17.71x faster offline preprocessing than existing approaches while preserving state-of-the-art efficiency in online mask computation.
Paperid:2266
Authors:Junyi Liao · Zihan Zhu · Ethan Fang · Zhuoran Yang · Vahid Tarokh
Abstract:
Estimating the unknown reward functions driving agents' behavior is a central challenge in inverse games and reinforcement learning. This paper introduces a unified framework for reward function recovery in two-player zero-sum matrix games and Markov games with entropy regularization. Given observed player strategies and actions, we aim to reconstruct the underlying reward functions. This task is challenging due to the inherent ambiguity of inverse problems, the non-uniqueness of feasible rewards, and limited observational data coverage. To address these challenges, we establish reward function identifiability using the quantal response equilibrium (QRE) under linear assumptions. Building on this theoretical foundation, we propose an algorithm to learn reward functions from observed actions, designed to capture all plausible reward parameters by constructing confidence sets. Our algorithm works in both static and dynamic settings and is adaptable to incorporate other methods, such as Maximum Likelihood Estimation (MLE). We provide strong theoretical guarantees for the reliability and sample efficiency of our algorithm. Empirical results demonstrate the framework's effectiveness in accurately recovering reward functions across various scenarios, offering new insights into decision-making in competitive environments.
Paperid:2267
Authors:Gül Sena Altıntaş · Devin Kwok · Colin Raffel · David Rolnick
Abstract:
Neural network training is inherently sensitive to initialization and the randomness induced by stochastic gradient descent. However, it is unclear to what extent such effects lead to meaningfully different networks, in terms of either the models' weights or the underlying functions that were learned. In this work, we show that during an initial ``chaotic'' phase, an extremely small perturbation reliably causes otherwise identical training trajectories to diverge, an effect that diminishes rapidly over training time. We quantify this divergence through (i) $L^2$ distance between parameters, (ii) the similarity of parameter vectors as measured by permutation alignment, and, most importantly, (iii) the loss barrier when interpolating between networks, revealing how perturbations across different hyperparameter or fine-tuning settings drive training trajectories toward distinct loss minima. Our findings provide insights into neural network training stability, with practical implications for fine-tuning and model merging techniques.
Paperid:2268
Authors:Joonkyu Kim · Yejin Kim · Jy-yong Sohn
Abstract:
In continual learning scenarios, catastrophic forgetting of previously learned tasks is a critical issue, making it essential to effectively measure such forgetting. Recently, there has been growing interest in representation forgetting, the forgetting measured at the hidden layer. In this paper, we provide the first theoretical analysis of representation forgetting and use this analysis to better understand the behavior of continual learning. First, we introduce a new metric called representation discrepancy, which measures the difference between the representation spaces constructed by two snapshots of a model trained through continual learning. We demonstrate that our proposed metric serves as an effective surrogate for representation forgetting while remaining analytically tractable. Second, through mathematical analysis of our metric, we derive several key findings about the dynamics of representation forgetting: forgetting occurs more rapidly and to a higher degree as the layer index increases, while increasing the width of the network slows down the forgetting process. Third, we support our theoretical findings through experiments on real image datasets, including Split-CIFAR100 and ImageNet-1K.
Paperid:2269
Authors:Yujia Yin · Tianyi Qu · Zihao Wang · Yifan Chen
Abstract:
Through recognizing causal subgraphs, causal graph learning (CGL) has risen to be a promising approach for improving the generalizability of graph neural networks under out-of-distribution (OOD) scenarios. However, the empirical successes of CGL techniques are mostly exemplified in classification settings, while regression tasks, a more challenging setting in graph learning, are overlooked. We thus devote this work to tackling causal graph regression (CGR); to this end we reshape the processing of confounding effects in existing CGL studies, which mainly deal with classification. Specifically, we reflect on the predictive power of confounders in graph-level regression, and generalize classification-specific causal intervention techniques to regression through a lens of contrastive learning. Extensive experiments on graph OOD benchmarks validate the efficacy of our proposals for CGR. The model implementation and the code are provided in the supplementary materials and scheduled to be open-sourced upon publication.
Paperid:2270
Authors:Yuanhe Zhang · Fanghui Liu · Yudong Chen
Abstract:
This paper studies how to improve the performance of Low-Rank Adaptation (LoRA) as guided by our theoretical analysis. Our first set of theoretical results shows that for random initialization and linear models, (i) LoRA will align with a certain singular subspace of the one-step gradient of full fine-tuning; (ii) preconditioners improve convergence in the high-rank case. These insights motivate us to focus on preconditioned LoRA using a specific spectral initialization strategy for aligning with certain subspaces. For both linear and nonlinear models, we prove that alignment and generalization guarantees can be directly achieved at initialization, and the subsequent linear convergence can also be established. Our analysis leads to the LoRA-One algorithm (using One-step gradient and preconditioning), a theoretically grounded algorithm that achieves significant empirical improvement over vanilla LoRA and its variants on several benchmarks. Our theoretical analysis, based on decoupling the learning dynamics and characterizing how spectral initialization contributes to feature learning, may be of independent interest for understanding matrix sensing and deep learning theory. The source code can be found in the supplemental materials.
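For intuition only, spectral initialization from a one-step gradient can be sketched on a toy linear model: take the gradient of full fine-tuning at the starting point, and initialize the LoRA factors from its top singular subspace. The notation (W0, B, A) and the square-root scaling below are our illustrative assumptions, not the LoRA-One implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2

# Toy linear model: loss = ||(W0 + B @ A) X - Y||^2 / 2. The gradient of
# full fine-tuning evaluated at B @ A = 0 is G = (W0 X - Y) X^T.
W0 = rng.normal(size=(d_out, d_in))
X = rng.normal(size=(d_in, 32))
Y = rng.normal(size=(d_out, 32))
G = (W0 @ X - Y) @ X.T

# Spectral initialization: align the rank-r factors B (d_out x r) and
# A (r x d_in) with the top-r singular subspace of the one-step gradient.
U, S, Vt = np.linalg.svd(G, full_matrices=False)
B = U[:, :r] * np.sqrt(S[:r])
A = np.sqrt(S[:r])[:, None] * Vt[:r]

# By construction, B @ A is the best rank-r approximation of G.
err = np.linalg.norm(G - B @ A)
best = np.linalg.norm(G - U[:, :r] @ np.diag(S[:r]) @ Vt[:r])
print(np.isclose(err, best))
```

Starting the factors in this subspace is what yields the alignment-at-initialization property the abstract refers to.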
Paperid:2271
Authors:Felipe Areces · Christopher Mohri · Tatsunori Hashimoto · John Duchi
Abstract:
We introduce a family of algorithms for online conformal prediction with coverage guarantees for both adversarial and stochastic data. In the adversarial setting, we establish the standard guarantee: over time, a pre-specified target fraction of confidence sets cover the ground truth. For stochastic data, we provide a guarantee at every time instead of just on average over time: the probability that a confidence set covers the ground truth—conditioned on past observations—converges to a pre-specified target when the conditional quantiles of the errors are a linear function of past data. Complementary to our theory, our experiments spanning over $15$ datasets suggest that the performance improvement of our methods over baselines grows with the magnitude of the data's dependence, even when baselines are tuned on the test set. We put these findings to the test by pre-registering an experiment for electricity demand forecasting in Texas, where our algorithms achieve over a $10$\% reduction in confidence set sizes, a more than $30$\% improvement in quantile and absolute losses with respect to the observed errors, and significant outcomes on all $78$ out of $78$ pre-registered hypotheses.
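For context, the simplest baseline in this online-conformal literature is a quantile-tracking update that nudges the conformal threshold up after each miss and down after each cover, so long-run miscoverage tracks a target rate. The sketch below shows that classic baseline (in the spirit of adaptive conformal inference) on synthetic scores; it is a reference point, not the paper's proposed algorithms.

```python
import numpy as np

def quantile_track(scores, alpha=0.1, lr=0.1):
    """Online quantile tracking for conformal thresholds: after each
    round, move the threshold q up by lr*(1-alpha) on a miss and down
    by lr*alpha on a cover, so the long-run miss rate approaches alpha.
    A generic baseline, not the paper's algorithms."""
    q, errs = 0.0, []
    for s in scores:
        err = float(s > q)        # 1 if the conformal set failed to cover
        errs.append(err)
        q += lr * (err - alpha)   # raise q after a miss, shrink otherwise
    return q, float(np.mean(errs))

rng = np.random.default_rng(1)
scores = rng.exponential(size=5000)   # synthetic nonconformity scores
q, miscoverage = quantile_track(scores, alpha=0.1)
print(q, miscoverage)  # q near the 90th percentile, miscoverage near 0.1
```

The update needs no distributional assumptions; the paper's contribution is the stronger every-time (rather than on-average) guarantee under stochastic data.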
Paperid:2272
Authors:Leshem Choshen · Yang Zhang · Jacob Andreas
Abstract:
Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets. This provides an efficient way for practitioners and researchers alike to compare pre-training decisions involving optimizers, datasets, and model architectures. Despite the widespread use of scaling laws to model the dynamics of language model training, there has been little work on understanding how to best estimate and interpret them. We collect (and release) a large-scale dataset containing losses and downstream evaluations for 485 previously published pretrained models. We use these to estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families. We find that fitting scaling laws to intermediate checkpoints of training runs (and not just their final losses) substantially improves accuracy, and that—all else equal—estimates of performance are generally most accurate when derived from other models of similar sizes. However, because there is a significant degree of variability across model seeds, training multiple small models is sometimes more useful than training a single large one. Moreover, while different model families differ in scaling behavior, they are often similar enough that a target model's behavior can be predicted from a single model with the same architecture, along with scaling parameter estimates derived from other model families.
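To make the estimation task concrete: fitting the basic power-law form $L(N) = a N^{-b}$ reduces to linear regression in log-log space. The data below are synthetic and this two-parameter form is a simplification of the saturating laws (with an irreducible-loss term) that such studies typically fit.

```python
import numpy as np

# Hypothetical loss measurements at several model sizes N (parameters).
N = np.array([1e6, 1e7, 1e8, 1e9])
L = 5.0 * N ** -0.1   # synthetic data: true a = 5.0, b = 0.1, no noise

# log L = log a - b * log N, so a straight-line fit recovers (a, b).
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
a_hat, b_hat = np.exp(intercept), -slope

# Extrapolate to a larger target model, the core use of a scaling law.
L_pred = a_hat * (1e10) ** -b_hat
print(a_hat, b_hat, L_pred)
```

With noisy real measurements, the fit quality and the choice of checkpoints to include are exactly the methodological questions the paper studies.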
Paperid:2273
Authors:Kang-Jun Liu · Masanori Suganuma · Takayuki Okatani
Abstract:
We present a novel self-supervised feature learning method using Vision Transformers (ViT) as the backbone, specifically designed for object detection and instance segmentation. Our approach addresses the challenge of extracting features that capture both class and positional information, which are crucial for these tasks. The method introduces two key components: (1) a positional encoding tied to the cropping process in contrastive learning, which utilizes a novel vector field representation for positional embeddings; and (2) masking and prediction, similar to conventional Masked Image Modeling (MIM), applied in parallel to both content and positional embeddings of image patches. These components enable the effective learning of intertwined content and positional features. We evaluate our method against state-of-the-art approaches, pre-training on ImageNet-1K and fine-tuning on downstream tasks. Our method outperforms the state-of-the-art SSL methods on the COCO object detection benchmark, achieving significant improvements with fewer pre-training epochs. These results suggest that better integration of positional information into self-supervised learning can improve performance on dense prediction tasks.
Paperid:2274
Authors:Akhil Jalan · Yassir Jedra · Arya Mazumdar · Soumendu Sundar Mukherjee · Purnamrita Sarkar
Abstract:
We study transfer learning for matrix completion in a Missing Not-at-Random (MNAR) setting that is motivated by biological problems. The target matrix $Q$ has entire rows and columns missing, making estimation impossible without side information. To address this, we use a noisy and incomplete source matrix $P$, which relates to $Q$ via a feature shift in latent space. We consider both the *active* and *passive* sampling of rows and columns. We establish minimax lower bounds for entrywise estimation error in each setting. Our computationally efficient estimation framework achieves this lower bound for the active setting, which leverages the source data to query the most informative rows and columns of $Q$. This avoids the need for *incoherence* assumptions required for rate optimality in the passive sampling setting. We demonstrate the effectiveness of our approach through comparisons with existing algorithms on real-world biological datasets.
Paperid:2275
Authors:Jake Tuero · Michael Buro · Levi Lelis
Abstract:
Policy tree search is a family of tree search algorithms that use a policy to guide the search. The novelty of these algorithms is that they provide guarantees on the number of expansions required to solve a given problem that are related to the quality of the policy. While these algorithms have shown promising results, the process by which they are trained requires complete solution trajectories to train the policy. Search trajectories are obtained during a trial-and-error search process. When the training problem instances are hard, learning can be prohibitively costly when starting from a randomly initialized policy. As a result, search samples are wasted in failed attempts to solve these hard instances. In this paper, we introduce a novel method for learning subgoal-based policies for policy tree search algorithms. The subgoals and policies conditioned on subgoals are learned from the trees the search expands while attempting to solve problems, including the search trees of failed attempts. We empirically show that our policy formulation and training method improve the sample efficiency of learning a policy and heuristic function in this online setting.
Paperid:2276
Authors:Yuyang Hu · Albert Peng · Weijie Gan · Peyman Milanfar · Mauricio Delbracio · Ulugbek Kamilov
Abstract:
Deep neural networks trained as image denoisers are widely used as priors for solving imaging inverse problems. We introduce Stochastic deep Restoration Priors (ShaRP), a novel framework that stochastically leverages an ensemble of deep restoration models beyond denoisers to regularize inverse problems. By using generalized restoration models trained on a broad range of degradations beyond simple Gaussian noise, ShaRP effectively addresses structured artifacts and enables self-supervised training without fully sampled data. We prove that ShaRP minimizes an objective function involving a regularizer derived from the score functions of minimum mean square error (MMSE) restoration operators. We also provide theoretical guarantees for learning restoration operators from incomplete measurements. ShaRP achieves state-of-the-art performance on tasks such as magnetic resonance imaging reconstruction and single-image super-resolution, surpassing both denoiser- and diffusion-model-based methods without requiring retraining.
Paperid:2277
Authors:Wenwen He · Wenke Huang · Bin Yang · ShuKan Liu · Mang Ye
Abstract:
Federated Learning (FL) is a distributed learning paradigm that facilitates collaborative user updates while ensuring the privacy of individual data. Despite its advantages, FL is susceptible to backdoor attacks, wherein malicious participants can inject harmful updates into the target model, resulting in degraded performance of the global model on specific inputs. Existing defense mechanisms are predicated on two assumptions: the isolated nature of individual behaviors and passive purification through well-designed defense rules. However, malicious clients can adapt their adversarial strategies to circumvent these established defense protocols. To address the individual isolation protocol, we propose a marginal collaboration defense mechanism, called SPMC, which leverages intrinsic consistency across clients to determine inter-client margin contributions, adaptively reducing the impact of changes in the number of attackers. As for the proxy-dependent protocol, we self-purify the parameter distribution of potential malicious actors, enhancing the performance of the model in the direction of benign updates. Specifically, we assert that even if an attacker is reassigned within the aggregation model, they can still exert a local influence on the global model through harmful gradients. In response, we locally update the gradients to align with the marginal knowledge, thereby mitigating the risk of local poisoning. With the above two modules, our approach enhances the flexibility of defense mechanisms on the server side compared to the client side, thereby improving the robustness of the federated system. Experimental results across various image classification tasks demonstrate the effectiveness of SPMC.
Paperid:2278
Authors:Xuandong Zhao · Will Cai · David Huang · Tianneng Shi · Licong Lin · Song Mei · Dawn Song
Abstract:
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. Direct preference optimization (DPO), a widely deployed alignment method, exhibits limitations in both experimental and theoretical contexts as its loss function proves suboptimal for refusal learning. Through gradient-based analysis, we identify these shortcomings and propose an improved safety alignment method that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge. This approach significantly increases LLM robustness against a wide range of jailbreak attacks, including prefilling, suffix, and multi-turn attacks across both in-distribution and out-of-distribution scenarios. Furthermore, we introduce a method to emphasize critical refusal tokens by incorporating a reward-based token-level weighting mechanism for refusal learning, which further improves the robustness against adversarial exploits. Our research also suggests that robustness to jailbreak attacks is correlated with token distribution shifts in the training process and internal representations of refusal and harmful tokens, offering valuable directions for future research in LLM safety alignment.
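As a reference point for the DPO shortcomings discussed above, here is a minimal sketch of the standard DPO loss (the baseline the abstract critiques, not the paper's improved variant); the response log-probabilities are illustrative numbers:

```python
import numpy as np

# Standard DPO loss: negative log-sigmoid of the beta-scaled margin between
# the policy/reference log-probability gaps of a preferred response (here, a
# refusal) and a dispreferred (harmful) response. All values are illustrative.
def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)

# A model that prefers the refusal (first call) incurs a lower loss than one
# that prefers the harmful response (second call).
print(dpo_loss(-5.0, -9.0, -6.0, -8.0) < dpo_loss(-9.0, -5.0, -8.0, -6.0))
```

The paper's gradient-based analysis operates on this objective's two log-probability terms, which is what makes the disentangling into refusal training and unlearning possible.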
Paperid:2279
Authors:Velibor Bojkovic · Xiaofeng Wu · Bin Gu
Abstract:
Spiking Neural Networks (SNNs) offer a more energy-efficient alternative to Artificial Neural Networks (ANNs) by mimicking biological neural principles, establishing them as a promising approach to mitigate the increasing energy demands of large-scale neural models. However, fully harnessing the capabilities of SNNs remains challenging due to their discrete signal processing and temporal dynamics. ANN-SNN conversion has emerged as a practical approach, enabling SNNs to achieve competitive performance on complex machine learning tasks. In this work, we identify a phenomenon in the ANN-SNN conversion framework, termed temporal misalignment, in which random spike rearrangement across SNN layers leads to performance improvements. Based on this observation, we introduce biologically plausible two-phase probabilistic (TPP) spiking neurons, further enhancing the conversion process. We demonstrate the advantages of our proposed method both theoretically and empirically through comprehensive experiments on CIFAR-10/100, CIFAR10-DVS, and ImageNet across a variety of architectures, achieving state-of-the-art results.
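To see why spike rearrangement across time can be harmless, consider this toy sketch (our construction, not the paper's code): randomly permuting the time axis of a spike train leaves each neuron's firing rate, and hence rate-coded information, unchanged:

```python
import numpy as np

# Toy "temporal misalignment": randomly rearrange spikes across time steps.
# A permutation of the time axis preserves each neuron's firing rate, so
# rate-coded information survives the rearrangement.
rng = np.random.default_rng(0)
T, N = 16, 8                                      # time steps, neurons
spikes = (rng.random((T, N)) < 0.3).astype(int)   # binary spike trains

perm = rng.permutation(T)                         # random time rearrangement
shuffled = spikes[perm]

rates_before = spikes.mean(axis=0)
rates_after = shuffled.mean(axis=0)
print(np.allclose(rates_before, rates_after))     # True: rates unchanged
```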
Paperid:2280
Authors:Zhuo He · Shuang Li · Wenze Song · Longhui Yuan · Jian Liang · Han Li · Kun Gai
Abstract:
Endowing deep models with the ability to generalize in dynamic scenarios is of vital significance for real-world deployment, given the continuous and complex changes in data distribution. Recently, evolving domain generalization (EDG) has emerged to address distribution shifts over time, aiming to capture evolving patterns for improved model generalization. However, existing EDG methods may suffer from spurious correlations by modeling only the dependence between data and targets across domains, creating a shortcut between task-irrelevant factors and the target, which hinders generalization. To this end, we design a time-aware structural causal model (SCM) that incorporates dynamic causal factors and causal mechanism drifts, and propose Static-DYNamic Causal Representation Learning (SYNC), an approach that effectively learns time-aware causal representations. Specifically, it integrates specially designed information-theoretic objectives into a sequential VAE framework which captures evolving patterns, and produces the desired representations by preserving intra-class compactness of causal factors both across and within domains. Moreover, we theoretically show that our method can yield the optimal causal predictor for each time domain. Results on both synthetic and real-world datasets show that SYNC achieves superior temporal generalization performance.
Paperid:2281
Authors:Freya Behrens · Luca Biggio · Lenka Zdeborova
Abstract:
Next to scaling considerations, architectural design choices profoundly shape the solution space of transformers. In this work, we analyze the solutions simple transformer blocks implement when tackling the histogram task: counting items in sequences. Despite its simplicity, this task reveals a complex interplay between predictive performance, vocabulary and embedding sizes, token-mixing mechanisms, and feed-forward layer capacity. We identify two theoretical counting strategies transformers adopt, relation-based and inventory-based counting, each defining distinct learning regimes for the task. These strategies dictate how functionality is distributed between attention and feed-forward layers. We further show that adding softmax and beginning-of-sequence tokens allows for more robustness when embedding dimensions are comparatively small. Empirical introspection of trained models closely confirms both the learning regimes of the various architectures and the formation of these strategies during training. We demonstrate how a basic task that requires only aggregation and selection is significantly impacted by minor design changes.
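A toy sketch (our illustration, with hypothetical token values) of the relation-based strategy: if each position attends uniformly to the tokens equal to its own, the histogram count is simply the row sum of a token-equality "attention" matrix:

```python
import numpy as np

# Relation-based counting in miniature: the count of each item equals the
# number of positions whose token matches the query position's token.
seq = np.array([2, 7, 2, 2, 5, 7])                    # hypothetical tokens
match = (seq[:, None] == seq[None, :]).astype(float)  # equality "attention"
counts = match.sum(axis=1)                            # histogram per position
print(counts)  # [3. 2. 3. 3. 1. 2.]
```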
Paperid:2282
Authors:Tianle Li · Wei-Lin Chiang · Evan Frick · Lisa Dunlap · Tianhao Wu · Banghua Zhu · Joseph E Gonzalez · Ion Stoica
Abstract:
The rapid evolution of Large Language Models (LLMs) has outpaced the development of model evaluation, highlighting the need for continuous curation of new, challenging benchmarks. However, manual curation of high-quality, human-aligned benchmarks is expensive and time-consuming. To address this, we introduce Bench-O-Matic, an automated pipeline that leverages LLMs to curate high-quality, open-ended prompts from large, crowd-sourced datasets, enabling continuous benchmark updates without a human in the loop. We apply Bench-O-Matic to datasets such as Chatbot Arena and WildChat-1M, extracting challenging prompts and utilizing LLM-as-a-Judge for automatic model evaluation. To validate benchmark quality, we propose new metrics to measure a benchmark's alignment with human preferences and ability to separate models. We release Eval-O-Matic, a benchmark consisting of 500 challenging prompts curated by Bench-O-Matic. Eval-O-Matic provides 3x higher separation of model performances compared to MT-Bench and achieves 98.6% correlation with human preference rankings, all at a cost of $20. Our work sets a new framework for the scalable curation of automated benchmarks from extensive data.
Paperid:2283
Authors:Shuqi Gu · Chuyue Li · Baoyu Jing · Kan Ren
Abstract:
Time series synthesis has become a foundational task in modern society, underpinning decision-making across various scenarios. Recent approaches primarily generate time series from structured conditions, such as attribute-based metadata. However, these methods struggle to capture the full complexity of time series, as the predefined structures often fail to reflect intricate temporal dynamics or other nuanced characteristics. Moreover, constructing structured metadata requires expert knowledge, making large-scale data labeling costly and impractical. In this paper, we introduce VerbalTS, a novel framework for generating time series from unstructured textual descriptions, offering a more expressive and flexible solution to time series synthesis. To bridge the gap between unstructured text and time series data, VerbalTS employs a multi-focal alignment and generation framework, effectively modeling their complex relationships. Experiments on two synthetic and four real-world datasets demonstrate that VerbalTS outperforms existing methods in both generation quality and semantic alignment with textual conditions.
Paperid:2284
Authors:Xujie Song · Liangfa Chen · Tong Liu · Wenxuan Wang · Yinuo Wang · Shentao Qin · Yinsong Ma · Jingliang Duan · Shengbo Li
Abstract:
Deep reinforcement learning (RL) is an effective approach for decision-making and control tasks. However, RL-trained policy networks often suffer from the action fluctuation problem in real-world applications, resulting in severe actuator wear, safety risks, and performance degradation. In this paper, we identify the two fundamental causes of action fluctuation: observation noise and policy non-smoothness. We then propose a novel policy network, LipsNet++, integrating a Fourier filter layer and a Lipschitz controller layer to mitigate these two causes in a decoupled manner. The filter layer incorporates a trainable filter matrix that automatically extracts important frequencies while suppressing noise frequencies in the observations. The controller layer introduces a Jacobian regularization technique to achieve a low Lipschitz constant, ensuring smooth fitting of the policy function. These two layers function analogously to the filter and controller in classical control theory, suggesting that filtering and control capabilities can be seamlessly integrated into a single policy network. Both simulated and real-world experiments demonstrate that LipsNet++ achieves state-of-the-art noise robustness and action smoothness.
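For intuition about the Lipschitz controller layer, consider a toy linear policy (our sketch, not the paper's implementation): its Lipschitz constant is the spectral norm of its weight matrix, and a subgradient step on that norm shrinks it, smoothing the policy's response to observation noise:

```python
import numpy as np

# Toy illustration: for a linear policy a = W @ s, the Lipschitz constant is
# the spectral norm (largest singular value) of W. Penalising it directly
# reduces the policy's sensitivity to perturbations of the observation.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))          # 4-dim observation -> 2-dim action

def lipschitz(W):
    return np.linalg.norm(W, 2)      # ord=2 gives the largest singular value

# Subgradient of the spectral norm is u1 @ v1^T (top singular vectors);
# a small step along its negative shrinks the top singular value.
U, s, Vt = np.linalg.svd(W)
W_new = W - 0.1 * np.outer(U[:, 0], Vt[0])

print(lipschitz(W_new) < lipschitz(W))
```

LipsNet++ applies the analogous idea to a deep network by regularizing the policy's Jacobian rather than a single weight matrix.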
Paperid:2285
Authors:Deepak Ravikumar · Efstathia Soufleri · Abolfazl Hashemi · Kaushik Roy
Abstract:
Deep learning has become the de facto approach in nearly all learning tasks. It has been observed that deep models tend to memorize and sometimes overfit data, which can lead to compromises in performance, privacy, and other critical metrics. In this paper, we explore the theoretical foundations that connect memorization to sample loss, focusing on learning dynamics to understand what and how deep models memorize. To this end, we introduce a novel proxy for memorization: Cumulative Sample Loss (CSL). CSL represents the accumulated loss of a sample throughout the training process. CSL exhibits remarkable similarity to stability-based memorization, as evidenced by considerably high cosine similarity scores. We delve into the theory behind these results, demonstrating that low CSL leads to nontrivial bounds on the extent of stability-based memorization and learning time. The proposed proxy, CSL, is four orders of magnitude less computationally expensive than the stability-based method and can be obtained with zero additional overhead during training. We demonstrate the practical utility of the proposed proxy in identifying mislabeled samples and detecting duplicates, where our metric achieves state-of-the-art performance.
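The proxy itself is cheap to compute; a toy sketch (our construction, with simulated per-sample losses) shows how a hard-to-fit, e.g. mislabeled, sample accumulates the largest CSL:

```python
import numpy as np

# Toy CSL: sum each sample's loss over training epochs (zero extra cost
# during training), then rank samples. Mislabeled samples keep a high loss
# for longer and therefore accumulate more.
n_samples, n_epochs = 6, 50

# Simulated loss curves: clean samples decay fast; sample 0 barely learns.
decay = np.full(n_samples, 0.9)
decay[0] = 0.999
losses = np.array([decay ** t for t in range(n_epochs)])  # (epochs, samples)

csl = losses.sum(axis=0)          # cumulative sample loss per sample
suspect = int(np.argmax(csl))
print(suspect)                    # sample 0 accumulates the most loss
```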
Paperid:2286
Authors:Quansong He · Xiangde Min · Kaishen Wang · Tao He
Abstract:
Medical image segmentation is a critical task in computer vision, with UNet serving as a milestone architecture. However, UNet's skip connections face two significant limitations: (1) they lack effective interaction between features at different scales, and (2) they rely on simple concatenation or addition operations, which constrain efficient information integration. While recent improvements to UNet have focused on enhancing encoder and decoder capabilities, these limitations remain overlooked. To overcome these challenges, we propose a novel multi-scale feature fusion method that reimagines the UNet decoding process as solving an initial value problem (IVP), treating skip connections as discrete nodes. By leveraging principles from the linear multi-step method, we propose an adaptive ordinary differential equation method to enable effective multi-scale feature fusion. Our approach is independent of encoder and decoder designs, making it applicable to any U-like network. Experiments on ACDC, KiTS2023, MSD brain tumor, and ISIC2017/2018 skin lesion segmentation datasets demonstrate improved feature utilization, reduced network parameters, and maintained high performance. The code will be made available on GitHub.
Paperid:2287
Authors:Thore Gerlach · Loong Kuan Lee · Frederic BARBARESCO · Nico Piatkowski
Abstract:
Multi-Agent Path Finding (MAPF) focuses on determining conflict-free paths for multiple agents navigating through a shared space to reach specified goal locations. This problem becomes computationally challenging, particularly when handling large numbers of agents, as frequently encountered in practical applications like coordinating autonomous vehicles. Quantum computing (QC) is a promising candidate for overcoming such limits. However, current quantum hardware is still in its infancy and thus limited in terms of computing power and error robustness. In this work, we present the first optimal hybrid quantum-classical MAPF algorithms, which are based on branch-and-cut-and-price. QC is integrated by iteratively solving QUBO problems based on conflict graphs. Experiments on actual quantum hardware and results on benchmark data suggest that our approach dominates previous QUBO formulations and baseline MAPF solvers.
Paperid:2288
Authors:Sebastian Towers · Aleksandra Kalisz · Philippe Robert · Alicia Higueruelo · Francesca Vianello · Chloe Tsai · Harrison Steel · Jakob Foerster
Abstract:
Antiviral therapies are typically designed to target only the current strains of a virus, a myopic response. However, therapy-induced selective pressures drive the emergence of new viral strains, against which the original myopic therapies are no longer effective. This evolutionary response presents an opportunity: our therapies could both defend against and actively influence viral evolution. This motivates our method ADIOS: Antibody Development vIa Opponent Shaping. ADIOS is a meta-learning framework where the process of antibody therapy design, the outer loop, accounts for the virus's adaptive response, the inner loop. With ADIOS, antibodies are not only robust against potential future variants, they also influence, i.e. shape, which future variants emerge. In line with the opponent shaping literature, we refer to our optimised antibodies as shapers. To demonstrate the value of ADIOS, we build a viral evolution simulator using the Absolut! framework, in which shapers successfully target both current and future viral variants, outperforming myopic antibodies. Furthermore, we show that shapers modify the distribution over viral evolutionary trajectories to result in weaker variants. We believe that our ADIOS paradigm will facilitate the discovery of long-lived vaccines and antibody therapies while also generalising to other domains with evolutionarily adaptive opponents, such as antimicrobial resistance and cancer treatment. Our code is available at https://anonymous.4open.science/r/adios-2AFE/.
Paperid:2289
Authors:Seohong Park · Qiyang Li · Sergey Levine
Abstract:
We present flow Q-learning (FQL), a simple and performant offline reinforcement learning (RL) method that leverages an expressive flow-matching policy to model arbitrarily complex action distributions in data. Training a flow policy with RL is a tricky problem, due to the iterative nature of the action generation process. We address this challenge by training an expressive one-step policy with RL, rather than directly guiding an iterative flow policy to maximize values. This way, we can completely avoid unstable recursive backpropagation, eliminate costly iterative action generation at test time, yet still mostly maintain expressivity. We experimentally show that FQL leads to strong performance across 73 challenging state- and pixel-based OGBench and D4RL tasks in offline RL and offline-to-online RL.
Paperid:2290
Authors:Ngoc Bui · Menglin Yang · Runjin Chen · Leonardo Neves · Mingxuan Ju · ZHITAO YING · Neil Shah · Tong Zhao
Abstract:
Backward compatible representation learning enables updated models to integrate seamlessly with existing ones, avoiding the need to reprocess stored data. Despite recent advances, existing compatibility approaches in Euclidean space neglect the uncertainty in the old embedding models and force the new model to replicate outdated representations regardless of their quality, thereby hindering the learning process. In this paper, we switch perspectives to hyperbolic geometry, where we treat time as a natural axis for capturing a model's confidence and evolution. By lifting embeddings into hyperbolic space and constraining updated embeddings to lie within the entailment cone of the old ones, we maintain generational consistency across models while accounting for uncertainties in the representations. To further enhance compatibility, we introduce a robust contrastive alignment loss that dynamically adjusts alignment weights based on the uncertainty of the old embeddings. Experiments validate the superiority of the proposed method in achieving compatibility, paving the way for more resilient and adaptable machine learning systems.
Paperid:2291
Authors:Xiangzhe Kong · Zishen Zhang · Ziting Zhang · Rui Jiao · Jianzhu Ma · Kai Liu · Wenbing Huang · Yang Liu
Abstract:
The design of target-specific molecules such as small molecules, peptides, and antibodies is vital for biological research and drug discovery. Existing generative methods are restricted to single-domain molecules, failing to address versatile therapeutic needs or utilize cross-domain transferability to enhance model performance. In this paper, we introduce Unified generative Modeling of 3D Molecules (UniMoMo), the first framework capable of designing binders of multiple molecular domains using a single model. In particular, UniMoMo unifies the representations of different molecules as graphs of blocks, where each block corresponds to either a standard amino acid or a molecular fragment. Based on these unified representations, UniMoMo utilizes a geometric latent diffusion model for 3D molecular generation, featuring an iterative full-atom autoencoder to compress blocks into latent space points, followed by an E(3)-equivariant diffusion process. Extensive benchmarks across peptides, antibodies, and small molecules demonstrate the superiority of our unified framework over existing domain-specific models, highlighting the benefits of multi-domain training.
Paperid:2292
Authors:Chuan Liu · Chunshu Wu · Ruibing Song · Ang Li · Ying Nian Wu · Tong Geng
Abstract:
Learning equations is fundamental to human understanding of the world. While modern machine learning (ML) methods are powerful equation learners, their escalating complexity and high operational costs hinder sustainable development. In contrast, natural dynamical systems effortlessly learn complex equations and solve them by instinctively evolving to low-energy states. However, existing dynamical system approaches are limited by low expressivity and a lack of training support. To this end, we propose EADS, an Expressive and self-Adaptive Dynamical System capable of accurately learning a wide spectrum of equations with extraordinary efficiency. Specifically, (1) drawing inspiration from biological dynamical systems, we integrate hierarchical architectures and heterogeneous dynamics into EADS, significantly enhancing its capacity to model complex equations. (2) We propose an efficient on-device learning method that leverages intrinsic electrical signals to update parameters, making EADS self-adaptive at a negligible cost. Experimental results across key equations from diverse domains demonstrate that EADS achieves higher accuracy than existing works, while offering orders-of-magnitude speedups and energy efficiency over traditional neural network solutions on GPUs for both inference and training, showcasing its broader impact in overcoming computational bottlenecks across various fields.
Paperid:2293
Authors:Ali Bahri · Moslem Yazdanpanah · Sahar Dastani · Mehrdad Noori · Gustavo Vargas Hakim · David OSOWIECHI · Farzad Beizaee · Ismail Ben Ayed · Christian Desrosiers
Abstract:
Test-Time Training has emerged as a promising solution to address distribution shifts in 3D point cloud classification. However, existing methods often rely on computationally expensive backpropagation during adaptation, limiting their applicability in real-world, time-sensitive scenarios. In this paper, we introduce SMART-PC, a skeleton-based framework that enhances resilience to corruptions by leveraging the geometric structure of 3D point clouds. During pre-training, our method predicts skeletal representations, enabling the model to extract robust and meaningful geometric features that are less sensitive to corruptions, thereby improving adaptability to test-time distribution shifts. Unlike prior approaches, SMART-PC achieves real-time adaptation by eliminating backpropagation and updating only BatchNorm statistics, resulting in a lightweight and efficient framework capable of achieving high frame-per-second rates while maintaining superior classification performance. Extensive experiments on benchmark datasets, including ModelNet40-C, ShapeNet-C, and ScanObjectNN-C, demonstrate that SMART-PC achieves state-of-the-art results, outperforming existing methods such as MATE in terms of both accuracy and computational efficiency.
Paperid:2294
Authors:Yuwei Liu · Jiaming Zhuo · Di Jin · Chuan Wang · Zhen Wang · Xiaochun Cao
Abstract:
Brain network analysis plays a critical role in brain disease prediction and diagnosis, and graph mining tools have made remarkable progress. Graph neural networks (GNNs) and Transformers, which rely on the message-passing scheme, recently dominated this field due to their powerful expressive ability on graph data. Unfortunately, by considering brain network construction using pairwise Pearson's coefficients between any pairs of ROIs, model analysis and experimental verification reveal that \textit{the message-passing under both GNNs and Transformers can NOT be fully explored and exploited}. Surprisingly, this paper observes significant performance enhancements of the Hadamard product compared to the matrix product, which is the matrix form of message passing, in processing the brain network. Inspired by this finding, a novel Brain Quadratic Network (BQN) is proposed by incorporating quadratic networks, which possess better universal approximation properties. Theoretical analysis demonstrates that BQN implicitly performs community detection along with representation learning. Extensive evaluations verify the superiority of the proposed BQN compared to message-passing-based brain network modeling.
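A small sketch (our illustration) of the contrast the abstract draws, on a toy Pearson-correlation brain network: the matrix product mixes features across ROIs (message passing), while the Hadamard product acts purely element-wise:

```python
import numpy as np

# Toy brain network: A is a Pearson-correlation connectivity matrix over
# ROIs; X holds node features. A @ X is the matrix form of message passing
# (each row aggregates features from all ROIs); A * X is the element-wise
# Hadamard product, which performs no cross-ROI mixing. Shapes are chosen
# square so both operations are defined on the same operands.
rng = np.random.default_rng(0)
n_roi, d = 5, 5
A = np.corrcoef(rng.normal(size=(n_roi, 20)))   # Pearson connectivity
X = rng.normal(size=(n_roi, d))                 # node features

message_passing = A @ X     # neighbourhood aggregation across ROIs
hadamard = A * X            # element-wise gating
print(message_passing.shape, hadamard.shape)
```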
Paperid:2295
Authors:Jian Gao · Weidong Cao · Xuan Zhang
Abstract:
The sustainable performance improvements of integrated circuits (ICs) drive the continuous advancement of nearly all transformative technologies. Since its invention, IC performance enhancements have been dominated by scaling the semiconductor technology. Yet, as Moore's law tapers off, a crucial question arises: ***How can we sustain IC performance in the post-Moore era?*** Creating new circuit topologies has emerged as a promising pathway to address this fundamental need. This work proposes **AnalogGenie-Lite**, a decoder-only **Gen**erat**i**ve **e**ngine to discover novel circuit topologies for **analog** ICs with significantly enhanced scalability and precision through **li**ghtw**e**igh**t** graph modeling. AnalogGenie-Lite makes several unique contributions, including concise device-pin representations (i.e., advancing the best prior art from $O\left(n^2\right)$ to $O\left(n\right)$), frequent sub-graph mining, and optimal sequence modeling. Thus, it remarkably pushes the frontier of generative-AI-based methods for analog circuit topology discovery, i.e., $5.15\times$ to $71.11\times$ in scalability and 23.5\% to 33.6\% improvement in validity, compared to the state-of-the-art. Case studies on social networks and protein graphs are also provided to show the broader applicability of the proposed graph modeling approach. Source code: https://github.com.
Paperid:2296
Authors:Yiming Qin · Manuel Madeira · Dorina Thanou · Pascal Frossard
Abstract:
Graph generative models are essential across diverse scientific domains by capturing complex distributions over relational data. Among them, graph diffusion models achieve superior performance but face inefficient sampling and limited flexibility due to the tight coupling between training and sampling stages. We introduce DeFoG, a novel graph generative framework that disentangles sampling from training, enabling a broader design space for more effective and efficient model optimization. DeFoG employs a discrete flow-matching formulation that respects the inherent symmetries of graphs. We theoretically ground this disentangled formulation by explicitly relating the training loss to the sampling algorithm and showing that DeFoG faithfully replicates the ground truth graph distribution. Building on these foundations, we thoroughly investigate DeFoG's design space and propose novel sampling methods that significantly enhance performance and reduce the required number of refinement steps. Extensive experiments demonstrate state-of-the-art performance across synthetic, molecular, and digital pathology datasets, covering both unconditional and conditional generation settings. It also outperforms most diffusion-based models with just 5–10\% of their sampling steps.
Paperid:2297
Authors:Bowen Deng · Lele Fu · Jialong Chen · Sheng Huang · Tianchi Liao · Zhang Tao · Chuan Chen
Abstract:
Generalized Category Discovery (GCD) aims to identify both known and novel categories in unlabeled data by leveraging knowledge from old classes. However, existing methods are limited to non-graph data and lack theoretical foundations to answer when and how known classes can help GCD. We introduce the Graph GCD task and provide the first rigorous theoretical analysis of parametric GCD. By quantifying the relationship between old and new classes in the embedding space using the Wasserstein distance W, we derive the first provable GCD loss bound based on W. This analysis highlights two necessary conditions for effective GCD. However, we uncover, through a Pairwise Markov Random Field perspective, that popular graph contrastive learning (GCL) methods inherently violate these conditions. To address this limitation, we propose SWIRL, a novel GCL method for GCD. Experimental results validate our (theoretical) findings and demonstrate SWIRL's effectiveness.
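For intuition, a toy sketch (our construction, not the paper's analysis) of the quantity the bound is built on: the Wasserstein distance between old-class and novel-class embedding samples, which in 1-D with equal sample sizes reduces to the mean absolute difference of the sorted values:

```python
import numpy as np

# 1-D Wasserstein-1 distance between two equal-size samples: match the
# sorted samples and average the absolute differences.
def wasserstein_1d(a, b):
    return np.abs(np.sort(a) - np.sort(b)).mean()

rng = np.random.default_rng(0)
old_class = rng.normal(loc=0.0, size=1000)   # embeddings of a known class
near_new = rng.normal(loc=0.5, size=1000)    # novel class close to it
far_new = rng.normal(loc=3.0, size=1000)     # novel class far from it

# A novel class whose distribution overlaps an old class yields a smaller W,
# which is where conditions on "when known classes help" come from.
print(wasserstein_1d(old_class, near_new) < wasserstein_1d(old_class, far_new))
```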
Paperid:2298
Authors:Jiale Ma · Wenzheng Pan · Yang Li · Junchi Yan
Abstract:
Despite rapid progress in neural combinatorial optimization (NCO) for solving CO problems (COPs), as the problem scale grows, several bottlenecks persist: 1) solvers in the Global Prediction (GP) paradigm struggle with long-range decisions, where overly smooth intermediate heatmaps impede effective decoding, and 2) solvers in the Local Construction (LC) paradigm are time-consuming and incapable of tackling large instances due to the onerous auto-regressive process. Observing these challenges, we propose a new paradigm named Adaptive Expansion (AE) with its instantiation COExpander, positioned to leverage the advantages of both GP and LC. COExpander utilizes informative heatmaps generated by a global predictor, which is learned under the guidance of locally determined partial solutions, to in turn direct the expansion of determined decision variables with adaptive step-sizes. To ensure transparent evaluation, we further take the lead in canonicalizing 29 benchmarks spanning 6 popular COPs (MIS, MCl, MVC, MCut, TSP, ATSP) and various scales (50-10K nodes), upon which experiments demonstrate concrete SOTA performance of COExpander over these tasks. Source code and our standardized datasets will be made public.
Paperid:2299
Authors:Seul Lee · Karsten Kreis · Srimukh Veccham · Meng Liu · Danny Reidenbach · Yuxing Peng · Saee Paliwal · Weili Nie · Arash Vahdat
Abstract:
Drug discovery is a complex process that involves multiple stages and tasks. However, existing molecular generative models can only tackle some of these tasks. We present Generalist Molecular generative model (GenMol), a versatile framework that uses only a single discrete diffusion model to handle diverse drug discovery scenarios. GenMol generates Sequential Attachment-based Fragment Embedding (SAFE) sequences through non-autoregressive bidirectional parallel decoding, thereby allowing the utilization of a molecular context that does not rely on a specific token ordering while having better sampling efficiency. GenMol uses fragments as basic building blocks for molecules and introduces fragment remasking, a strategy that optimizes molecules by regenerating masked fragments, enabling effective exploration of chemical space. We further propose molecular context guidance (MCG), a guidance method tailored for the masked discrete diffusion of GenMol. GenMol significantly outperforms the previous GPT-based model in de novo generation and fragment-constrained generation, and achieves state-of-the-art performance in goal-directed hit generation and lead optimization. These results demonstrate that GenMol can tackle a wide range of drug discovery tasks, providing a unified and versatile approach for molecular design.
Paperid:2300
Authors:John Schultz · Jakub Adamek · Matej Jusup · Marc Lanctot · Michael Kaisers · Sarah Perrin · Daniel Hennes · Jeremy Shar · Cannada Lewis · Anian Ruoss · Tom Zahavy · Petar Veličković · Laurel Prince · Satinder Singh · Eric Malmi · Nenad Tomasev
Abstract:
Advancing the planning and reasoning capabilities of Large Language Models (LLMs) is one of the key prerequisites to unlocking their potential for performing reliably in complex and impactful domains. In this paper, we aim to demonstrate this across board games (Chess, Fischer Random / Chess960, Connect Four, and Hex), and we show that search-based planning can yield significant improvements in LLM game-playing strength. We introduce, compare and contrast two major approaches: In external search, the model guides Monte Carlo Tree Search (MCTS) rollouts and evaluations without calls to an external game engine, and in internal search, the model is trained to generate in-context a linearized tree of search and a resulting final choice. Both build on a language model pre-trained on relevant domain knowledge, reliably capturing the transition and value functions in the respective environments, with minimal hallucinations. We evaluate our LLM search implementations against game-specific state-of-the-art engines, showcasing substantial improvements in strength over the base model, and reaching Grandmaster-level performance in chess while operating closer to the human search budget. Our proposed approach, combining search with domain knowledge, is not specific to board games, hinting at more general future applications.
Paperid:2301
Authors:Alexandra Proca · Clémentine Dominé · Murray Shanahan · Pedro Mediano
Abstract:
Recurrent neural networks (RNNs) are powerful models used widely in both machine learning and neuroscience to learn tasks with temporal dependencies and to model neural dynamics. However, despite significant advancements in the theory of RNNs, there is still limited understanding of their learning process and the impact of the temporal structure of data. Here, we aim to bridge this gap by analyzing the learning dynamics in linear RNNs (LRNNs) analytically. We derive an analytical solution to the learning dynamics of LRNNs under certain conditions, enabled by a novel framework that accounts for task dynamics. Our mathematical result further reveals three key properties of LRNNs: (1) Task dynamics impact solution stability and extrapolation. (2) The tradeoff between recurrent and feedforward computation is governed by a phase transition that leads to low-rank solutions. (3) Recurrence facilitates rich learning, as shown through a novel derivation of the neural tangent kernel for finite-width LRNNs. As a final proof-of-concept, we apply our theoretical framework to explain the behavior of LRNNs performing sensory integration tasks. Our work provides a first analytical treatment of the relationship between the temporal dependencies in tasks and learning dynamics in LRNNs, building a foundation for understanding how complex dynamic behavior emerges in cognitive models.
Paperid:2302
Authors:Ruyi Xu · Guangxuan Xiao · Haofeng Huang · Junxian Guo · Song Han
Abstract:
Long-Context Transformer Models (LCTMs) are vital for real-world applications but suffer high computational costs due to attention's quadratic complexity. Block-sparse attention mitigates this by focusing computation on critical regions, yet existing methods struggle with balancing accuracy and efficiency due to costly block importance measurements. In this paper, we introduce XAttention, a plug-and-play framework that dramatically accelerates long-context inference in Transformer models using sparse attention. XAttention's key innovation is the insight that the sum of antidiagonal values (i.e., from the lower-left to upper-right) in the attention matrix provides a powerful proxy for block importance. This allows for precise identification and pruning of non-essential blocks, resulting in high sparsity and dramatically accelerated inference. Across comprehensive evaluations on demanding long-context benchmarks—including RULER and LongBench for language, VideoMME for video understanding, and VBench for video generation—XAttention achieves accuracy comparable to full attention while delivering substantial computational gains. We demonstrate up to 13.5x acceleration in attention computation. These results underscore XAttention's ability to unlock the practical potential of block-sparse attention, paving the way for scalable and efficient deployment of LCTMs in real-world applications. Code will be made publicly available upon publication.
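The antidiagonal-sum scoring idea from this abstract can be illustrated with a minimal NumPy sketch. The matrix size, block size, and median thresholding rule below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def antidiagonal_block_scores(attn, block=4):
    """Score each (block x block) tile of an attention matrix by the
    sum of its antidiagonal (lower-left to upper-right) entries."""
    n = attn.shape[0]
    assert n % block == 0
    scores = np.zeros((n // block, n // block))
    for i in range(0, n, block):
        for j in range(0, n, block):
            tile = attn[i:i + block, j:j + block]
            # np.fliplr turns the antidiagonal into the main diagonal
            scores[i // block, j // block] = np.trace(np.fliplr(tile))
    return scores

rng = np.random.default_rng(0)
attn = rng.random((8, 8))        # stand-in attention matrix
scores = antidiagonal_block_scores(attn)
# keep only blocks whose score exceeds the median -> a sparsity mask
mask = scores > np.median(scores)
print(scores.shape, int(mask.sum()))
```

Blocks with a `False` mask entry would be skipped entirely during the attention computation, which is where the claimed speedup comes from.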
Paperid:2303
Authors:Haoyuan Qin · Zhengzhu Liu · Chenxing Lin · Chennan Ma · Songzhu Mei · Siqi Shen · Cheng Wang
Abstract:
Parameter-sharing (PS) techniques have been widely adopted in cooperative Multi-Agent Reinforcement Learning (MARL). In PS, all the agents share a policy network with identical parameters, which enjoys good sample efficiency. However, PS could lead to homogeneous policies that limit MARL performance. We tackle this problem from the angle of gradient conflict among agents. We find that the existence of futile neurons whose update is canceled out by gradient conflicts among agents leads to poor learning efficiency and diversity. To address this deficiency, we propose GradPS, a gradient-based PS method. It dynamically creates multiple clones for each futile neuron. For each clone, a group of agents with low gradient-conflict shares the neuron's parameters, which are updated according to the gradients of each agent group. Our method can enjoy good sample efficiency by sharing the gradients among agents of the same clone neuron. Moreover, it can encourage diverse behaviors through independently updating an exclusive clone neuron, without gradient conflict. Through extensive experiments, we show that GradPS can learn diverse policies with promising performance.
Paperid:2304
Authors:Robbert Reijnen · Yaoxin Wu · Zaharah Bukhsh · Yingqian Zhang
Abstract:
Deep reinforcement learning (DRL) has been widely used for dynamic algorithm configuration, particularly in evolutionary computation, which benefits from the adaptive update of parameters during the algorithmic execution. However, applying DRL to algorithm configuration for multi-objective combinatorial optimization (MOCO) problems remains relatively unexplored. This paper presents a novel graph neural network (GNN) based DRL to configure multi-objective evolutionary algorithms. We model the dynamic algorithm configuration as a Markov decision process, representing the convergence of solutions in the objective space by a graph, with their embeddings learned by a GNN to enhance the state representation. Experiments on diverse MOCO challenges indicate that our method outperforms traditional and DRL-based algorithm configuration methods in terms of efficacy and adaptability. It also exhibits advantageous generalizability across objective types and problem sizes, and applicability to different evolutionary computation methods.
Paperid:2305
Authors:Donghwa Kim · Jaewook Lee · Chulhee Yun
Abstract:
We analyze the convergence rates of two popular variants of coordinate descent (CD): random CD (RCD), in which the coordinates are sampled uniformly at random, and random-permutation CD (RPCD), in which random permutations are used to select the update indices. Despite abundant empirical evidence that RPCD outperforms RCD in various tasks, the theoretical gap between the two algorithms’ performance has remained elusive. Even for the benign case of positive-definite quadratic functions with permutation-invariant Hessians, previous efforts have failed to demonstrate a provable performance gap between RCD and RPCD. To this end, we present novel results showing that, for a class of quadratics with permutation-invariant structures, the contraction rate upper bound for RPCD is always strictly smaller than the contraction rate lower bound for RCD for every individual problem instance. Furthermore, we conjecture that this function class contains the worst-case examples of RPCD among all positive-definite quadratics. Combined with our RCD lower bound, this conjecture extends our results to the general class of positive-definite quadratic functions.
Paperid:2306
Authors:Junlin Wang · Roy Xie · Shang Zhu · Jue Wang · Ben Athiwaratkun · Bhuwan Dhingra · Shuaiwen Song · Ce Zhang · James Zou
Abstract:
Building helpful and harmless large language models (LLMs) requires an effective model alignment approach based on human instructions and feedback, which necessitates high-quality human-labeled data. Constructing such datasets is often expensive and hard to scale, and may face potential limitations on diversity and generalization. To address these challenges, we introduce Mixture of Agents Alignment (MoAA), which leverages the collective strengths of various language models to provide high-quality data for model alignment. By employing MoAA, we enhance both supervised fine-tuning and preference optimization, leading to improved performance compared to using a single model alone to generate alignment data (e.g. using GPT-4o alone). Evaluation results show that our approach can improve the win rate of LLaMA-3.1-8B-Instruct from 19.5 to 48.3 on Arena-Hard and from 22.33 to 57.23 on AlpacaEval2, highlighting a promising direction for model alignment through this new scalable and diverse synthetic data recipe. Furthermore, we demonstrate that MoAA enables a self-improvement pipeline, where models fine-tuned on MoA-generated data surpass their own initial capabilities, providing evidence that our approach can push the frontier of open-source LLMs without reliance on stronger external supervision. Data and code will be released.
Paperid:2307
Authors:Sung June Kim · Gyeongrok Oh · Heeju Ko · Daehyun Ji · Dongwook Lee · Byung-Jun Lee · Sujin Jang · Sangpil Kim
Abstract:
Navigating in an unfamiliar environment during deployment poses a critical challenge for a vision-language navigation (VLN) agent. Yet, test-time adaptation (TTA) remains relatively underexplored in robotic navigation, leading us to the fundamental question: what are the key properties of TTA for online VLN? In our view, effective adaptation requires three qualities: 1) flexibility in handling different navigation outcomes, 2) interactivity with the external environment, and 3) maintaining harmony between plasticity and stability. To address this, we introduce FeedTTA, a novel TTA framework for online VLN utilizing feedback-based reinforcement learning. Specifically, FeedTTA learns by maximizing binary episodic feedback, a practical setup in which the agent receives a binary scalar after each episode that indicates the success or failure of the navigation. Additionally, we propose a gradient regularization technique that leverages the binary structure of FeedTTA to achieve a balance between plasticity and stability during adaptation. Our extensive experiments on challenging VLN benchmarks demonstrate the superior adaptability of FeedTTA, even outperforming state-of-the-art offline training methods on the REVERIE benchmark with a single stream of learning.
Paperid:2308
Authors:Erpai Luo · Xinran Wei · Lin Huang · Yunyang Li · Han Yang · wang · Chang Liu · Zaishuo Xia · Jia Zhang · Bin Shao
Abstract:
Hamiltonian matrix prediction is pivotal in computational chemistry, serving as the foundation for determining a wide range of molecular properties. While SE(3) equivariant graph neural networks have achieved remarkable success in this domain, their substantial computational cost—driven by high-order tensor product (TP) operations—restricts their scalability to large molecular systems with extensive basis sets. To address this challenge, we introduce SPHNet, an efficient and scalable equivariant network that incorporates adaptive SParsity into Hamiltonian prediction. SPHNet employs two innovative sparse gates to selectively constrain non-critical interaction combinations, significantly reducing tensor product computations while maintaining accuracy. To optimize the sparse representation, we develop a Three-phase Sparsity Scheduler, ensuring stable convergence and achieving high performance at sparsity rates of up to 70\%. Extensive evaluations on QH9 and PubchemQH datasets demonstrate that SPHNet achieves state-of-the-art accuracy while providing up to a 7x speedup over existing models. Beyond Hamiltonian prediction, the proposed sparsification techniques also hold significant potential for improving the efficiency and scalability of other SE(3) equivariant networks, further broadening their applicability and impact.
Paperid:2309
Authors:Hang Guo · Yawei Li · Tao Dai · Shutao Xia · Luca Benini
Abstract:
Fine-tuning pre-trained diffusion models under limited training budgets has achieved great success. In particular, the recent advances that directly fine-tune the quantized weights using Low-rank Adaptation (LoRA) further push the training efficiency. Despite the reduced training cost, we point out that the existing adaptation recipe is not inference-efficient. Specifically, additional post-training quantization (PTQ) on the tuned weights is needed for accelerated inference, which can result in a noticeable performance drop when the bit-width is low. Based on this observation, we introduce IntLoRA, which adapts the quantized diffusion models with integer-type low-rank parameters, to include inference efficiency during tuning. Specifically, our IntLoRA enables the pre-trained weights to be quantized during training, facilitating fine-tuning on consumer-level GPUs. During inference, our IntLoRA weights can be seamlessly merged into pre-trained weights to obtain the quantized downstream weights without PTQ. Extensive experiments show our IntLoRA achieves significant speedup on both training and inference without losing performance.
Paperid:2310
Authors:Zhengxuan Wu · Aryaman Arora · Atticus Geiger · Zheng Wang · Jing Huang · Dan Jurafsky · Christopher Manning · Christopher Potts
Abstract:
Fine-grained steering of language model outputs is essential for safety and reliability. Prompting and finetuning are widely used to achieve these goals, but interpretability researchers have proposed a variety of representation-based techniques as well, including sparse autoencoders (SAEs), linear artificial tomography, supervised steering vectors, linear probes, and representation finetuning. At present, there is no benchmark for making direct comparisons between these proposals. Therefore, we introduce AxBench, a large-scale benchmark for steering and concept detection, and report experiments on Gemma-2-2B and 9B. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods such as difference-in-means perform the best. On both evaluations, SAEs are not competitive. We introduce a novel weakly-supervised representational method (Rank-1 Representation Finetuning; ReFT-r1), which is competitive on both tasks while providing the interpretability advantages that prompting lacks. Along with AxBench, we train and publicly release SAE-scale feature dictionaries for ReFT-r1 and DiffMean.
Paperid:2311
Authors:Bowen Jin · Jinsung Yoon · Zhen Qin · Ziqi Wang · Wei Xiong · Yu Meng · Jiawei Han · Sercan Arik
Abstract:
Large Language Models (LLMs) have revolutionized artificial intelligence with capabilities in reasoning, coding, and communication, driving innovation across industries. Their true potential depends on effective alignment to ensure correct, trustworthy and ethical behavior, addressing challenges like misinformation, hallucinations, bias and misuse. While existing Reinforcement Learning (RL)-based alignment methods are notoriously complex, direct optimization approaches offer a simpler alternative. In this work, we introduce a novel direct optimization approach for LLM alignment by drawing on established Information Retrieval (IR) principles. We present a systematic framework that bridges LLM alignment and IR methodologies, mapping LLM generation and reward models to IR's retriever-reranker paradigm. Building on this foundation, we propose LLM Alignment as Retriever Preference Optimization (LarPO), a new alignment method that enhances overall alignment quality. Extensive experiments validate LarPO's effectiveness with 38.9% and 13.7% average improvements on AlpacaEval2 and MixEval-Hard, respectively. Our work opens new avenues for advancing LLM alignment by integrating IR foundations, offering a promising direction for future research.
Paperid:2312
Authors:Chinmay Savadikar · Xi Song · Tianfu Wu
Abstract:
Fine-tuning large pretrained Transformer models can focus on either introducing a small number of new learnable parameters (parameter efficiency) or editing representations of a small number of tokens using lightweight modules (representation efficiency). While the pioneering method LoRA (Low-Rank Adaptation) inherently balances parameter, compute, and memory efficiency, many subsequent variants trade off compute and memory efficiency and/or performance to further reduce fine-tuning parameters. To address this limitation and unify parameter-efficient and representation-efficient fine-tuning, we propose Weight-Aware Fine-Tuning (WAFT), a novel approach that learns to generate fine-tuning weights directly from the pretrained weights. WAFT employs a simple low-rank formulation consisting of two linear layers, shared across multiple layers of the pretrained model. This design achieves multi-faceted efficiency in parameters, representations, compute, and memory, while maintaining or exceeding the performance of LoRA and its variants. Extensive experiments on commonsense reasoning, arithmetic reasoning, instruction following, code generation, and visual recognition verify the effectiveness of our proposed WAFT.
Paperid:2313
Authors:Jing Qiao · Yu Liu · Zengzhe Chen · Mingyi Li · YUAN YUAN · Xiao Zhang · Dongxiao Yu
Abstract:
This paper investigates decentralized unlearning, aiming to eliminate the impact of a specific client on the whole decentralized system. However, the characteristics of decentralized communication pose new challenges for effective unlearning: the indirect connections make it difficult to trace the specific client's impact, while the dynamic topology limits the scalability of retraining-based unlearning methods. In this paper, we propose the first **P**rovable **D**ecentralized **U**nlearning algorithm under **D**ynamic **T**opologies, called PDUDT. It allows clients to eliminate the influence of a specific client without additional communication or retraining. We provide rigorous theoretical guarantees for PDUDT, showing it is statistically indistinguishable from perturbed retraining. Additionally, it achieves an efficient convergence rate of $\mathcal{O}(\frac{1}{T})$ in subsequent learning, where $T$ is the total number of communication rounds. This rate matches state-of-the-art results. Experimental results show that compared with the Retrain method, PDUDT saves more than 99\% of unlearning time while achieving comparable unlearning performance.
Paperid:2314
Authors:Xinyuan Fan · Bufan Li · Chenlei Leng · Weichi Wu
Abstract:
This paper introduces Graphon Attachment Network Models (GAN-M), a novel framework for modeling evolving networks with rich structural dependencies, grounded in graphon theory. GAN-M provides a flexible and interpretable foundation for studying network formation by leveraging graphon functions to define attachment probabilities, thereby combining the strengths of graphons with a temporal perspective. A key contribution of this work is a methodology for learning structural changes in these networks over time. Our approach uses graph counts---frequencies of substructures such as triangles and stars---to capture shifts in network topology. We propose a new statistic designed to learn changes in the resulting piecewise polynomial signals and develop an efficient method for change detection, supported by theoretical guarantees. Numerical experiments demonstrate the effectiveness of our approach across various network settings, highlighting its potential for dynamic network analysis.
Paperid:2315
Authors:Matteo Zecchin · Sangwoo Park · Osvaldo Simeone
Abstract:
We introduce adaptive learn-then-test (aLTT), an efficient hyperparameter selection procedure that provides finite-sample statistical guarantees on the population risk of AI models. Unlike the existing learn-then-test (LTT) technique, which relies on conventional p-value-based multiple hypothesis testing (MHT), aLTT implements sequential data-dependent MHT with early termination by leveraging e-processes. As a result, aLTT can reduce the number of testing rounds, making it particularly well-suited for scenarios in which testing is costly or presents safety risks. Apart from maintaining statistical validity, in applications such as online policy selection for offline reinforcement learning and prompt engineering, aLTT is shown to achieve the same performance as LTT while requiring only a fraction of the testing rounds.
Paperid:2316
Authors:Mingjun Pan · Guanquan Lin · You-Wei Luo · Zhien Dai · Bin Zhu · Lijun Sun · Chun Yuan
Abstract:
Reinforcement Learning (RL) has emerged as a powerful tool for neural combinatorial optimization, enabling models to learn heuristics that solve complex problems without requiring expert knowledge. Despite significant progress, existing RL approaches face challenges such as diminishing reward signals and inefficient exploration in vast combinatorial action spaces, leading to inefficiency. In this paper, we propose Preference Optimization, a novel method that transforms quantitative reward signals into qualitative preference signals via statistical comparison modeling, emphasizing the superiority among sampled solutions. Methodologically, by reparameterizing the reward function in terms of policy and utilizing preference models, we formulate an entropy-regularized RL objective that aligns the policy directly with preferences while avoiding intractable computations. Furthermore, we integrate local search techniques into fine-tuning rather than post-processing to generate high-quality preference pairs, helping the policy escape local optima. Empirical results on various benchmarks, such as the Traveling Salesman Problem (TSP), the Capacitated Vehicle Routing Problem (CVRP) and the Flexible Flow Shop Problem (FFSP), demonstrate that our method significantly outperforms existing RL algorithms, achieving superior convergence efficiency and solution quality. Our work offers a simple yet efficient algorithmic advancement in RL for combinatorial optimization.
Paperid:2317
Authors:Xinyu Yang · Tom Zollo · Benjamin Eyre · Amogh Inamdar · David Madras · Richard Zemel
Abstract:
As machine learning models grow increasingly competent, their predictions are being used to supplement scarce or expensive data for estimating important quantities. By combining a small set of high-fidelity observed data (i.e., genuine measurements) with a larger set of imputed data (i.e., model predictions), practitioners can enhance estimation quality beyond what either source provides in isolation. Though this paradigm shows promise, existing frameworks focus narrowly on estimating means or single quantiles, limiting their applicability for many critical domains and use cases. To address this challenge, we introduce QuEst, a framework that incorporates both observed and imputed data to estimate, and provide rigorous confidence intervals for, quantile-based distributional measures. Such quantile-based measures include tail measures such as CVaR, population segments like quartiles, and other key quantities of interest in fields including economics, sociology, education, and medicine. As part of QuEst, we also introduce an algorithm to estimate these statistics for multidimensional measurements and metrics. Further, we offer a novel spline-function-based method for optimizing our method (as well as other existing methods for such hybrid estimation). We demonstrate the utility of our framework through experiments in economic modeling, opinion polling, and language model auto-evaluation.
Paperid:2318
Authors:Ning Liu · Yue Yu
Abstract:
Attention mechanisms have emerged as transformative tools in core AI domains such as natural language processing and computer vision. Yet, their largely untapped potential for modeling intricate physical systems presents a compelling frontier. Learning such systems often entails discovering operators that map between functional spaces using limited instances of function pairs, a task commonly framed as a severely ill-posed inverse PDE problem. In this work, we introduce Neural Interpretable PDEs (NIPS), a novel neural operator architecture that builds upon and enhances Nonlocal Attention Operators (NAO) in both predictive accuracy and computational efficiency. NIPS employs a linear attention mechanism to enable scalable learning and integrates a learnable kernel network that acts as a channel-independent convolution in Fourier space. As a consequence, NIPS eliminates the need to explicitly compute and store large pairwise interactions, effectively amortizing the cost of handling spatial interactions into the Fourier transform. Empirical evaluations demonstrate that NIPS consistently surpasses NAO and other baselines across diverse benchmarks, heralding a substantial leap in scalable, interpretable, and efficient physics learning.
Paperid:2319
Authors:Zhengyu Fang · Zhimeng Jiang · Huiyuan Chen · Xiao Li · Jing Li
Abstract:
Tabular data generation has attracted significant research interest in recent years, with the tabular diffusion models greatly improving the quality of synthetic data. However, while memorization—where models inadvertently replicate exact or near-identical training data—has been thoroughly investigated in image and text generation, its effects on tabular data remain largely unexplored. In this paper, we conduct the first comprehensive investigation of memorization phenomena in diffusion models for tabular data. Our empirical analysis reveals that memorization appears in tabular diffusion models and increases with larger training epochs. We further examine the influence of factors such as dataset sizes, feature dimensions, and different diffusion models on memorization. Additionally, we provide a theoretical explanation for why memorization occurs in tabular diffusion models. To address this issue, we propose TabCutMix, a simple yet effective data augmentation technique that exchanges randomly selected feature segments between random same-class training sample pairs. Building upon this, we introduce TabCutMixPlus, an enhanced method that clusters features based on feature correlations and ensures that features within the same cluster are exchanged together during augmentation. This clustering mechanism mitigates out-of-distribution (OOD) generation issues by maintaining feature coherence. Experimental results across various datasets and diffusion models demonstrate that TabCutMix effectively mitigates memorization while maintaining high-quality data generation. Our code is available at https://anonymous.4open.science/r/TabCutMix-3F7B.
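The feature-segment exchange described in this abstract can be sketched in a few lines of NumPy. The pairing scheme and fraction of swapped columns below are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def tabcutmix(X, y, frac=0.5, rng=None):
    """Swap a random subset of feature columns between pairs of
    training samples that share the same class label."""
    rng = np.random.default_rng(rng)
    X_aug = X.copy()
    n_feat = X.shape[1]
    k = max(1, int(frac * n_feat))
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        partner = rng.permutation(idx)            # random same-class pairing
        cols = rng.choice(n_feat, size=k, replace=False)
        # exchange the selected feature segment within the class
        X_aug[np.ix_(idx, cols)] = X[np.ix_(partner, cols)]
    return X_aug

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))
y = np.array([0, 0, 0, 1, 1, 1])
X_aug = tabcutmix(X, y, rng=2)
print(X_aug.shape)
```

Because values are only permuted within a class, each column's per-class marginal distribution is preserved, which is what makes the augmentation cheap to reason about.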
Paperid:2320
Authors:Suyu Liu · Zhiguang Cao · Shanshan Feng · Yew Soon ONG
Abstract:
Solving different kinds of vehicle routing problems (VRPs) with a unified neural solver has gained huge attention over the past years. Despite their effectiveness, existing neural multi-task solvers often overlook the geometric structures inherent in different tasks, which can lead to sub-optimal performance. To address this potential limitation, in this paper, we propose a curvature-aware pre-training framework that enables the model to fully exploit the rich geometric information embedded in the tasks. Specifically, we introduce a multi-task vehicle routing solver that incorporates mixed-curvature spaces during the feature fusion stage, encouraging the model to capture the underlying geometric properties of each instance. Through extensive experiments, we evaluate the proposed strategy on existing neural solver architectures and various testing scenarios. The results demonstrate that the curvature-aware pre-training approach not only brings benefits to the generalization abilities of existing neural VRP solvers on synthetic datasets, but also improves solution quality on real-world benchmarks.
Paperid:2321
Authors:Ruilin Wang · Huixia Li · Yuexiao Ma · Xiawu Zheng · Fei Chao · Xuefeng Xiao · Rongrong Ji
Abstract:
Inference latency stands as a critical bottleneck in the large-scale deployment of Large Language Models (LLMs). Speculative decoding methods have recently shown promise in accelerating inference without compromising the output distribution. However, existing work typically relies on a dualistic draft-verify framework and lacks rigorous theoretical grounding. In this paper, we introduce a novel \emph{polybasic} speculative decoding framework, underpinned by a comprehensive theoretical analysis. Specifically, we prove a fundamental theorem that characterizes the optimal inference time for multi-model speculative decoding systems, shedding light on how to extend beyond the dualistic approach to a more general polybasic paradigm. Through our theoretical investigation of multi-model token generation, we expose and optimize the interplay between model capabilities, acceptance lengths, and overall computational cost. Our framework supports both standalone implementation and integration with existing speculative techniques, leading to accelerated performance in practice. Experimental results across multiple model families demonstrate that our approach yields speedup ratios ranging from $3.31\times$ to $4.01\times$ for LLaMA2-Chat 7B, up to $3.87\times$ for LLaMA3-8B, up to $4.43\times$ for Vicuna-7B and up to $3.85\times$ for Qwen2-7B---all while preserving the original output distribution. We release our theoretical proofs and implementation code to facilitate further investigation into polybasic speculative decoding.
Paperid:2322
Authors:Ruitao Pu · Yang Qin · Xiaomin Song · Dezhong Peng · Zhenwen Ren · Yuan Sun
Abstract:
Recently, numerous cross-modal hashing (CMH) methods have been proposed, yielding remarkable progress. As a static learning paradigm, existing CMH methods often implicitly assume that all modalities are prepared before processing. However, in practical applications (such as multi-modal medical diagnosis), it is very challenging to collect paired multi-modal data simultaneously. Specifically, they are collected chronologically, forming streaming-media data (SMA). To handle this, all previous CMH methods require retraining on data from all modalities, which inevitably limits the scalability and flexibility of the model. In this paper, we propose a novel CMH paradigm named Streaming-media Hashing rEtrieval (SHE) that enables parallel training of each modality. Specifically, we first propose a knowledge library mining module (KLM) that extracts a prototype knowledge library for each modality, thereby revealing the commonality distribution of the instances from each modality. Then, we propose a knowledge library transfer module (KLT) that updates and aligns the new knowledge by utilizing the historical knowledge library, ensuring semantic consistency. Finally, to enhance intra-class semantic relevance and inter-class semantic disparity, we develop a discriminative hashing learning module (DHL). Comprehensive experiments on four benchmark datasets demonstrate the superiority of our SHE compared to 14 competitors.
Paperid:2323
Authors:Runsheng Bai · Bo Liu · qiang liu
Abstract:
Large Language Models (LLMs) exhibit impressive performance across various tasks, but deploying them for inference poses challenges. Their high resource demands often necessitate complex, costly multi-GPU pipelines, or the use of smaller, less capable models. While quantization offers a promising solution utilizing lower precision for model storage, existing methods frequently experience significant performance drops at lower precision levels. Additionally, they typically provide only a limited set of solutions at specific bit levels, many of which are extensively manually tuned. To address these challenges, we propose a new method called \textbf{SKIM}: Scaled K-means clustering wIth Mixed precision. Our approach introduces two novel techniques: 1. A \textit{greedy algorithm} to solve approximately optimal bit allocation across weight channels, and 2. A \textit{trainable scaling vector} for non-differentiable K-means clustering. These techniques substantially improve the model performance and can be adapted to any given bit. Notably, in terms of perplexity, our method narrows the gap between quantized LLaMA models and their full precision counterparts by around \textbf{14\%} on average.
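The building block SKIM starts from, per-channel K-means weight quantization, can be illustrated with a toy NumPy sketch. The greedy bit allocation and the trainable scaling vector are omitted here, and the shapes and bit width are arbitrary assumptions:

```python
import numpy as np

def kmeans_1d(w, k, iters=20, seed=0):
    """Plain 1-D k-means: returns centroids and per-weight codes."""
    rng = np.random.default_rng(seed)
    centroids = rng.choice(w, size=k, replace=False)
    for _ in range(iters):
        # assign each weight to its nearest centroid
        codes = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(codes == c):
                centroids[c] = w[codes == c].mean()
    return centroids, codes

def quantize_channel(w, bits):
    """Replace each weight with its nearest centroid (2**bits levels)."""
    centroids, codes = kmeans_1d(w, 2 ** bits)
    return centroids[codes]

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 64))                       # toy weight matrix
W_q = np.stack([quantize_channel(row, bits=3) for row in W])
print(W_q.shape)
```

Only the codebook (8 centroids per channel at 3 bits) and the integer codes need to be stored, which is where the memory savings of clustering-based quantization come from.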
Paperid:2324
Authors:Zhiwei Tang · Jiangweizhi Peng · Jiasheng Tang · Mingyi Hong · Fan Wang · Tsung-Hui Chang
Abstract:
In this work, we focus on the alignment problem of diffusion models with a continuous reward function, which represents specific objectives for downstream tasks, such as increasing darkness or improving the aesthetics of images. The central goal of the alignment problem is to adjust the distribution learned by diffusion models such that the generated samples maximize the target reward function. We propose a novel alignment approach, named Direct Noise Optimization (DNO), that optimizes the injected noise during the sampling process of diffusion models. By design, DNO operates at inference time, and thus is tuning-free and prompt-agnostic, with the alignment occurring in an online fashion during generation. We rigorously study the theoretical properties of DNO and also propose variants to deal with non-differentiable reward functions. Furthermore, we identify that naive implementation of DNO occasionally suffers from the out-of-distribution reward hacking problem, where optimized samples have high rewards but are no longer in the support of the pretrained distribution. To remedy this issue, we leverage classical high-dimensional statistics theory to derive an effective probability regularization technique. We conduct extensive experiments on several important reward functions and demonstrate that the proposed DNO approach can achieve state-of-the-art reward scores within a reasonable time budget for generation.
Paperid:2325
Authors:raghav singhal · Zachary Horvitz · Ryan Teehan · Mengye Ren · Zhou Yu · Kathleen McKeown · Rajesh Ranganath
Abstract:
Diffusion models are a flexible class of generative models. However, generating samples with user-specified properties remains a challenge. One solution is to fine-tune models to maximize reward functions which capture desired properties. However, fine-tuning can be expensive. In this work, we propose Feynman-Kac (FK) steering, a framework for inference-time steering of diffusion models with reward functions. FK steering works by generating multiple trajectories, called particles, and resampling particles at intermediate steps based on scores computed using functions called potentials. Potentials are defined using rewards for intermediate states and are chosen such that a high score indicates the particle will yield a high-reward sample. We explore various choices of potentials, rewards, and samplers. Steering text-to-image models with a human preference reward, we find that FK steering outperforms fine-tuned models with just 2 particles. Moreover, FK steering a 0.8B parameter model outperforms a 2.6B model, achieving state-of-the-art performance on prompt fidelity. We also steer text diffusion models with rewards for text quality and rare attributes such as toxicity, and find that FK steering generates lower perplexity text and enables gradient-free control. Overall, inference-time scaling and steering of diffusion models – even training-free – provides significant quality and controllability benefits.
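The resampling step at the heart of FK steering can be shown schematically. The "diffusion" below is just a stand-in random walk, and the Gaussian potential around a scalar target is a toy assumption, not one of the paper's potentials:

```python
import numpy as np

def fk_resample(particles, potential, rng):
    """Resample particles in proportion to their potential scores."""
    w = potential(particles)
    w = w / w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]

rng = np.random.default_rng(0)
target = 3.0
particles = rng.normal(size=256)                  # initial noisy states
for _ in range(10):                               # toy "denoising" steps
    particles = particles + 0.1 * rng.normal(size=particles.size)
    # potential: higher score for particles closer to the target
    particles = fk_resample(
        particles, lambda x: np.exp(-(x - target) ** 2), rng)
print(round(float(particles.mean()), 2))
```

Even with this crude setup, repeated reweighting concentrates the particle population in the high-reward region without any gradient information, which is the gradient-free control the abstract refers to.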
Paperid:2326
Authors:Parikshit Pareek · Abhijith Jayakumar · Kaarthik Sundar · Deepjyoti Deka · Sidhant Misra
Abstract:
Constrained optimization problems arise in various engineering systems such as inventory management and power grids. Standard deep neural network (DNN) based machine learning proxies are ineffective in practical settings where labeled data is scarce and training times are limited. We propose a semi-supervised Bayesian Neural Network (BNN) based optimization proxy for this complex regime, wherein training proceeds in a sandwiched fashion, alternating between a supervised learning step for minimizing cost and an unsupervised learning step for enforcing constraint feasibility. We show that the proposed semi-supervised BNN outperforms DNN architectures on important non-convex constrained optimization problems from energy network operations, achieving up to a tenfold reduction in expected maximum equality gap and halving the inequality gaps. Further, the BNN's ability to provide posterior samples is leveraged to construct practically meaningful probabilistic confidence bounds on performance using limited validation data, unlike prior methods.
Paperid:2327
Authors:Siddharth Gollapudi · Ravishankar Krishnaswamy · Kirankumar Shiragur · Harsh Wardhan
Abstract:
Graph-based data structures have become powerful and ubiquitous tools for scalable approximate nearest-neighbor (ANN) search over the past decade. In spite of their apparent practical performance, there has only recently been progress on the **worst-case** performance of these data structures. Indeed, the influential work of \citeauthor{IndykXu} introduced the key concept of $\alpha$-reachable graphs, showing that graphs constructed by the DiskANN algorithm \cite{DiskANN19} produce an $\left(\frac{\alpha+1}{\alpha-1}\right)$-approximate solution with a simple best-first search that runs in poly-logarithmic query time. In our work, we improve and generalize this analysis as follows: (1) we introduce **sorted** $\alpha$-reachable graphs, and use this notion to obtain a stronger approximation factor of $\frac{\alpha}{\alpha-1}$ for the DiskANN algorithm on Euclidean metrics; (2) we present the **first** worst-case theoretical analysis for the popular **beam-search** algorithm, which is used in practice to search these graphs for $k > 1$ candidate nearest neighbors. We also present empirical results validating the significance of sorted $\alpha$-reachable graphs, which align with our theoretical findings.
Paperid:2328
Authors:Yanbo Wang · Justin Dauwels · Yilun Du
Abstract:
Generative models have demonstrated remarkable abilities in generating high-fidelity visual content. In this work, we explore how generative models can further be used not only to synthesize visual content but also to understand the properties of a scene given a natural image. We formulate scene understanding as an inverse generative modeling problem, where we seek to find conditional parameters of a visual generative model that best fit a given natural image. To enable this procedure to infer scene structure from images substantially different from those seen during training, we further propose to build this visual generative model compositionally from smaller models over pieces of a scene. We illustrate how this procedure enables us to infer the set of objects in a scene, enabling robust generalization to new test scenes with an increased number of objects of new shapes. We further illustrate how this enables us to infer global scene factors, likewise enabling robust generalization to new scenes. Finally, we illustrate how this approach can be directly applied to existing pretrained text-to-image generative models for zero-shot multi-object perception.
Paperid:2329
Authors:Da Kuang · GuanWen Qiu · Junhyong Kim
Abstract:
How a single fertilized cell gives rise to a complex array of specialized cell types in development is a central question in biology. The cells replicate, generating cell lineages, and acquire differentiated characteristics through poorly understood molecular processes. A key approach to studying developmental processes is to infer the tree graph of cell-lineage histories, providing an analytical framework for dissecting individual cells' molecular decisions during replication and differentiation (i.e., acquisition of specialized traits). Although genetically engineered lineage-tracing methods have advanced the field, they are either infeasible or ethically constrained in many organisms. By contrast, modern single-cell technologies can measure high-content molecular profiles (\textit{e.g.}, transcriptomes) in a wide range of biological systems. Here, we introduce \emph{CellTreeQM}, a novel deep learning method based on transformer architectures that learns an embedding space with geometric properties optimized for tree-graph inference. By formulating the lineage reconstruction problem as tree-metric learning, we systematically explore supervised, weakly supervised, and unsupervised training settings and present the \emph{Cell Lineage Reconstruction Benchmark} to facilitate comprehensive evaluation. This benchmark includes (1) synthetic data modeled via Brownian motion with independent noise and spurious signals, and (2) lineage-resolved single-cell RNA sequencing datasets. Experimental results show that \emph{CellTreeQM} recovers lineage structures with minimal supervision and limited data, offering a scalable framework for uncovering cell lineage relationships. To our knowledge, this is the first method to cast cell lineage inference explicitly as a metric learning task, paving the way for future computational models aimed at uncovering the molecular dynamics of cell lineage.
Paperid:2330
Authors:Adibvafa Fallahpour · Jun Ma · Alif Munim · Hongwei Lyu · BO WANG
Abstract:
As a cornerstone of diagnostic imaging, chest X-rays (CXRs) play an integral role in driving critical decisions in disease management and patient care. While recent innovations have led to specialized models for various CXR interpretation tasks, these solutions often operate in isolation, limiting their practical utility in clinical practice. We present MedRAX, the first versatile AI agent that seamlessly integrates state-of-the-art CXR analysis tools and multimodal large language models into a unified framework. MedRAX dynamically leverages these models to address complex medical queries without requiring additional training. To rigorously evaluate its capabilities, we introduce ChestAgentBench, a comprehensive benchmark containing 2,500 complex medical queries across 7 diverse categories. Our experiments demonstrate that MedRAX achieves state-of-the-art performance compared to both open-source and proprietary models, representing a significant step toward the practical deployment of automated CXR interpretation systems. Data and code will be publicly available at https://medrax25.github.io.
Paperid:2331
Authors:Hao Dai · Jagmohan Chauhan
Abstract:
Continual Generalized Category Discovery (C-GCD) faces a critical challenge: incrementally learning new classes from unlabeled data streams while preserving knowledge of old classes. Existing methods struggle with catastrophic forgetting, especially when unlabeled data mixes known and novel categories. We address this by analyzing C-GCD's forgetting dynamics through a Bayesian lens, revealing that covariance misalignment between old and new classes drives performance degradation. Building on this insight, we propose Variational Bayes C-GCD (VB-CGCD), a novel framework that integrates variational inference with covariance-aware nearest-class-mean classification. VB-CGCD adaptively aligns class distributions while suppressing pseudo-label noise via stochastic variational updates. Experiments show VB-CGCD surpasses prior art by +15.21% in overall accuracy in the final session on standard benchmarks. We also introduce a new challenging benchmark with only 10% labeled data and extended online phases, on which VB-CGCD achieves a 67.86% final accuracy, significantly higher than the state of the art (38.55%), demonstrating its robust applicability across diverse scenarios.
Paperid:2332
Authors:Dheeraj Baby · Yifei Tang · Hieu Nguyen · Yu-Xiang Wang · Rohit Pyati
Abstract:
In this paper, we study the problem of estimation and learning under temporal distribution shift. Consider an observation sequence of length $n$, which is a noisy realization of a time-varying ground-truth sequence. Our focus is to develop methods to estimate the ground-truth at the final time-step while providing sharp point-wise estimation error rates. We show that, *without prior knowledge* of the level of temporal shift, a wavelet soft-thresholding estimator provides an *optimal* estimation error bound for the ground-truth. Our proposed estimation method generalizes existing research (Mazetto and Upfal, 2023) by establishing a connection between the sequence's non-stationarity level and the sparsity in the wavelet-transformed domain. Our theoretical findings are validated by numerical experiments. Additionally, we apply the estimator to derive sparsity-aware excess risk bounds for binary classification under distribution shift and to develop computationally efficient training objectives. As a final contribution, we draw parallels between our results and the classical signal processing problem of total-variation denoising (Mammen and van de Geer, 1997; Tibshirani, 2014), uncovering *novel optimal* algorithms for this task.
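The wavelet soft-thresholding idea can be illustrated with a single-level Haar transform: when the ground truth drifts sparsely, the detail coefficients are mostly noise, so shrinking them toward zero denoises the sequence. This is a minimal numpy sketch; the paper's estimator, threshold choice, and guarantees are more refined.

```python
import numpy as np

def haar(x):
    """Single-level Haar transform of an even-length sequence."""
    a = (x[0::2] + x[1::2]) / np.sqrt(2)   # approximation coefficients
    d = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail coefficients
    return a, d

def ihaar(a, d):
    """Inverse single-level Haar transform."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2)
    x[1::2] = (a - d) / np.sqrt(2)
    return x

def soft(c, lam):
    return np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)

def denoise(y, lam):
    a, d = haar(y)
    return ihaar(a, soft(d, lam))   # keep the trend, shrink detail noise

rng = np.random.default_rng(1)
truth = np.repeat([0.0, 0.0, 5.0, 5.0], 32)   # piecewise-constant temporal shift
y = truth + rng.normal(0, 1, truth.size)
est = denoise(y, lam=1.0)
```

Practical estimators apply multi-level transforms with level-dependent thresholds; one level already shows the sparsity-exploiting mechanism.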
Paperid:2333
Authors:Francesco Tonin · Alex Lambert · Johan Suykens · Volkan Cevher
Abstract:
Fairness of decision-making algorithms is an increasingly important issue. In this paper, we focus on spectral clustering with group fairness constraints, where every demographic group is represented in each cluster proportionally as in the general population. We present a new efficient method for fair spectral clustering (Fair SC) by casting the Fair SC problem within the difference of convex functions (DC) framework. To this end, we introduce a novel variable augmentation strategy and employ an alternating direction method of multipliers type of algorithm adapted to DC problems. We show that each associated subproblem can be solved efficiently, resulting in higher computational efficiency compared to prior work, which required a computationally expensive eigendecomposition. Numerical experiments demonstrate the effectiveness of our approach on both synthetic and real-world benchmarks, showing significant speedups in computation time over prior art, especially as the problem size grows. This work thus represents a considerable step towards the adoption of fair clustering in real-world applications.
Paperid:2334
Authors:Sucheng Ren · Qihang Yu · Ju He · Xiaohui Shen · Alan Yuille · Liang-Chieh Chen
Abstract:
Autoregressive (AR) modeling has achieved remarkable success in natural language processing by enabling models to generate text with coherence and contextual understanding through next token prediction. Recently, in image generation, VAR proposes scale-wise autoregressive modeling, which extends next token prediction to next scale prediction, preserving the 2D structure of images. However, VAR encounters two primary challenges: (1) its complex and rigid scale design limits generalization in next scale prediction, and (2) the generator's dependence on a discrete tokenizer with the same complex scale structure restricts modularity and flexibility in updating the tokenizer. To address these limitations, we introduce FlowAR, a general next scale prediction method featuring a streamlined scale design, where each subsequent scale is simply double the previous one. This eliminates the need for VAR's intricate multi-scale residual tokenizer and enables the use of any off-the-shelf Variational AutoEncoder (VAE). Our simplified design enhances generalization in next scale prediction and facilitates the integration of Flow Matching for high-quality image synthesis. We validate the effectiveness of FlowAR on the challenging ImageNet-256 benchmark, demonstrating superior generation performance compared to previous methods.
Paperid:2335
Authors:Sivan Sabato · Eran Treister · Elad Yom-Tov
Abstract:
We propose methods for auditing multiclass classifiers for fairness under multiclass equalized odds, by estimating the deviation from equalized odds when the classifier is not completely fair. We generalize to multiclass classifiers the measure of Disparate Conditional Prediction (DCP), originally suggested by Sabato & Yom-Tov (2020) for binary classifiers. DCP is defined as the fraction of the population for which the classifier predicts with conditional prediction probabilities that differ from the closest common baseline. We provide new local-optimization methods for estimating the multiclass DCP under two different regimes, one in which the conditional confusion matrices for each protected sub-population are known, and one in which these cannot be estimated, for instance, because the classifier is inaccessible or because good-quality individual-level data is not available. These methods can be used to detect classifiers that likely treat a significant fraction of the population unfairly. Experiments demonstrate the accuracy of the methods. The code for the experiments is provided as supplementary material.
Paperid:2336
Authors:Rihong Qiu · Xinke Jiang · Yuchen Fang · Hongbin Lai · Hao Miao · Xu Chu · Junfeng Zhao · Yasha Wang
Abstract:
With the success of deep learning, Graph Neural Networks (GNNs) have become indispensable for modeling intricate graph structures. However, pretrained GNNs are difficult to apply directly to diverse downstream tasks without finetuning-based continual learning, which incurs high computational costs, hindering the development of the upcoming Large Graph Models (LGMs). In this paper, we study efficient and universal Graph Continual Learning (GCL) on various downstream tasks via dataset distillation, achieved by a novel Lightweight Graph Neural Tangent Kernel framework (LIGHTGNTK). Specifically, LIGHTGNTK utilizes the low-rank approximation of the Laplacian matrix in GNTK to efficiently capture both structure and feature relationships, enabling gradient-based dataset distillation. Additionally, benefiting from the proposed unified subgraph anchoring strategy, LIGHTGNTK can support not only graph-level but also node- and edge-level downstream tasks under different input structures. Extensive experiments on multiple datasets show that LIGHTGNTK achieves state-of-the-art performance in GCL scenarios, facilitating adaptive and scalable LGMs. The associated code and all datasets can be found at https://anonymous.4open.science/r/LightGNTK.
Paperid:2337
Authors:Yu Wang · Shiwei Chen
Abstract:
Weakly supervised video anomaly detection (WS-VAD) is tasked with pinpointing temporal intervals containing anomalous events within untrimmed videos, utilizing only video-level annotations. However, a significant challenge arises due to the absence of dense frame-level annotations, often leading to incomplete localization in existing WS-VAD methods. To address this issue, we present a novel method, LEC-VAD (Learning Event Completeness for Weakly Supervised Video Anomaly Detection), which features a dual structure designed to encode both category-aware and category-agnostic semantics between vision and language. Within LEC-VAD, we devise semantic regularities that leverage an anomaly-aware Gaussian mixture to learn precise event boundaries, thereby yielding more complete event instances. Besides, we develop a novel memory bank-based prototype learning mechanism to enrich the concise text descriptions associated with anomaly-event categories. This innovation bolsters the text's expressiveness, which is crucial for advancing WS-VAD. Our LEC-VAD demonstrates remarkable advancements over the current state-of-the-art methods on two benchmark datasets, XD-Violence and UCF-Crime.
Paperid:2338
Authors:Hung-Chieh Fang · Po-Yi Lu · Hsuan-Tien (Tien) Lin
Abstract:
Universal Domain Adaptation (UniDA) addresses unsupervised domain adaptation where target classes may differ arbitrarily from source ones, except for a shared subset. An important approach, partial domain matching (PDM), aligns only shared classes but struggles in extreme cases where many source classes are absent in the target domain, underperforming the most naive baseline that trains on only source data. In this work, we identify that the failure of PDM for extreme UniDA stems from dimensional collapse (DC) in target representations. To address target DC, we propose to jointly leverage the alignment and uniformity techniques in modern self-supervised learning (SSL) on the unlabeled target data to preserve the intrinsic structure of the learned representations. Our experimental results confirm that SSL consistently advances PDM and delivers new state-of-the-art results across a broader benchmark of UniDA scenarios with different portions of shared classes, representing a crucial step toward truly comprehensive UniDA.
Paperid:2339
Authors:xiaokun Feng · Dailing Zhang · Shiyu Hu · Xuchen Li · Meiqi Wu · Jing Zhang · Xiaotang Chen · Kaiqi Huang
Abstract:
Effectively modeling and utilizing spatiotemporal features from RGB and other modalities (e.g., depth, thermal, and event data, denoted as X) is the core of RGB-X tracker design. Existing methods often employ two parallel branches to separately process the RGB and X input streams, requiring the model to simultaneously handle two dispersed feature spaces, which complicates both the model structure and the computation process. More critically, intra-modality spatial modeling within each dispersed space incurs substantial computational overhead, limiting resources for inter-modality spatial modeling and temporal modeling. To address this, we propose a novel tracker, CSTrack, which focuses on modeling Compact Spatiotemporal features to achieve simple yet effective tracking. Specifically, we first introduce an innovative Spatial Compact Module that integrates the RGB-X dual input streams into a compact spatial feature, enabling thorough intra- and inter-modality spatial modeling. Additionally, we design an efficient Temporal Compact Module that compactly represents temporal features by constructing a refined target distribution heatmap. Extensive experiments validate the effectiveness of our compact spatiotemporal modeling method, with CSTrack achieving new SOTA results on mainstream RGB-X benchmarks.
Paperid:2340
Authors:Ibrahim Fayad · Max Zimmer · Martin Schwartz · Philippe CIAIS · Fabian Gieseke · Aurélien de Truchis · Sarah Brood · Gabriel Belouze · Alexandre d'Aspremont
Abstract:
In recent years, significant efforts have been directed towards adapting self-supervised multimodal learning for Earth observation. However, existing methods produce coarse patch-sized embeddings, limiting their effectiveness and integration with other modalities like LiDAR. To close this gap, we present DUNIA, an approach to learn pixel-sized embeddings through cross-modal alignment between images and full-waveform LiDAR data. As the model is trained in a contrastive manner, the embeddings can be directly leveraged in the context of a variety of environmental monitoring tasks in a zero-shot setting. In our experimental evaluation, we demonstrate the effectiveness of the embeddings for seven such tasks (canopy height mapping, fractional canopy cover, land cover mapping, tree species identification, plant area index, crop type classification, and per-pixel waveform-based vertical structure mapping). The results show that the embeddings along with zero-shot classifiers often outperform specialized supervised models both quantitatively and qualitatively, even in low-labeled data regimes. In the fine-tuning setting, we show strong low-shot capabilities with performance near or better than the state of the art on five out of six tasks.
Paperid:2341
Authors:Dongliang Guo · Mengxuan Hu · Zihan Guan · Thomas Hartvigsen · Sheng Li
Abstract:
Large multimodal models inevitably decay over time as facts change and previously learned information becomes outdated. Traditional approaches such as fine-tuning are often impractical for updating these models due to their size and complexity. Instead, direct knowledge editing within the models presents a more viable solution. Current model editing techniques, however, typically overlook the unique influence ranges of different facts, leading to compromised model performance in terms of both generality and locality. To address this issue, we introduce the concept of the generality-locality trade-off in multi-modal model editing. We develop a new model editing dataset named OKEDIT, specifically designed to effectively evaluate this trade-off. Building on this foundation, we propose \textbf{BalancEdit}, a novel method for balanced model editing that dynamically achieves an optimal balance between generality and locality. BalancEdit utilizes a unique mechanism that generates both positive and negative samples for each fact to accurately determine its influence scope and incorporates these insights into the model's latent space using a discrete, localized codebook of edits, without modifying the underlying model weights. To our knowledge, this is the first approach explicitly addressing the generality-locality trade-off in multi-modal model editing. Our comprehensive results confirm the effectiveness of BalancEdit, demonstrating minimal trade-offs while maintaining robust editing capabilities. Our code and dataset will be available.
Paperid:2342
Authors:Kunjie Ren · Luo Luo
Abstract:
This paper considers zeroth-order optimization for stochastic convex minimization problems. We propose a parameter-free stochastic zeroth-order method (POEM) by introducing a step-size scheme based on the distance over finite differences and an adaptive smoothing parameter. We provide a theoretical analysis showing that POEM achieves near-optimal stochastic zeroth-order oracle complexity. We further conduct numerical experiments demonstrating that POEM outperforms existing zeroth-order methods in practice.
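As background, a basic single-sample zeroth-order gradient estimator with a smoothing parameter can be sketched as below. This is illustrative only: POEM's parameter-free step-size scheme and adaptive smoothing are the paper's contribution and are not reproduced here.

```python
import numpy as np

def zo_grad(f, x, rng, mu=1e-4):
    """Single-sample zeroth-order gradient estimate via a random direction u:
    g = (f(x + mu*u) - f(x)) / mu * u, an unbiased estimate of grad f as mu -> 0."""
    u = rng.normal(size=x.shape)
    return (f(x + mu * u) - f(x)) / mu * u

def zo_minimize(f, x0, lr=0.05, steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(steps):
        x -= lr * zo_grad(f, x, rng)   # fixed step size (unlike POEM's scheme)
    return x

f = lambda x: np.sum((x - 1.0) ** 2)   # convex test objective
x_star = zo_minimize(f, np.zeros(3))
```

Only function evaluations are used, never gradients; the choice of `lr` and `mu` is exactly what parameter-free methods aim to automate.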
Paperid:2343
Authors:Xiancheng Sun · Senmao Ma · Shengxi Li · Mai Xu · Jingyuan Xia · Lai Jiang · Xin Deng · Jiali Wang
Abstract:
Panoramic image outpainting plays a pivotal role in immersive content generation, allowing for seamless restoration and completion of panoramic content. Given that the majority of generative outpainting solutions operate on planar images, existing panoramic methods address the spherical nature only by soft regularisation during end-to-end learning, which still fails to fully exploit the spherical content. In this paper, we make the first attempt to impose the spherical nature in the design of the diffusion model, such that the panoramic format is intrinsically ensured during the learning procedure, termed the spherical-nested diffusion (SpND) model. This is achieved by employing spherical noise in the diffusion process to address the structural prior, together with a newly proposed spherical deformable convolution (SDC) module to intrinsically learn the panoramic knowledge. Upon this, the proposed SpND method is effectively integrated into a pre-trained diffusion model, outperforming existing state-of-the-art methods for panoramic image outpainting. Noticeably, our SpND method reduces the FID values by more than 50\% against the state-of-the-art PanoDiffusion method. Our code will be made publicly available upon acceptance.
Paperid:2344
Authors:Jinyang Liu · Tessa Steensgaard · Marvin N. Wright · Niklas Pfister · Munir Hiabu
Abstract:
Many existing interpretation methods are based on Partial Dependence (PD) functions that, for a pretrained machine learning model, capture how a subset of the features affects the predictions by averaging over the remaining features. Notable methods include Shapley additive explanations (SHAP), which computes feature contributions based on a game theoretical interpretation, and PD plots (i.e., 1-dim PD functions), which capture average marginal main effects. Recent work has connected these approaches using a functional decomposition and argues that SHAP values can be misleading since they merge main and interaction effects into a single local effect. However, a major advantage of SHAP compared to other PD-based interpretations has been the availability of fast estimation techniques, such as TreeSHAP. In this paper, we propose a new tree-based estimator, FastPD, which efficiently estimates arbitrary PD functions. We show that FastPD consistently estimates the desired population quantity -- in contrast to path-dependent TreeSHAP, which is inconsistent when features are correlated. For moderately deep trees, FastPD improves the complexity of existing methods from quadratic to linear in the number of observations. By estimating PD functions for arbitrary feature subsets, FastPD can be used to extract PD-based interpretations such as SHAP, PD plots and higher-order interaction effects.
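For reference, the 1-dim PD function that such estimators target can be computed by brute force: fix feature j at a grid value and average the model's predictions over the empirical distribution of the remaining features. Tree-based estimators like FastPD avoid this quadratic cost; the toy model and names below are illustrative.

```python
import numpy as np

def pd_curve(model, X, j, grid):
    """Brute-force 1-dim partial dependence of feature j on a value grid."""
    curve = []
    for v in grid:
        Xv = X.copy()
        Xv[:, j] = v                    # intervene on feature j
        curve.append(model(Xv).mean())  # average over remaining features
    return np.array(curve)

# toy additive model: the PD of feature 0 recovers its main effect (slope 2)
model = lambda X: 2.0 * X[:, 0] + X[:, 1]
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
grid = np.array([-1.0, 0.0, 1.0])
curve = pd_curve(model, X, 0, grid)
```

For an additive model the curve is exactly linear in the grid value, which makes brute-force PD a useful correctness check for faster estimators.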
Paperid:2345
Authors:Marcel Hedman · Desi Ivanova · Cong Guan · Tom Rainforth
Abstract:
We develop a semi-amortized, policy-based approach to Bayesian experimental design (BED) called Stepwise Deep Adaptive Design (Step-DAD). Like existing fully amortized, policy-based BED approaches, Step-DAD trains a design policy upfront before the experiment. However, rather than keeping this policy fixed, Step-DAD periodically updates it as data is gathered, refining it to the particular experimental instance. This test-time adaptation improves both the flexibility and the robustness of the design strategy compared with existing approaches. Empirically, Step-DAD consistently demonstrates superior decision-making and robustness compared with current state-of-the-art BED methods.
Paperid:2346
Authors:Hanshen Xiao · Zhen Yang · G. Edward Suh
Abstract:
This paper studies a range of AI/ML trust concepts, including memorization, data poisoning, and copyright, which can be modeled as constraints on the influence of data on a trained model, characterized by the outcome difference from a processing function (training algorithm). In this realm, we show that provable trustworthy guarantees can be efficiently provided through a new framework termed Data-Specific Indistinguishability (DSI), a relaxation of the classic Input-Independent Indistinguishability (III), which selects trust-preserving randomization tightly aligned with target outcome differences. Technically, we establish both the theoretical and algorithmic foundations of DSI with the optimal high-dimensional Gaussian mechanism. We further show its applications to develop trustworthy deep learning. Experimental results on memorization mitigation for large language models and on backdoor defense show both the efficiency and effectiveness of the DSI noise mechanism. We release our code at an anonymous GitHub link: https://anonymous.4open.science/r/DT-DSI-D5E8.
Paperid:2347
Authors:Kian Ming Chai · Edwin V. Bonilla
Abstract:
We introduce a novel one-parameter variational objective that lower bounds the data evidence and enables the estimation of approximate fractional posteriors. We extend this framework to hierarchical and Bayes posteriors, offering a versatile tool for probabilistic modelling. We demonstrate two cases where gradients can be obtained analytically, and a simulation study on mixture models shows that our fractional posteriors can be used to achieve better calibration compared to posteriors from the conventional variational bound. When applied to variational autoencoders (VAEs), our approach attains tighter evidence bounds and enables learning of high-performing approximate Bayes posteriors jointly with fractional posteriors. We show that VAEs trained with fractional posteriors produce decoders that are better aligned for generation from the prior.
Paperid:2348
Authors:yueheng li · Guangming Xie · Zongqing Lu
Abstract:
Cooperative Multi-Agent Reinforcement Learning (MARL) has become a critical tool for addressing complex real-world problems. However, off-policy MARL methods, which rely on joint Q-functions, face significant scalability challenges due to the exponentially growing joint action space. In this work, we highlight a critical yet often overlooked issue: erroneous Q-target estimation, primarily caused by extrapolation error. Our analysis reveals that this error becomes increasingly severe as the number of agents grows, leading to unique challenges in MARL due to its expansive joint action space and the decentralized execution paradigm. To address these challenges, we propose a suite of techniques tailored for off-policy MARL, including annealed multi-step bootstrapping, averaged Q-targets, and restricted action representation. Experimental results demonstrate that these methods effectively mitigate erroneous estimations, yielding substantial performance improvements in challenging benchmarks such as SMAC, SMACv2, and Google Research Football.
Paperid:2349
Authors:Yunpeng Qu · Kun Yuan · Jinhua Hao · Kai Zhao · Qizhi Xie · Ming Sun · Chao Zhou
Abstract:
Image Super-Resolution (ISR) has seen significant progress with the introduction of remarkable generative models. However, challenges such as the trade-off between fidelity and realism, as well as computational complexity, have also posed limitations on their application. Building upon the tremendous success of autoregressive models in the language domain, we propose \textbf{VARSR}, a novel visual autoregressive modeling framework for ISR with the form of next-scale prediction. To effectively integrate and preserve semantic information in low-resolution images, we propose using prefix tokens to incorporate the condition. Scale-aligned Rotary Positional Encodings are introduced to capture spatial structures, and a diffusion refiner is utilized for modeling the quantization residual loss to achieve pixel-level fidelity. Image-based Classifier-free Guidance is proposed to guide the generation of more realistic images. Furthermore, we collect large-scale data and design a training process to obtain robust generative priors. Quantitative and qualitative results show that VARSR is capable of generating high-fidelity and high-realism images with more efficiency than diffusion-based methods. Our codes and models will be publicly available.
Paperid:2350
Authors:Niccolò Ajroldi · Antonio Orvieto · Jonas Geiping
Abstract:
Averaging checkpoints along the training trajectory is a simple yet powerful approach to improve the generalization performance of Machine Learning models and reduce training time. Motivated by these potential gains, and in an effort to fairly and thoroughly compare these techniques, we present an extensive evaluation of averaging techniques using AlgoPerf \citep{dahlbenchmarking2023}, a large-scale Deep Learning benchmark for optimization algorithms. We investigate whether weight averaging can reduce training time, improve generalization, and replace learning rate decay. Our evaluation across seven architectures and datasets reveals that averaging significantly accelerates training and yields considerable efficiency gains, at the price of a minimal implementation and memory cost, while improving generalization across all considered workloads. Finally, we explore the relationship between averaging and learning rate annealing and show how to optimally combine the two to achieve better performance.
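A uniform running average of checkpoints, the simplest flavor of weight averaging, can be sketched as follows. This is illustrative only, in the spirit of Polyak/SWA-style averaging; the evaluated techniques also include EMA and schedule-aware variants.

```python
import numpy as np

class RunningAverage:
    """Uniform running average of flattened model weights, updated
    incrementally so no checkpoint history needs to be stored."""

    def __init__(self):
        self.avg, self.n = None, 0

    def update(self, weights):
        self.n += 1
        if self.avg is None:
            self.avg = weights.copy()
        else:
            # incremental mean: avg += (w - avg) / n
            self.avg += (weights - self.avg) / self.n

# noisy "checkpoints" oscillating around an optimum at 1.0
rng = np.random.default_rng(0)
ra = RunningAverage()
for t in range(100):
    ra.update(np.array([1.0]) + rng.normal(0, 0.3, size=1))
```

The averaged weights sit much closer to the optimum than any single noisy checkpoint, which is the intuition behind the generalization gains.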
Paperid:2351
Authors:Utkarsh Singhal · Ryan Feng · Stella Yu · Atul Prakash
Abstract:
Real-world perception requires invariance to transformations like viewpoint shifts and lighting changes. While data augmentation and equivariant networks address known transformations during training, they struggle with novel out-of-distribution scenarios. We propose FoCAL, a zero-shot framework that canonicalizes inputs at test time using foundation model priors. FoCAL generates candidate transformations (e.g., 3D novel views) and selects canonical versions by optimizing CLIP and Stable Diffusion energies, requiring no training or architectural changes. Experiments show FoCAL improves robustness for CLIP and SAM across 2D/3D rotations, illumination shifts (contrast and color), and day-night changes in a zero-shot manner. We further demonstrate applications in active vision. Our work challenges the common assumption that invariance requires transform-specific training or architectural compromises.
Paperid:2352
Authors:Yuanjie Shi · Hooman Shahrokhi · Xuesong Jia · Xiongzhi Chen · Jana Doppa · Yan Yan
Abstract:
Conformal prediction (CP) is a promising uncertainty quantification framework which works as a wrapper around a black-box classifier to construct prediction sets (i.e., subsets of candidate classes) with provable guarantees. However, standard calibration methods for CP tend to produce large prediction sets, which makes them less useful in practice. This paper considers the problem of integrating conformal principles into the training process of deep classifiers to directly minimize the size of prediction sets. We formulate conformal training as a bilevel optimization problem and propose the {\em Direct Prediction Set Minimization (DPSM)} algorithm to solve it. The key insight behind DPSM is to minimize a measure of the prediction set size (upper level) that is conditioned on the learned quantile of conformity scores (lower level). We show that DPSM has a learning bound of $O(1/\sqrt{n})$ (with $n$ training samples), while prior conformal training methods based on stochastic approximation of the quantile have a bound of $\Omega(1/s)$ (with batch size $s$, and typically $s \ll \sqrt{n}$). Experiments on various benchmark datasets and deep models show that DPSM significantly outperforms the best prior conformal training baseline, reducing the prediction set size by $20.46\%$ and validating our theory.
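For context, the standard CP wrapper that such training methods build on can be sketched in a few lines. This is the generic split-conformal recipe (with, e.g., one minus the softmax probability of a class as its nonconformity score), not DPSM itself; the score convention (higher = less conforming) is an assumption of this sketch:

```python
import math

def conformal_prediction_set(cal_scores, test_scores, alpha=0.1):
    """Split conformal prediction around a black-box classifier.

    cal_scores: nonconformity scores of the *true* class on calibration data
    (higher = less conforming). test_scores: dict mapping each candidate class
    to its nonconformity score for one test input. Returns a set of classes
    that covers the true class with probability >= 1 - alpha.
    """
    n = len(cal_scores)
    # Finite-sample-corrected (1 - alpha) quantile rank of calibration scores.
    k = math.ceil((n + 1) * (1 - alpha))
    threshold = sorted(cal_scores)[min(k, n) - 1]
    return {c for c, s in test_scores.items() if s <= threshold}
```

Conformal training methods like the one above shrink these sets by shaping the classifier's scores during training rather than only calibrating afterwards.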
Paperid:2353
Authors:Yangyang Shen · Xiao Tan · Dian Shen · Meng Wang · Beilun Wang
Abstract:
Backdoor attacks seriously threaten deep neural networks (DNNs) by embedding concealed vulnerabilities through data poisoning. To counteract these attacks, training benign models from poisoned data has garnered considerable interest from researchers. High-performing defenses often rely on additional clean subsets, which is untenable due to increasing privacy concerns and data scarcity. In the absence of clean subsets, defenders resort to complex feature extraction and analysis, resulting in excessive overhead and compromised performance. To address these challenges, we identify that the key lies in sufficiently utilizing both the easier-to-obtain target labels and clean hard samples. In this work, we propose a Bi-perspective Splitting Defense (BSD). BSD distinguishes clean samples using both semantic and loss-statistic characteristics, through open set recognition-based splitting (OSS) and altruistic model-based data splitting (ALS) respectively. Through extensive experiments on benchmark datasets and against representative attacks, we empirically demonstrate that BSD improves the Defense Effectiveness Rating (DER) by an average of 16.29% compared to 5 state-of-the-art defenses, achieving clean-data-free backdoor security.
Paperid:2354
Authors:Sho Sonoda · Yuka Hashimoto · Isao Ishikawa · Masahiro Ikeda
Abstract:
We present a constructive universal approximation theorem for learning machines equipped with joint-group-equivariant feature maps, called joint-equivariant machines, based on group representation theory. ``Constructive'' here indicates that the distribution of parameters is given in a closed-form expression known as the ridgelet transform. Joint-group-equivariance encompasses a broad class of feature maps that generalize classical group-equivariance. In particular, fully-connected networks are *not* group-equivariant *but* are joint-group-equivariant. Our main theorem also unifies the universal approximation theorems for shallow and deep networks. Until this study, the universality of deep networks had been shown in a different manner from that of shallow networks; our results discuss them on common ground, so the approximation schemes of various learning machines can now be understood in a unified manner. As applications, we show the constructive universal approximation properties of four examples: the depth-$n$ joint-equivariant machine, the depth-$n$ fully-connected network, the depth-$n$ group-convolutional network, and a new depth-$2$ network with quadratic forms whose universality was previously unknown.
Paperid:2355
Authors:Yingying Deng · Xiangyu He · Changwang Mei · Peisong Wang · Fan Tang
Abstract:
Though Rectified Flows (ReFlows) with distillation offer a promising way for fast sampling, fast inversion, which transforms images back to structured noise for reconstruction and subsequent editing, remains unsolved. This paper introduces FireFlow, a simple yet effective zero-shot approach that inherits the startling capacity of ReFlow-based models (such as FLUX) in generation while extending their capabilities to accurate inversion and editing in \textbf{8} steps. We first demonstrate that a carefully designed numerical solver is pivotal for ReFlow inversion, enabling accurate inversion and reconstruction with the precision of a second-order solver while maintaining the practical efficiency of a first-order Euler method. This solver achieves a $3\times$ runtime speedup compared to state-of-the-art ReFlow inversion and editing techniques, while delivering smaller reconstruction errors and superior editing results in a training-free mode.
Paperid:2356
Authors:Hohyun Kim · Seunggeun Lee · Min-hwan Oh
Abstract:
Generative Flow Networks (GFlowNets) offer a powerful framework for sampling graphs in proportion to their rewards. However, existing approaches suffer from systematic biases due to inaccuracies in state transition probability computations. These biases, rooted in the inherent symmetries of graphs, impact both atom-based and fragment-based generation schemes. Symmetry is particularly crucial in graph-based applications such as drug discovery, where molecular structure and function are tightly linked. To address this challenge, we introduce Symmetry-Aware GFlowNets (SA-GFN), a method that incorporates symmetry corrections into the learning process through reward scaling. By integrating bias correction directly into the reward structure, SA-GFN eliminates the need for explicit state transition computations. Empirical results show that SA-GFN enables unbiased sampling while enhancing diversity and consistently generating high-reward graphs that closely match the target distribution.
Paperid:2357
Authors:Fangming Cui · Yonggang Zhang · Xuan Wang · Xinmei Tian · Jun Yu
Abstract:
Recent developments in parameter optimization have significantly improved performance in target-specific tasks. However, these partial parameter optimizing methods often struggle to tackle target-unspecific tasks effectively. This may be attributed to the fact that overfitting during training causes the model to forget general knowledge that strongly benefits target-unspecific tasks. To alleviate this issue, we propose a novel Features Matrix (FM) regularization approach designed to enhance these models on target-unspecific tasks. Our method extracts and leverages general knowledge, shaping a Features Matrix (FM). Specifically, the FM captures the semantics of diverse inputs from a deep and fine-grained perspective, preserving essential general knowledge and mitigating the risk of overfitting. Representative evaluations demonstrate that: 1) the FM is compatible with existing frameworks as a generic and flexible module, and 2) the FM is highly effective in enhancing target-unspecific tasks, achieving state-of-the-art performance.
Paperid:2358
Authors:Kun Cheng · Xiao He · Lei Yu · Zhijun Tu · Mingrui Zhu · Nannan Wang · Xinbo Gao · Jie Hu
Abstract:
Diffusion models have transformed generative modeling but suffer from scalability limitations due to computational overhead and inflexible architectures that process all generative stages and tokens uniformly. In this work, we introduce Diff-MoE, a novel framework that combines Diffusion Transformers with Mixture-of-Experts to exploit both temporal adaptability and spatial flexibility. Our design incorporates expert-specific timestep conditioning, allowing each expert to process different spatial tokens while adapting to the generative stage, so as to dynamically allocate resources based on both the temporal and spatial characteristics of the generative task. Additionally, we propose a globally-aware feature recalibration mechanism that amplifies the representational capacity of expert modules by dynamically adjusting feature contributions based on input relevance. Extensive experiments on image generation benchmarks demonstrate that Diff-MoE significantly outperforms state-of-the-art methods. Our work demonstrates the potential of integrating diffusion models with expert-based designs, offering a scalable and effective framework for advanced generative modeling.
Paperid:2359
Authors:Yassine Abbahaddou · Fragkiskos Malliaros · Johannes Lutzeyer · Amine Aboussalah · Michalis Vazirgiannis
Abstract:
Graph Neural Networks (GNNs) have shown great promise in tasks like node and graph classification, but they often struggle to generalize, particularly to unseen or out-of-distribution (OOD) data. These challenges are exacerbated when training data is limited in size or diversity. To address these issues, we introduce a theoretical framework using Rademacher complexity to compute a regret bound on the generalization error and then characterize the effect of data augmentation. This framework informs the design of GMM-GDA, an efficient graph data augmentation (GDA) algorithm leveraging the capability of Gaussian Mixture Models (GMMs) to approximate any distribution. Our approach not only outperforms existing augmentation techniques in terms of generalization but also offers improved time complexity, making it highly suitable for real-world applications.
Paperid:2360
Authors:Ramya Ramalingam · Shayan Kiyani · Aaron Roth
Abstract:
Existing algorithms for online conformal prediction---guaranteeing marginal coverage in adversarial settings---are variants of online gradient descent (OGD), but their analyses do not follow from the regret guarantee of OGD. What is the relationship between no-regret learning and online conformal prediction? We show that although standard regret guarantees imply marginal coverage in i.i.d. settings, this connection fails as soon as we either move to adversarial environments or ask for group-conditional coverage. On the other hand, we show a tight connection between threshold-calibrated coverage and swap-regret in adversarial settings, which extends to group-conditional (multi-valid) coverage. We also show that algorithms in the follow-the-perturbed-leader family of no-regret learning algorithms (which includes online gradient descent) can be used to give marginal coverage guarantees in adversarial settings, and show how to extend this to group-conditional coverage guarantees. Via this connection we analyze and conduct experiments using a multi-group generalization of the ACI algorithm of Gibbs & Candes (2021).
Paperid:2361
Authors:Samuel Holt · Dhruva Tirumala · Todor Davchev · Ben Moran · Atil Iscen · Antoine Laurens · Yixin Lin · Erik Frey · Francesco Romano · Markus Wulfmeier · Nicolas Heess
Abstract:
High-frequency control in continuous action and state spaces is essential for practical applications in the physical world. Directly applying end-to-end reinforcement learning to high-frequency control tasks struggles with assigning credit to actions across long temporal horizons, compounded by the difficulty of efficient exploration. The alternative, learning low-frequency policies that guide higher-frequency controllers (e.g., proportional-derivative (PD) controllers), can result in a limited total expressiveness of the combined control system, hindering overall performance. We introduce EvoControl, a novel bi-level policy learning framework for learning both a slow high-level policy (using PPO) and a fast low-level policy (using Evolution Strategies) for solving continuous control tasks. Learning with Evolution Strategies for the lower policy allows robust learning over the long horizons that crucially arise when operating at higher frequencies. This enables EvoControl to learn to control interactions at a high frequency, benefiting from more efficient exploration and credit assignment than direct high-frequency torque control, without the need to hand-tune PD parameters. We empirically demonstrate that EvoControl can achieve a higher evaluation reward for continuous-control tasks compared to existing approaches, specifically excelling in tasks where high-frequency control is needed, such as those requiring safety-critical fast reactions.
Paperid:2362
Authors:Osman Berke Guney · Ketan Saichandran · Karim Elzokm · Ziming Zhang · Vijaya Kolachalama
Abstract:
In many practical applications, including medicine, acquiring all relevant data for machine learning models is often infeasible due to constraints on time, cost, and resources. This makes it important to selectively acquire only the most informative features, yet traditional static feature selection methods fall short in scenarios where feature importance varies across instances. Here, we propose an active feature acquisition (AFA) framework, which dynamically selects features based on their importance to each individual case. Our method leverages local explanation techniques to generate instance-specific feature importance rankings. We then reframe the AFA problem as a feature prediction task, introducing a policy network grounded in a decision transformer architecture. This policy network is trained to select the next most informative feature by learning from the feature importance rankings. As a result, features are acquired sequentially, ordered by their predictive significance, leading to more efficient feature selection and acquisition. Extensive experiments on multiple datasets demonstrate that our approach outperforms current state-of-the-art AFA methods in both predictive accuracy and feature acquisition efficiency. These findings highlight the promise of an explainability-driven AFA strategy in scenarios where the cost of feature acquisition is a key concern.
Paperid:2363
Authors:Anh Duc Nguyen · Ilia Markov · Zhengqing Wu · Ali Ramezani-Kebrya · Kimon Antonakopoulos · Dan Alistarh · Volkan Cevher
Abstract:
Modern deep neural networks exhibit heterogeneity across numerous layers of various types, such as residuals, multi-head attention, etc., due to varying structures (dimensions, activation functions, etc.) and distinct representation characteristics, which impact predictions. We develop a general layer-wise quantization framework with tight variance and code-length bounds, adapting to these heterogeneities over the course of training. We then apply a new layer-wise quantization technique within distributed variational inequalities (VIs), proposing a novel Quantized Optimistic Dual Averaging (QODA) algorithm with adaptive learning rates, which achieves competitive convergence rates for monotone VIs. We empirically show that QODA achieves up to a $150\%$ speedup over the baselines in end-to-end training time for training Wasserstein GAN on $12+$ GPUs.
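As background for what "layer-wise quantization" means here, a generic unbiased stochastic quantizer applied per layer looks roughly as follows. This is a textbook sketch under the usual max-norm scaling, not the paper's QODA scheme or its adaptive variant:

```python
import math
import random

def quantize_layer(vec, levels=16, seed=0):
    """Unbiased stochastic uniform quantization of one layer's update vector.

    Each coordinate is scaled by the layer's max magnitude, then rounded up or
    down at random with probability equal to its fractional position on the
    grid, so the quantized vector equals the input in expectation.
    """
    rng = random.Random(seed)
    scale = max(abs(x) for x in vec) or 1.0
    out = []
    for x in vec:
        y = x / scale * levels                   # position on the level grid
        lo = math.floor(y)
        q = lo + (1 if rng.random() < y - lo else 0)
        out.append(q * scale / levels)           # back to original units
    return out
```

Quantizing per layer lets the grid adapt to each layer's scale, which is precisely the kind of heterogeneity the framework above is designed to exploit.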
Paperid:2364
Authors:Heng Dong · Kefei Duan · Chongjie Zhang
Abstract:
Large Language Models (LLMs) have achieved remarkable advancements in natural language processing tasks, yet they encounter challenges in complex decision-making scenarios that require long-term reasoning and alignment with high-level objectives. Existing methods either rely on short-term auto-regressive action generation or face limitations in accurately simulating rollouts and assessing outcomes, leading to sub-optimal decisions. This paper introduces a novel LLM-based Actor-Critic framework, termed LAC, that effectively improves LLM policies with long-term action evaluations in a principled and scalable way. Our approach addresses two key challenges: (1) extracting robust action evaluations by computing Q-values via token logits associated with positive/negative outcomes, enhanced by future trajectory rollouts and reasoning; and (2) enabling efficient policy improvement through a gradient-free mechanism. Experiments across diverse environments -- including high-level decision-making (ALFWorld), low-level action spaces (BabyAI-Text), and large action spaces (WebShop) -- demonstrate the framework’s generality and superiority over state-of-the-art methods. Notably, our approach achieves competitive performance using 7B/8B parameter LLMs, even outperforming baseline methods employing GPT-4 in complex tasks. These results underscore the potential of integrating structured policy optimization with LLMs’ intrinsic knowledge to advance decision-making capabilities in multi-step environments.
Paperid:2365
Authors:Na Lee · Konstantinos Barmpas · Yannis Panagakis · Dimitrios Adamos · Nikolaos Laskaris · Stefanos Zafeiriou
Abstract:
Foundation Models have demonstrated significant success across various domains in Artificial Intelligence (AI), yet their capabilities for brainwave modeling remain unclear. In this paper, we comprehensively evaluate current Large Brainwave Foundation Models (LBMs) through systematic fine-tuning experiments across multiple Brain-Computer Interface (BCI) benchmark tasks, including memory tasks and sleep stage classification. Our extensive analysis shows that state-of-the-art LBMs achieve only marginal improvements (0.5\%) over traditional deep architectures while requiring significantly more parameters (millions vs thousands), raising important questions about their efficiency and applicability in BCI contexts. Moreover, through detailed ablation studies and Low-Rank Adaptation (LoRA), we significantly reduce trainable parameters without performance degradation, while demonstrating that architectural and training inefficiencies limit LBMs' current capabilities. Our experiments span both full model fine-tuning and parameter-efficient adaptation techniques, providing insights into optimal training strategies for BCI applications. We pioneer the application of LoRA to LBMs, revealing that performance benefits generally emerge when adapting multiple neural network components simultaneously. These findings highlight the critical need for domain-specific development strategies to advance LBMs, suggesting that current architectures may require redesign to fully leverage the potential of foundation models in brainwave analysis.
Paperid:2366
Authors:Ryan Liu · Jiayi Geng · Addison J. Wu · Ilia Sucholutsky · Tania Lombrozo · Thomas Griffiths
Abstract:
Chain-of-thought (CoT) prompting has become a widely used strategy for improving large language and multimodal model performance. However, it is still an open question in what settings CoT systematically reduces performance. In this paper, we seek to identify the characteristics of tasks where CoT reduces performance by drawing inspiration from cognitive psychology, focusing on six representative tasks from the psychological literature where deliberation hurts performance in humans. In three of these tasks, state-of-the-art models exhibit significant performance drop-offs with CoT (up to a 36.3\% drop in absolute accuracy for OpenAI o1-preview compared to GPT-4o), while in others, CoT effects are mixed, with positive, neutral, and negative changes. While models and humans do not exhibit perfectly parallel cognitive processes, considering cases where thinking has negative consequences for humans helps identify settings where it negatively impacts models. By connecting the literature on human verbal thinking and deliberation with evaluations of CoT, we offer a perspective for understanding the impact of inference-time reasoning.
Paperid:2367
Authors:Thomas Schmied · Thomas Adler · Vihang Patil · Maximilian Beck · Korbinian Pöppel · Johannes Brandstetter · Günter Klambauer · Razvan Pascanu · Sepp Hochreiter
Abstract:
In recent years, there has been a trend in the field of Reinforcement Learning (RL) towards large action models trained offline on large-scale datasets via sequence modeling. Existing models are primarily based on the Transformer architecture, which results in powerful agents. However, due to slow inference times, Transformer-based approaches are impractical for real-time applications, such as robotics. Recently, modern recurrent architectures, such as xLSTM and Mamba, have been proposed that exhibit parallelization benefits during training similar to the Transformer architecture while offering fast inference. In this work, we study the aptitude of these modern recurrent architectures for large action models. Consequently, we propose a Large Recurrent Action Model (LRAM) with an xLSTM at its core that comes with linear-time inference complexity and natural sequence length extrapolation abilities. Experiments on 432 tasks from 6 domains show that LRAM compares favorably to Transformers in terms of performance and speed.
Paperid:2368
Authors:Ilias Diakonikolas · Daniel Kane · Jasper C.H. Lee · Thanasis Pittas · David Woodruff · Samson Zhou
Abstract:
We study the problem of distributed distinct element estimation, where $\alpha$ servers each receive a subset of a universe $[n]$ and aim to compute a $(1+\varepsilon)$-approximation to the number of distinct elements using minimal communication. While prior work establishes a worst-case bound of $\Theta\left(\alpha\log n+\frac{\alpha}{\varepsilon^2}\right)$ bits, these results rely on assumptions that may not hold in practice. We introduce a new parameterization based on the number $C = \frac{\beta}{\varepsilon^2}$ of pairwise collisions, i.e., instances where the same element appears on multiple servers, and design a protocol that uses only $O\left(\alpha\log n\log\log n+\frac{\sqrt{\beta}}{\varepsilon^2} \log n\right)$ bits, breaking previous lower bounds when $C$ is small. We further improve our algorithm under assumptions on the number of distinct elements or collisions and provide matching lower bounds in all regimes, establishing $C$ as a tight complexity measure for the problem. Finally, we consider streaming algorithms for distinct element estimation parameterized by the number of items with frequency larger than $1$. Overall, our results offer insight into why statistical problems with known hardness results can be efficiently solved in practice.
Paperid:2369
Authors:Gabriele DAcunto · Fabio Massimo Zennaro · Yorgos Felekis · Paolo Di Lorenzo
Abstract:
Structural causal models (SCMs) allow us to investigate complex systems at multiple levels of resolution. The causal abstraction (CA) framework formalizes the mapping between high- and low-level SCMs. We address CA learning in a challenging and realistic setting, where SCMs are inaccessible, interventional data is unavailable, and sample data is misaligned. A key principle of our framework is semantic embedding, formalized as the high-level distribution lying on a subspace of the low-level one. This principle naturally links linear CA to the geometry of the Stiefel manifold. We present a category-theoretic approach to SCMs that enables the learning of a CA by finding a morphism between the low- and high-level probability measures, adhering to the semantic embedding principle. Consequently, we formulate a general CA learning problem. As an application, we solve the latter problem for linear CA, considering Gaussian measures and the Kullback-Leibler divergence as the objective. Given the nonconvexity of the learning task, we develop three algorithms building upon existing paradigms for Riemannian optimization. We demonstrate that the proposed methods succeed on both synthetic and real-world brain data with different degrees of prior information about the structure of CA.
Paperid:2370
Authors:Yuxin Zuo · Shang Qu · Yifei Li · Zhang-Ren Chen · Xuekai Zhu · Ermo Hua · Kaiyan Zhang · Ning Ding · Bowen Zhou
Abstract:
We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes $4,460$ questions spanning $17$ specialties and $11$ body systems. It includes two subsets, Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability. We evaluate $16$ leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models.
Paperid:2371
Authors:Yaopei Zeng · Yuanpu Cao · Bochuan Cao · Yurui Chang · Jinghui Chen · Lu Lin
Abstract:
Recent advances in diffusion models have significantly enhanced the quality of image synthesis, yet they have also introduced serious safety concerns, particularly the generation of Not Safe for Work (NSFW) content. Previous research has demonstrated that adversarial prompts can be used to generate NSFW content. However, such adversarial text prompts are often easily detectable by text-based filters, limiting their efficacy. In this paper, we expose a previously overlooked vulnerability: adversarial image attacks targeting Image-to-Image (I2I) diffusion models. We propose AdvI2I, a novel framework that manipulates input images to induce diffusion models to generate NSFW content. By optimizing a generator to craft adversarial images, AdvI2I circumvents existing defense mechanisms, such as Safe Latent Diffusion (SLD), without altering the text prompts. Furthermore, we introduce AdvI2I-Adaptive, an enhanced version that adapts to potential countermeasures and minimizes the resemblance between adversarial images and NSFW concept embeddings, making the attack more resilient against defenses. Through extensive experiments, we demonstrate that both AdvI2I and AdvI2I-Adaptive can effectively bypass current safeguards, highlighting the urgent need for stronger security measures to address the misuse of I2I diffusion models.
Paperid:2372
Authors:Xi Wang · Laurence Aitchison
Abstract:
The scaling of the optimal AdamW weight decay hyperparameter with model and dataset size is critical as we seek to build larger models, but is poorly understood. We show that weights learned by AdamW can be understood as an exponential moving average (EMA) of recent updates. This gives critical insights for how to set the weight decay in AdamW, and how the weight decay should scale with model and dataset size. In particular, the key hyperparameter for an exponential moving average is the EMA timescale. Intuitively, the EMA timescale can be understood as the number of recent iterations the EMA averages over. We find that the optimal timescale, measured in epochs, is roughly constant as we change model and dataset size. Moreover, given a learning rate, there is a one-to-one mapping from the EMA timescale to the weight decay hyperparameter. Thus, if the optimal EMA timescale is constant, that implies that as the dataset size increases, the optimal weight decay should fall and as the model size increases, the optimal weight decay should increase (if we follow the muP recommendation for scaling the learning rate). We validate these scaling rules on ResNet-18 and Vision Transformers trained on CIFAR-10 and ImageNet, and on NanoGPT pre-training on OpenWebText. Finally, we found that as training progresses, muP's learning rate scaling breaks down for AdamW unless weight decay is scaled appropriately.
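The claimed one-to-one mapping can be made concrete. With decoupled weight decay, each AdamW step multiplies the weights by $(1 - \eta\lambda)$ for learning rate $\eta$ and weight decay $\lambda$, so the EMA forgets on a timescale of roughly $1/(\eta\lambda)$ iterations. Inverting this gives a weight decay for a target timescale; the exact constant here is a back-of-the-envelope reading of the abstract, not the paper's formula:

```python
def weight_decay_for_timescale(lr, timescale_epochs, iters_per_epoch):
    """Weight decay whose implied EMA timescale matches a target in epochs.

    Under the EMA view of AdamW, the averaging timescale is about
    1 / (lr * wd) iterations, so wd = 1 / (lr * timescale_in_iterations).
    """
    return 1.0 / (lr * timescale_epochs * iters_per_epoch)
```

Note how this reproduces the scaling rule in the abstract: doubling the dataset doubles iters_per_epoch at fixed batch size, which halves the implied optimal weight decay.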
Paperid:2373
Authors:Gleb Rodionov · Liudmila Prokhorenkova
Abstract:
Neural algorithmic reasoning aims to capture computations with neural networks via learning the models to imitate the execution of classic algorithms. While common architectures are expressive enough to contain the correct model in the weights space, current neural reasoners are struggling to generalize well on out-of-distribution data. On the other hand, classic computations are not affected by distributional shifts as they can be described as transitions between discrete computational states. In this work, we propose to force neural reasoners to maintain the execution trajectory as a combination of finite predefined states. To achieve that, we separate discrete and continuous data flows and describe the interaction between them. Trained with supervision on the algorithm's state transitions, such models are able to perfectly align with the original algorithm. To show this, we evaluate our approach on multiple algorithmic problems and get perfect test scores both in single-task and multitask setups. Moreover, the proposed architectural choice allows us to prove the correctness of the learned algorithms for any test data.
Paperid:2374
Authors:FENGYUN WANG · Sicheng Yu · Jiawei Wu · Jinhui Tang · Hanwang Zhang · Qianru Sun
Abstract:
Large vision-language models (LVLMs) have significantly advanced numerous fields. In this work, we explore how to harness their potential to address 3D scene understanding tasks, using 3D question answering (3D-QA) as a representative example. Due to the limited training data in 3D, we do not train LVLMs but infer in a zero-shot manner. Specifically, we sample 2D views from a 3D point cloud and feed them into 2D models to answer a given question. When the 2D model is chosen, e.g., LLAVA-OV, the quality of sampled views matters the most. We propose cdViews, a novel approach to automatically selecting critical and diverse Views for 3D-QA. cdViews consists of two key components: viewSelector, which prioritizes critical views based on their potential to provide answer-specific information, and viewNMS, which enhances diversity by removing redundant views based on spatial overlap. We evaluate cdViews on the widely-used ScanQA and SQA benchmarks, demonstrating that it achieves state-of-the-art performance in 3D-QA while relying solely on 2D models without fine-tuning. These findings support our belief that 2D LVLMs are currently the most effective alternative to resource-intensive 3D LVLMs for addressing 3D tasks. Our code is in the Appendix.
Paperid:2375
Authors:Jean-Francois Ton · Muhammad Faaiz Taufiq · Yang Liu
Abstract:
Large Language Models (LLMs) have shown impressive performance in complex reasoning tasks through the use of Chain-of-Thought (CoT) reasoning, allowing models to break down problems into manageable sub-tasks. However, existing CoT evaluation techniques either require annotated CoT data or fall short of accurately assessing intermediate reasoning steps, leading to high rates of false positives. In this paper, we formalize CoT reasoning in LLMs through an information-theoretic lens. Specifically, our framework quantifies the `information gain' at each reasoning step, enabling the identification of failure modes in LLMs without the need for expensive annotated datasets. We demonstrate the efficacy of our approach through extensive experiments on toy arithmetic, GSM8K and PRM800k datasets, where it significantly outperforms existing outcome-based methods by providing more accurate insights into model performance on individual tasks.
Paperid:2376
Authors:Henrik von Kleist · Joshua Wendland · Ilya Shpitser · Carsten Marr
Abstract:
Feature importance metrics are essential for understanding the relevance of individual features and ensuring the explainability of machine learning models. Since missing data frequently arises in real-world settings, it is important to evaluate feature importance unambiguously in such cases. We introduce the distinction between two evaluation frameworks for feature importance under missing data: (1) importance under the full data, assuming no missingness, and (2) importance under the observed data, reflecting the measurement policy that governs data collection. While the full data perspective offers insights into the data-generating process, it often relies on unrealistic assumptions and cannot guide decisions when missingness persists at deployment. Since neither framework directly informs improvements in data collection, we additionally introduce the feature measurement importance gradient (FMIG), a novel, model-agnostic metric that identifies which features should be measured more frequently to enhance predictive performance. Using synthetic data, we illustrate key differences between these metrics and the risks of conflating them.
Paperid:2377
Authors:Kaicheng Fu · Changde Du · Jie Peng · Kunpeng Wang · Shuangchen Zhao · Xiaoyu Chen · Huiguang He
Abstract:
Emotion recognition systems face significant challenges in real-world applications, where novel emotion categories continually emerge and multiple emotions often co-occur. This paper introduces multi-label fine-grained class incremental emotion decoding, which aims to develop models capable of incrementally learning new emotion categories while maintaining the ability to recognize multiple concurrent emotions. We propose an Augmented Emotional Semantics Learning (AESL) framework to address two critical challenges: past- and future-missing partial label problems. AESL incorporates an augmented Emotional Relation Graph (ERG) for reliable soft label generation and affective dimension-based knowledge distillation for future-aware feature learning. We evaluate our approach on three datasets spanning brain activity and multimedia domains, demonstrating its effectiveness in decoding up to 28 fine-grained emotion categories. Results show that AESL significantly outperforms existing methods while effectively mitigating catastrophic forgetting.
Paperid:2378
Authors:Han Zhong · Zikang Shan · Guhao Feng · Wei Xiong · Xinle Cheng · Li Zhao · Di He · Jiang Bian · Liwei Wang
Abstract:
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards---a challenging scenario in traditional deep reinforcement learning. Despite the great successes of PPO in the alignment of state-of-the-art closed-source large language models (LLMs), its open-source implementation is still largely sub-optimal, as widely reported by numerous research studies. To address these issues, we introduce a framework that models RLHF problems as a Markov decision process (MDP), enabling the capture of fine-grained token-wise information. Furthermore, we provide theoretical insights that demonstrate the superiority of our MDP framework over the previous sentence-level bandit formulation. Under this framework, we introduce an algorithm, dubbed Reinforced Token Optimization (\texttt{RTO}), which learns the token-wise reward function from preference data and performs policy optimization based on this learned token-wise reward signal. Theoretically, \texttt{RTO} is proven to have the capability of finding the near-optimal policy sample-efficiently. For its practical implementation, \texttt{RTO} innovatively integrates Direct Preference Optimization (DPO) and PPO. DPO, originally derived from sparse sentence rewards, surprisingly provides us with a token-wise characterization of response quality, which is seamlessly incorporated into our subsequent PPO training stage. We conduct extensive experiments to evaluate \texttt{RTO} against PPO and other direct preference learning algorithms. The results highlight the effectiveness of RTO, with the algorithm outperforming PPO by 7.5 points on the AlpacaEval 2 benchmark and by 4.1 points on Arena-Hard.
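The standard DPO objective that RTO builds on can be sketched for a single preference pair; this is a minimal illustration with made-up log-probabilities, and the token-wise reward decomposition described in the abstract is not reproduced here.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid of the reward margin.

    The implicit reward of a response is beta * (policy log-prob minus
    reference log-prob); DPO pushes the chosen (w) response's implicit
    reward above the rejected (l) one's.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log(1.0 + math.exp(-margin))  # == -log(sigmoid(margin))

# Illustrative sequence log-probabilities (not from any real model).
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0,
                ref_logp_w=-11.0, ref_logp_l=-11.0, beta=0.1)
```

Because a sequence log-probability is a sum of token log-probabilities, the implicit reward decomposes over tokens, which is the kind of token-wise signal the abstract refers to.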
Paperid:2379
Authors:Haosen Ge · Hamsa Bastani · Osbert Bastani
Abstract:
Conformal prediction has emerged as an effective strategy for uncertainty quantification by modifying a model to output sets of labels instead of a single label. These prediction sets come with the guarantee that they contain the true label with high probability. However, conformal prediction typically requires a large calibration dataset of i.i.d. examples. We consider the online learning setting, where examples arrive over time, and the goal is to construct prediction sets dynamically. Departing from existing work, we assume semi-bandit feedback, where we only observe the true label if it is contained in the prediction set. For instance, consider calibrating a document retrieval model to a new domain; in this setting, a user would only be able to provide the true label if the target document is in the prediction set of retrieved documents. We propose a novel conformal prediction algorithm targeted at this setting, and prove that it obtains sublinear regret compared to the optimal conformal predictor. We evaluate our algorithm on a retrieval task, an image classification task, and an auction price-setting task, and demonstrate that it empirically achieves good performance compared to several baselines.
Paperid:2380
Authors:Antoine Dedieu · Joseph Ortiz · Xinghua Lou · Carter Wendelken · Wolfgang Lehrach · J Swaroop Guntupalli · Miguel Lazaro-Gredilla · Kevin Murphy
Abstract:
We present an approach to model-based RL that achieves new state-of-the-art performance on the challenging Craftax-classic benchmark, an open-world 2D survival game that requires agents to exhibit a wide range of general abilities---such as strong generalization, deep exploration, and long-term reasoning. With a series of careful design choices aimed at improving sample efficiency, our MBRL algorithm achieves a reward of 67.42% after only 1M environment steps, significantly outperforming DreamerV3, which achieves 53.2%, and, for the first time, exceeds human performance of 65.0%. Our method starts by constructing a SOTA model-free baseline, using a novel policy architecture that combines CNNs and RNNs. We then add three improvements to the standard MBRL setup: (a) "Dyna with warmup", which trains the policy on real and imaginary data, (b) a "nearest neighbor tokenizer" on image patches, which improves the scheme used to create the transformer world model (TWM) inputs, and (c) "block teacher forcing", which allows the TWM to reason jointly about the future tokens of the next timestep.
Paperid:2381
Authors:Luyang Liu · Jonas Pfeiffer · Jiaxing Wu · Jun Xie · Arthur Szlam
Abstract:
Techniques enabling large language models (LLMs) to "think more" by generating and attending to intermediate reasoning steps have shown promise in solving complex problems. However, the standard approaches generate sequences of discrete tokens immediately before responding, and so they can incur significant latency costs and be challenging to optimize. In this work, we demonstrate that a frozen LLM can be augmented with an offline coprocessor that operates on the model's key-value (kv) cache. This coprocessor augments the cache with a set of latent embeddings designed to improve the fidelity of subsequent decoding. We train this coprocessor using the language modeling loss from the decoder on standard pretraining data, while keeping the decoder itself frozen. This approach enables the model to learn, in an end-to-end differentiable fashion, how to distill additional computation into its kv-cache. Because the decoder remains unchanged, the coprocessor can operate offline and asynchronously, and the language model can function normally if the coprocessor is unavailable or if a given cache is deemed not to require extra computation. We show experimentally that when a cache is augmented, the decoder achieves lower perplexity on numerous subsequent tokens. Furthermore, even without any task-specific training, our experiments demonstrate that cache augmentation consistently improves performance across a range of reasoning-intensive tasks.
Paperid:2382
Authors:Yiyang Fang · Jian Liang · Wenke Huang · He Li · Kehua Su · Mang Ye
Abstract:
Multimodal large language models (MLLMs) have achieved impressive progress in tasks such as visual question answering and visual understanding, but they still face significant challenges in emotional reasoning. Current methods to enhance emotional understanding typically rely on fine-tuning or manual annotations, which are resource-intensive and limit scalability. In this work, we focus on improving the ability of MLLMs to capture emotions during the inference phase. Specifically, MLLMs encounter two main issues: they struggle to distinguish between semantically similar emotions, leading to misclassification, and they are overwhelmed by redundant or irrelevant visual information, which distracts from key emotional cues. To address these, we propose Sharpening Emotion Perception in MLLMs (SEPM), which incorporates a Confidence-Guided Coarse-to-Fine Inference framework to refine emotion classification by guiding the model through simpler tasks. Additionally, SEPM employs Focus-on-Emotion Visual Augmentation to reduce visual redundancy by directing the attention of models to relevant emotional cues in images. Experimental results demonstrate that SEPM significantly improves MLLM performance on emotion-related tasks, providing a resource-efficient and scalable solution for emotion recognition.
Paperid:2383
Authors:Kangjie Zheng · Junwei Yang · Siyue Liang · Bin Feng · Zequn Liu · Wei Ju · Zhiping Xiao · Ming Zhang
Abstract:
Masked Language Models (MLMs) have achieved remarkable success in many self-supervised representation learning tasks. MLMs are trained by randomly masking portions of the input sequences with $\texttt{[MASK]}$ tokens and learning to reconstruct the original content based on the remaining context. This paper explores the impact of $\texttt{[MASK]}$ tokens on MLMs. Analytical studies show that masking tokens can introduce the \textit{corrupted semantics} problem, wherein the corrupted context may convey multiple, ambiguous meanings. This problem is also a key factor affecting the performance of MLMs on downstream tasks. Based on these findings, we propose a novel enhanced-context MLM, ExLM. Our approach expands $\texttt{[MASK]}$ tokens in the input context and models the dependencies between these expanded states. This enhancement increases context capacity and enables the model to capture richer semantic information, effectively mitigating the corrupted semantics problem during pre-training. Experimental results demonstrate that ExLM achieves significant performance improvements in both text modeling and SMILES modeling tasks. Further analysis confirms that ExLM enriches semantic representations through context enhancement and effectively reduces the semantic multimodality commonly observed in MLMs. Source codes of ExLM will be released upon publication.
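The standard MLM masking step the abstract describes can be sketched as follows; this is a minimal illustration (real recipes like BERT's also sometimes keep or randomly replace selected tokens), and the sentence and seed below are arbitrary.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=1):
    """Minimal MLM masking: replace randomly selected positions with [MASK].

    labels[i] holds the original token at masked positions (the
    reconstruction target) and None elsewhere (no loss is computed there).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

masked, labels = mask_tokens("the cat sat on the mat".split())
```

The "corrupted semantics" issue arises because the remaining context around a [MASK] position may be consistent with several different original tokens.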
Paperid:2384
Authors:Aditya Gorla · Ryan Wang · Zhengtong Liu · Ulzee An · Sriram Sankararaman
Abstract:
We present CACTI, a new masked autoencoding approach for imputing tabular data that leverages structure in missingness patterns and contextual information. Our approach employs a novel median truncated copy masking training strategy that encourages the model to learn from patterns of missingness that are observed in the data. It also incorporates semantic relationships between features — captured by column names and text descriptions — to better represent feature dependence, reducing the reliance on learning feature relationships solely from limited observational data. These dual sources of inductive bias enable CACTI to outperform state-of-the-art methods — a 7.8% average gain across 10 datasets over the next best method (13.4%, 6.1%, and 5.3% under missing not at random, missing at random, and missing completely at random, respectively) — across a diverse range of open datasets and missingness conditions. Our results highlight the value of dataset-specific context and missingness patterns towards significantly enhancing imputation performance.
Paperid:2385
Authors:Yuhua Zhou · Ruifeng Li · Changhai Zhou · Fei Yang · Aimin PAN
Abstract:
Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning method for large language models (LLMs) to adapt to downstream tasks. However, in scenarios where multiple LoRA models are deployed simultaneously, standard LoRA introduces substantial trainable parameters, resulting in significant memory overhead and inference latency, particularly when supporting thousands of downstream tasks on a single server. While existing methods reduce stored parameters via parameter sharing, they fail to capture both local and global information simultaneously. To address this issue, we propose Bi-Share LoRA (BSLoRA), which extends local LoRA with intra-LoRA and inter-LoRA parameter sharing to better capture local and global information. This approach reduces trainable parameters while maintaining or even enhancing model performance. Additionally, we design three transformation methods to improve the compatibility and collaborative efficiency of shared parameters with varying shapes, enhancing overall adaptability. Experiments on the 7B, 8B, and 13B versions of Llama show that BSLoRA, with only 44.59% of the parameters of standard LoRA, outperforms LoRA by approximately 0.33% on commonsense reasoning and 2.08% on MMLU benchmarks.
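The standard LoRA update that BSLoRA extends can be sketched as y = Wx + (alpha/r)·BAx, where only the low-rank factors A and B are trainable; the shapes and values below are toy assumptions, not BSLoRA's sharing scheme.

```python
def lora_forward(x, W, A, B, alpha=16, r=2):
    """y = W x + (alpha / r) * B (A x).

    W (d_out x d_in) is frozen; only A (r x d_in) and B (d_out x r) are
    trained, cutting trainable parameters from d_out*d_in to r*(d_in+d_out).
    """
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

# Toy 2x2 example; B starts at zero, so the adapter is initially a no-op.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.1, 0.0], [0.0, 0.1]]
B = [[0.0, 0.0], [0.0, 0.0]]
y = lora_forward([1.0, 2.0], W, A, B)
```

Parameter-sharing schemes like BSLoRA's reduce memory by reusing parts of A and B across adapters rather than storing independent factors per task.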
Paperid:2386
Authors:Prasanna Mayilvahanan · Thaddäus Wiedemer · Sayak Mallick · Matthias Bethge · Wieland Brendel
Abstract:
Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data and tokenizer determine the scaling trend. In contrast, model size, optimization hyperparameters, and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, have limited impact. Consequently, practitioners should carefully curate suitable pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.
Paperid:2387
Authors:Peng Wang · Yifu Lu · Yaodong Yu · Druv Pai · Qing Qu · Yi Ma
Abstract:
Despite the popularity of transformers in practice, their architectures are empirically designed and neither mathematically justified nor interpretable. Moreover, as indicated by many empirical studies, some components of transformer architectures may be redundant. To derive a fully interpretable transformer architecture with only necessary components, we contend that the goal of representation learning is to compress a set of noisy initial token representations towards a mixture of low-dimensional subspaces. To compress these noisy token representations, an associated denoising operation naturally takes the form of a multi-head (subspace) self-attention. By unrolling such iterative denoising operations into a deep network, we arrive at a highly compact architecture that consists of only self-attention operators with skip connections at each layer. Moreover, we show that each layer performs highly efficient denoising: it improves the signal-to-noise ratio of token representations at a linear rate with respect to the number of layers. Despite its simplicity, extensive experiments on vision and language tasks demonstrate that such a transformer can already achieve performance close to that of standard transformer architectures such as GPT-2 and CRATE.
Paperid:2388
Authors:Jiaze Xu · Shiyu Xia · Xu Yang · JIAQI LYU · Xin Geng
Abstract:
Appropriate parameter initialization strategies are essential for reducing the high computational costs of training large pretrained models in various task scenarios. Graph HyperNetwork (GHN), a parameter initialization method, has recently demonstrated strong performance in initializing models. However, GHN still faces several challenges, including limited effectiveness in initializing larger models, poor performance on smaller datasets, and the requirement of task-specific GHN training, where each new task necessitates retraining the GHN model, leading to increased computational and storage overhead. To overcome these challenges, motivated by the recently proposed Learngene framework, we propose a novel method called Task-Aware Learngene (TAL). Briefly, our approach pretrains a TAL model under the guidance of a well-trained model and then performs multi-task tuning to obtain a shared TAL model that enables parameter prediction based on both model architectures and task-specific characteristics. Extensive experiments show the superiority of TAL. Models initialized with TAL outperform those initialized using the GHN method by an average of 24.39\% in terms of accuracy across the Decathlon datasets.
Paperid:2389
Authors:Yudong W Xu · Wenhao Li · Scott Sanner · Elias Khalil
Abstract:
Constraint Satisfaction Problems (CSPs) find use in many applications, and thus accelerating their solution with machine learning is of wide interest. Most existing approaches rely on supervised learning from feasible solutions or reinforcement learning, paradigms that require either feasible solutions to these NP-Complete CSPs or large training budgets and a complex expert-designed reward signal. To address these challenges, we propose ConsFormer, a self-supervised framework that leverages a Transformer as a solution refiner. ConsFormer constructs a solution to a CSP iteratively in a process that mimics local search. Instead of using feasible solutions as labeled data, we devise differentiable approximations to the discrete constraints of a CSP to guide model training. Our model is trained to improve random assignments for a single step but is deployed iteratively at test time, circumventing the bottlenecks of supervised and reinforcement learning. Our method can tackle out-of-distribution CSPs simply through additional iterations.
Paperid:2390
Authors:Yue Liu · Xiaoxin He · Miao Xiong · Jinlan Fu · Shumin Deng · YINGWEI MA · Jiaheng Zhang · Bryan Hooi
Abstract:
This paper proposes a simple yet effective jailbreak attack named FlipAttack against black-box LLMs. First, based on the autoregressive nature of LLMs, we reveal that they tend to understand text from left to right and struggle to comprehend it when a perturbation is added to the left side. Motivated by these insights, we propose to disguise the harmful prompt by constructing a left-side perturbation merely based on the prompt itself, then generalize this idea to 4 flipping modes. Second, we verify the strong ability of LLMs to perform the text-flipping task and then develop 4 variants to guide LLMs to understand and execute harmful behaviors accurately. These designs keep FlipAttack universal, stealthy, and simple, allowing it to jailbreak black-box LLMs within only 1 query. Experiments on 8 LLMs demonstrate the superiority of FlipAttack. Remarkably, it achieves $\sim$78.97\% attack success rate across 8 LLMs on average and $\sim$98\% bypass rate against 5 guard models on average.
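Two flipping transformations of the kind alluded to can be illustrated generically as simple string operations on a benign example; the paper's exact four modes are not reproduced here.

```python
def flip_chars(prompt):
    """Reverse the prompt character by character."""
    return prompt[::-1]

def flip_words(prompt):
    """Reverse the word order, keeping each word intact."""
    return " ".join(reversed(prompt.split()))

# Benign illustrative prompt; flipping is trivially invertible,
# which is why a model can be asked to recover the original text.
p = "pack my box with five dozen jugs"
flipped = flip_chars(p)
unflipped = flip_chars(flipped)  # flipping twice recovers the original
```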
Paperid:2391
Authors:Michael Sun · Gang Liu · Weize Yuan · Wojciech Matusik · Jie Chen
Abstract:
Recent data-efficient molecular generation approaches exploit graph grammars to introduce interpretability into the generative models. However, grammar learning therein relies on expert annotation or unreliable heuristics for algorithmic inference. We propose Foundation Molecular Grammar (FMG), which leverages multi-modal foundation models (MMFMs) to induce an interpretable molecular language. By exploiting the chemical knowledge of an MMFM, FMG renders molecules as images, describes them as text, and aligns information across modalities using prompt learning. FMG can be used as a drop-in replacement for the prior grammar learning approaches in molecular generation and property prediction. We show that FMG not only excels in synthesizability, diversity, and data efficiency but also offers built-in chemical interpretability for automated molecular discovery workflows.
Paperid:2392
Authors:Ningyi Liao · Zihao Yu · Ruixiao Zeng · Siqiang Luo
Abstract:
Graph Neural Networks (GNNs) have shown promising performance, but at the cost of resource-intensive operations on graph-scale matrices. To reduce computational overhead, previous studies attempt to sparsify the graph or network parameters, but with limited flexibility and precision boundaries. In this work, we propose Unifews, a joint sparsification technique to unify graph and weight matrix operations and enhance GNN learning efficiency. The Unifews design enables adaptive compression across GNN layers with progressively increased sparsity, and is applicable to a variety of architectures with on-the-fly simplification. Theoretically, we establish a novel framework to characterize sparsified GNN learning in view of the graph optimization process, showing that Unifews effectively approximates the learning objective with bounded error and reduced computational overhead. Extensive experiments demonstrate that Unifews achieves efficiency improvements with comparable or better accuracy, including 10-20x matrix operation reduction and up to 100x acceleration on graphs up to billion-edge scale.
Paperid:2393
Authors:Jianqing Zhang · Yang Liu · Jie Fu · Yang Hua · Tianyuan Zou · Jian Cao · Qiang Yang
Abstract:
The rise of generative APIs has fueled interest in privacy-preserving synthetic data generation. While the Private Evolution (PE) algorithm generates Differential Privacy (DP) synthetic images using diffusion model APIs, it struggles with few-shot private data due to the limitations of its DP-protected similarity voting approach. In practice, the few-shot private data challenge is particularly prevalent in specialized domains like healthcare and industry. To address this challenge, we propose a novel API-assisted algorithm, Private Contrastive Evolution (PCE), which iteratively mines inherent inter-class contrastive relationships in few-shot private data beyond individual data points and seamlessly integrates them into an adapted Exponential Mechanism (EM) to optimize DP's utility in an evolution loop. We conduct extensive experiments on four specialized datasets, demonstrating that PCE outperforms PE and other API-assisted baselines. These results highlight the potential of leveraging API access with private data for quality evaluation, enabling the generation of high-quality DP synthetic images and paving the way for more accessible and effective privacy-preserving generative API applications. Our code is available at https://anonymous.4open.science/r/5F1A.
Paperid:2394
Authors:Konstantina Bairaktari · Jiayun Wu · Steven Wu
Abstract:
Conformal prediction is a powerful distribution-free framework for constructing prediction sets with coverage guarantees. Classical methods, such as split conformal prediction, provide marginal coverage, ensuring that the prediction set contains the label of a random test point with a target probability. However, these guarantees may not hold uniformly across different subpopulations, leading to disparities in coverage. Prior work has explored coverage guarantees conditioned on events related to the covariates and label of the test point. We present Kandinsky conformal prediction, a framework that significantly expands the scope of conditional coverage guarantees. In contrast to Mondrian conformal prediction, which restricts its coverage guarantees to disjoint groups—reminiscent of the rigid, structured grids of Piet Mondrian's art—our framework flexibly handles overlapping and fractional group memberships defined jointly on covariates and labels, reflecting the layered, intersecting forms in Wassily Kandinsky's compositions. Our algorithm unifies and extends existing methods, encompassing covariate-based group conditional, class conditional, and Mondrian conformal prediction as special cases, while achieving a minimax-optimal high-probability conditional coverage bound. Finally, we demonstrate the practicality of our approach through empirical evaluation on real-world datasets.
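For reference, the marginal-coverage baseline mentioned here, split conformal prediction, can be sketched in a few lines; the nonconformity score 1 − p(y|x) and the calibration values below are illustrative assumptions, not the paper's method.

```python
import math

def conformal_quantile(cal_scores, alpha=0.1):
    """The ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(cal_scores)[min(k, n) - 1]

def prediction_set(probs, qhat):
    """Keep every label whose nonconformity score 1 - p(y|x) is <= qhat."""
    return [y for y, p in enumerate(probs) if 1.0 - p <= qhat]

# Illustrative calibration scores: 1 - p(true label | x) on held-out data.
cal_scores = [0.1, 0.3, 0.2, 0.4, 0.25, 0.15, 0.35, 0.05, 0.45, 0.3]
qhat = conformal_quantile(cal_scores, alpha=0.2)
pred_set = prediction_set([0.7, 0.2, 0.1], qhat)
```

This construction guarantees only marginal coverage; the group-conditional variants the abstract discusses recalibrate the threshold per (possibly overlapping) group.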
Paperid:2395
Authors:Soobin Um · Beomsu Kim · Jong Chul YE
Abstract:
Minority samples are underrepresented instances located in low-density regions of a data manifold, and are valuable in many generative AI applications, such as data augmentation and creative content generation. Unfortunately, existing diffusion-based minority generators often rely on computationally expensive guidance dedicated to minority generation. To address this, we present a simple yet powerful guidance-free approach called Boost-and-Skip for generating minority samples using diffusion models. The key advantage of our framework is that it requires only two minimal changes to standard generative processes: (i) variance-boosted initialization and (ii) timestep skipping. We highlight that these seemingly trivial modifications are supported by solid theoretical and empirical evidence, thereby effectively promoting the emergence of underrepresented minority features. Our comprehensive experiments demonstrate that Boost-and-Skip greatly enhances the capability of generating minority samples, even rivaling guidance-based state-of-the-art approaches while requiring significantly fewer computations.
Paperid:2396
Authors:Rui Sun · Huayu Mai · Wangkai Li · Yujia Chen · Naisong Luo · Yuan Wang · Tianzhu Zhang
Abstract:
The critical challenge of semi-supervised semantic segmentation lies in how to fully exploit a large volume of unlabeled data to improve the model's generalization performance for robust segmentation. Existing methods mainly rely on confidence-based scoring functions in the prediction space to filter pseudo labels, which suffer from the inherent trade-off between true and false positive rates. In this paper, we carefully design an agent construction strategy to build clean sets of correct (positive) and incorrect (negative) pseudo labels, and propose the Agent Score function (AgScore) to measure the consensus between candidate pixels and these sets. In this way, AgScore takes a step further to capture homogeneous patterns in the embedding space, conditioned on clean positive/negative agents stemming from the prediction space, without sacrificing the merits of the confidence score, yielding a better trade-off. We provide theoretical analysis to understand the mechanism of AgScore, and demonstrate its effectiveness by integrating it into three semi-supervised segmentation frameworks on the Pascal VOC, Cityscapes, and COCO datasets, showing consistent improvements across all data partitions. Code and models will be made available to facilitate future research.
Paperid:2397
Authors:Can Chen · Karla-Luise Herpoldt · Chenchao Zhao · Zichen Wang · Marcus Collins · Shang Shang · Ron Benson
Abstract:
Antibodies are widely used as therapeutics, but their development requires costly affinity maturation, involving iterative mutations to enhance binding affinity. This paper explores a sequence-only scenario for affinity maturation, using solely antibody and antigen sequences. Recently, AlphaFlow wrapped AlphaFold within flow matching to generate diverse protein structures, enabling a sequence-conditioned generative model of structure. Building on this, we propose an \textit{alternating optimization} framework that (1) fixes the sequence to guide structure generation toward high binding affinity using a structure-based predictor, then (2) applies inverse folding to create sequence mutations, refined by a sequence-based predictor. A key challenge is the lack of labeled data for training both predictors. To address this, we develop a \textit{co-teaching} module that incorporates valuable information from noisy biophysical energies into predictor refinement. The sequence-based predictor selects consensus samples to teach the structure-based predictor, and vice versa. Our method, \textit{AffinityFlow}, achieves state-of-the-art performance in affinity maturation experiments.
Paperid:2398
Authors:Boyu Zhang · Aocheng Shen · Bing Liu · Qiankun Zhang · Bin Yuan · Wang · Shenghao Liu · Xianjun Deng
Abstract:
We explore the potential of \emph{AI-enhanced combinatorial optimization theory}, taking online bipartite matching (OBM) as a case study. In the theoretical study of OBM, the \emph{hardness} corresponds to a performance \emph{upper bound} of a specific online algorithm or any possible online algorithms. Typically, these upper bounds derive from challenging instances meticulously designed by theoretical computer scientists. Zhang et al. (ICML 2024) recently provided an example demonstrating how reinforcement learning techniques can enhance the hardness result of a specific OBM model. Their attempt is inspiring but preliminary. It is unclear whether their methods can be applied to other OBM problems with similar breakthroughs. This paper takes a further step by introducing DiMa, a unified and novel framework that aims at understanding the hardness of OBM problems based on denoising diffusion probabilistic models (DDPMs). DiMa models the process of generating hard instances as denoising steps, and optimizes them with a novel reinforcement learning algorithm, named \emph{shortcut policy gradient} (SPG). We first examine DiMa on the classic OBM problem by reproducing its known hardest input instance in the literature. Further, we apply DiMa to two well-known variants of OBM, for which the exact hardness remains an open problem, and we successfully improve their theoretical state-of-the-art upper bounds.
Paperid:2399
Authors:Ahmed Boughdiri · Erwan Scornet · julie Josse
Abstract:
The Risk Difference (RD), an absolute measure of effect, is widely used and well-studied in both randomized controlled trials (RCTs) and observational studies. Complementary to the RD, the Risk Ratio (RR), as a relative measure, is critical for a comprehensive understanding of intervention effects: RD can downplay small absolute changes, while RR can highlight them. Despite its significance, the theoretical study of RR has received less attention, particularly in observational settings. This paper addresses this gap by tackling the estimation of RR in observational data. We propose several RR estimators and establish their theoretical properties, including asymptotic normality and confidence intervals. Through analyses on simulated and real-world datasets, we evaluate the performance of these estimators in terms of bias, efficiency, and robustness to generative data models. We also examine the coverage and length of the associated confidence intervals. Due to the non-linear nature of RR, influence function theory yields two distinct efficient estimators with different convergence assumptions. Based on theoretical and empirical insights, we recommend, among all estimators, one of the two doubly-robust estimators, which, intriguingly, challenges conventional expectations.
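The complementarity of the two measures is easy to see numerically: with outcome risks p1 and p0 in the treated and control groups, RD = p1 − p0 while RR = p1/p0, and a tiny absolute difference can correspond to a doubling of risk. The risks below are illustrative.

```python
def risk_difference(p_treated, p_control):
    """Absolute effect measure: difference in outcome risk."""
    return p_treated - p_control

def risk_ratio(p_treated, p_control):
    """Relative effect measure: ratio of outcome risks."""
    return p_treated / p_control

# A small absolute change can still be a large relative one:
rd = risk_difference(0.002, 0.001)  # looks negligible in absolute terms
rr = risk_ratio(0.002, 0.001)       # yet the risk doubles
```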
Paperid:2400
Authors:Jake Robertson · Noor Awad · Noah Hollmann · Frank Hutter · Samuel Gabriel Müller
Abstract:
Machine learning (ML) systems are utilized in critical sectors such as healthcare, law enforcement, and finance, but often rely on historical data that contains demographic biases, leading to decisions that perpetuate or intensify existing inequalities. Causal and counterfactual fairness provide a transparent, human-in-the-loop framework to mitigate algorithmic discrimination, aligning closely with legal doctrines of direct and indirect discrimination. However, current causal fairness frameworks hold a key limitation in that they assume prior knowledge of the correct causal model, restricting their applicability in complex fairness scenarios where causal models are unknown or difficult to identify. To bridge this gap, we propose FairPFN, a tabular foundation model pre-trained on synthetic causal fairness data to identify and mitigate the causal effects of protected attributes in its predictions. FairPFN's key contribution is that it requires no knowledge of the causal model and demonstrates strong performance across a diverse set of hand-crafted and real-world causal scenarios relative to robust baseline methods. FairPFN paves the way for a promising direction for future research, making causal fairness more accessible to a wider variety of complex fairness problems.
Paperid:2401
Authors:Jimmy Wang · Tom Zollo · Richard Zemel · Hongseok Namkoong
Abstract:
Eliciting information to reduce uncertainty on a latent entity is a critical skill in many application domains, e.g., assessing individual student learning outcomes, diagnosing underlying diseases, or learning user preferences. Though natural language is a powerful medium for this purpose, large language models (LLMs) and existing fine-tuning algorithms lack mechanisms for strategically gathering information to refine their own understanding of the latent entity. We propose an adaptive elicitation framework that actively reduces uncertainty on the latent entity by simulating counterfactual responses. Since probabilistic modeling of an abstract latent entity is difficult, we validate and fine-tune LLM-based uncertainty quantification methods using perplexity over masked future observations produced by the latent entity. Our framework enables the development of sophisticated information-gathering strategies, and we demonstrate its versatility through experiments on dynamic opinion polling and adaptive student assessment.
Paperid:2402
Authors:Liang Huang · Benedict Lee · Daniel Ng · Kelin Xia
Abstract:
Text classification is a fundamental task in Natural Language Processing (NLP). Short text classification has recently attracted much attention due to the growing volume of short texts from various sources with limited labels, and due to the inherent challenges posed by their sparsity in words and semantics. Recent studies have adopted self-supervised contrastive learning across different representations to improve performance. However, most current models face several challenges. First, the augmentation step might not generate positive and negative samples that are, respectively, semantically similar and dissimilar to the anchor. Second, enriching the text data with external auxiliary information might introduce noise into the sparse text data. In addition, such models are limited in capturing higher-order information such as group-wise interactions. In this work, we propose a novel document simplicial complex construction based on text data for a higher-order message-passing mechanism. We enhance short text classification performance by contrasting the structural representation with the sequential representation generated by the transformer mechanism, yielding improved outcomes and mitigating the above issues. The proposed framework, Contrastive Learning with Simplicial Convolutional Networks (C-SCN), leverages the expressive power of graph neural networks, models higher-order information beyond pair-wise relations, and enriches features through contrastive learning. Experimental results on four benchmark datasets demonstrate the capability of C-SCN to outperform existing models in analysing sequential and complex short-text data.
Paperid:2403
Authors:Pedro Santos · Alberto Sardinha · Francisco S. Melo
Abstract:
The general-utility Markov decision processes (GUMDPs) framework generalizes the MDPs framework by considering objective functions that depend on the frequency of visitation of state-action pairs induced by a given policy. In this work, we contribute the first analysis of the impact of the number of trials, i.e., the number of randomly sampled trajectories, in infinite-horizon GUMDPs. We show that, as opposed to standard MDPs, the number of trials plays a key role in infinite-horizon GUMDPs, and the expected performance of a given policy depends, in general, on the number of trials. We consider both discounted and average GUMDPs, where the objective function depends, respectively, on discounted and average frequencies of visitation of state-action pairs. First, we study policy evaluation under discounted GUMDPs, proving lower and upper bounds on the mismatch between the finite and infinite trials formulations. Second, we address average GUMDPs, studying how different classes of GUMDPs impact the mismatch between the finite and infinite trials formulations. Third, we provide a set of empirical results to support our claims, highlighting how the number of trajectories and the structure of the underlying GUMDP influence policy evaluation.
Paperid:2404
Authors:Clément Bonet · Christophe Vauthier · Anna Korba
Abstract:
Many applications in machine learning involve data represented as probability distributions. The emergence of such data requires radically novel techniques to design tractable gradient flows on probability distributions over this type of (infinite-dimensional) object. For instance, being able to flow labeled datasets is a core task for applications ranging from domain adaptation to transfer learning or dataset distillation. In this setting, each class can be represented by the associated conditional distribution of features, and the dataset is a mixture distribution supported on these classes (which are themselves probability distributions), meaning that labeled datasets can be seen as probability distributions over probability distributions. We endow this space with a metric structure from optimal transport, namely the Wasserstein over Wasserstein (WoW) distance, derive a differential structure on this space, and define WoW gradient flows. The latter enable the design of dynamics over this space that decrease a given objective functional. We apply our framework to transfer learning and dataset distillation tasks, leveraging our gradient flow construction as well as novel tractable functionals that take the form of Maximum Mean Discrepancies with Sliced-Wasserstein based kernels between probability distributions.
Paperid:2405
Authors:Rylan Schaeffer · Hailey Schoelkopf · Brando Miranda · Gabriel Mukobi · Varun Madan · Adam Ibrahim · Herbie Bradley · Stella Biderman · Sanmi Koyejo
Abstract:
Predictable behavior from scaling advanced AI systems is an extremely desirable property for engineers, companies, economists and governments alike, and while a well-established literature exists on how pretraining performance scales, predictable scaling behavior on downstream capabilities remains elusive. While many factors are certainly responsible, this paper shines a light on a significant factor that makes predicting scaling behavior on widely used multiple-choice question answering benchmarks challenging and illuminates a path towards making such downstream evaluations predictable with scale. Using five model families and twelve well-established multiple-choice benchmarks, we show that downstream performance is computed from negative log likelihoods via a sequence of transformations that progressively degrades the statistical relationship between performance and scale. We then reveal the mechanism causing this degradation: downstream metrics require comparing the correct choice against a small number of specific incorrect choices, meaning accurately predicting downstream capabilities requires predicting not just how probability mass concentrates on the correct choice with scale, but also how probability mass fluctuates on specific incorrect choices with scale. We empirically study how probability mass on the correct choice co-varies with probability mass on incorrect choices with increasing compute, suggesting that scaling laws for \textit{incorrect} choices might be achievable. Our work also explains why pretraining scaling laws are commonly regarded as more predictable than downstream capabilities and contributes towards establishing scaling-predictable evaluations of frontier AI models.
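The mechanism the abstract describes can be illustrated with a minimal sketch (our own illustration, not the paper's code): multiple-choice accuracy requires the correct choice's likelihood to beat every specific incorrect choice, which is why fluctuations of probability mass on incorrect choices matter for predictability.

```python
import numpy as np

def choice_accuracy(nll_correct, nll_incorrect):
    """Benchmark accuracy from per-choice negative log likelihoods.

    nll_correct: shape (n_questions,), NLL of the correct choice.
    nll_incorrect: shape (n_questions, k), NLLs of k incorrect choices.
    A question counts as correct only if the correct choice has lower
    NLL than *every* specific incorrect choice, so accuracy depends on
    the incorrect-choice NLLs, not just the correct one.
    """
    return np.mean(np.all(nll_correct[:, None] < nll_incorrect, axis=1))

# Example: three questions, two distractors each.
nll_c = np.array([1.0, 2.0, 0.5])
nll_i = np.array([[2.0, 3.0], [1.0, 3.0], [0.6, 0.7]])
print(choice_accuracy(nll_c, nll_i))  # question 2 fails: a distractor is more likely
```

This makes concrete why per-question NLL can improve smoothly with scale while accuracy, a hard comparison against distractors, changes less predictably.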
Paperid:2406
Authors:Andrei Panferov · Jiale Chen · Soroush Tabesh · Mahdi Nikdan · Dan Alistarh
Abstract:
One main approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of obtaining even more accurate compressed models by directly training over such representations, i.e. Quantization-Aware Training (QAT), is still open: a recent study put the "optimal" bit-width at which models can be trained using QAT, while staying accuracy-competitive with standard FP16/BF16 precision, at 8-bit weights and activations. We significantly advance this state of the art via a new method called QuEST, which is Pareto-competitive with FP16, that is, it provides better accuracy at lower model size, while training models with weights and activations in 4 bits or fewer. Moreover, QuEST allows stable training with 1-bit weights and activations. QuEST achieves this by improving two key aspects of QAT methods: (1) accurate and fast quantization of the (continuous) distributions of weights and activations, via Hadamard normalization and MSE-optimal fitting; (2) a new trust gradient estimator based on the idea of explicitly minimizing the error between the noisy gradient computed over quantized states and the "true" (but unknown) full-precision gradient. Experiments on Llama-type architectures show that QuEST induces new, stable scaling laws across the entire range of hardware-supported precisions, and can be extended to sparse representations as well. Moreover, we provide GPU kernel support showing that the models produced by QuEST can be efficiently executed on current hardware.
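To give a feel for the MSE-optimal fitting step, here is a toy sketch of a symmetric uniform quantizer that searches for the scale minimizing reconstruction MSE. This is our own illustration of the general idea, not QuEST's implementation: the paper additionally applies Hadamard normalization before quantizing, which is omitted here.

```python
import numpy as np

def quantize_mse_optimal(x, bits=4, n_grid=100):
    """Symmetric uniform quantizer with a brute-force MSE-optimal scale.

    Instead of always scaling by max|x| (which wastes levels on outliers),
    we search a grid of candidate scales and keep the one with the lowest
    reconstruction mean-squared error.
    """
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit signed
    best_scale, best_err = None, np.inf
    for frac in np.linspace(0.1, 1.0, n_grid):
        scale = frac * np.max(np.abs(x)) / qmax
        q = np.clip(np.round(x / scale), -qmax - 1, qmax)
        err = np.mean((q * scale - x) ** 2)
        if err < best_err:
            best_err, best_scale = err, scale
    q = np.clip(np.round(x / best_scale), -qmax - 1, qmax)
    return q * best_scale, best_err

rng = np.random.default_rng(0)
x = rng.normal(size=256)
x_hat, err = quantize_mse_optimal(x, bits=4)
```

Because the grid includes the naive max-scale choice, the returned error is never worse than plain max-based quantization; for bell-shaped distributions it is typically strictly better.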
Paperid:2407
Authors:Armin Behnamnia · Gholamali Aminian · Alireza Aghaei · Chengchun Shi · Vincent Tan · Hamid R Rabiee
Abstract:
Off-policy learning and evaluation scenarios leverage logged bandit feedback datasets, which contain context, action, propensity score, and feedback for each data point. These scenarios face significant challenges due to high variance and poor performance with low-quality propensity scores and heavy-tailed reward distributions. We address these issues by introducing a novel estimator based on the log-sum-exponential (LSE) operator, which outperforms traditional inverse propensity score estimators. Our LSE estimator demonstrates variance reduction and robustness under heavy-tailed conditions. For off-policy evaluation, we derive upper bounds on the estimator's bias and variance. In the off-policy learning scenario, we establish bounds on the regret—the performance gap between our LSE estimator and the optimal policy—assuming a bounded $(1+\epsilon)$-th moment of the weighted reward. Notably, we achieve a convergence rate of $O(n^{-\epsilon/(1+\epsilon)})$ for the regret bounds, where $n$ is the number of training samples and $\epsilon\in[0,1]$. The theoretical analysis is complemented by comprehensive empirical evaluations in both off-policy learning and evaluation scenarios, confirming the practical advantages of our approach.
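As a rough sketch of the idea (the paper's exact parameterization may differ), an LSE-style estimator replaces the plain mean of importance-weighted rewards with a log-sum-exponential that, for a negative temperature, acts as a soft minimum and tempers heavy-tailed terms:

```python
import numpy as np

def ips_estimate(rewards, propensities, target_probs):
    # Standard inverse propensity score (IPS) estimator: mean of
    # importance-weighted rewards; unbiased but high-variance.
    w = target_probs / propensities
    return np.mean(w * rewards)

def lse_estimate(rewards, propensities, target_probs, lam=-1.0):
    # Hypothetical LSE-operator estimator: (1/lam) * log mean exp(lam * w * r).
    # With lam < 0 this is a soft-min, so occasional huge weighted rewards
    # (heavy tails) are damped rather than dominating the estimate.
    w = target_probs / propensities
    return np.log(np.mean(np.exp(lam * w * rewards))) / lam
```

By Jensen's inequality, for `lam < 0` and nonnegative rewards the LSE value never exceeds the IPS mean, which is the sense in which it is a conservative, variance-reducing surrogate.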
Paperid:2408
Authors:Shi Liu · Weijie Su · Xizhou Zhu · Wenhai Wang · Jifeng Dai
Abstract:
Recent advancements in Large Vision-Language Models (LVLMs) built upon Large Language Models have established aligning visual features with LLM representations as the dominant paradigm. However, inherited LLM architectural designs introduce suboptimal characteristics for multimodal processing. First, LVLMs exhibit a bimodal distribution in attention allocation, leading to the progressive neglect of central visual content as context expands. Second, conventional positional encoding schemes fail to preserve vital 2D structural relationships when processing dynamic high-resolution images. To address these limitations, we present CoMemo - a dual-path architecture featuring a decoupled memory path for context-independent visual processing, effectively alleviating visual information neglect. Additionally, we introduce RoPE-2D, a novel positional encoding mechanism that employs thumbnail-based positional aggregation to maintain 2D spatial awareness while mitigating remote decay in extended sequences. Evaluations across seven benchmarks, including long-context comprehension, multi-image reasoning, and visual question answering, demonstrate CoMemo's superior performance compared to conventional LVLM architectures.
Paperid:2409
Authors:Mingxuan Li · Junzhe Zhang · Elias Bareinboim
Abstract:
Reward shaping has been demonstrated to be an effective technique for accelerating the learning process of reinforcement learning (RL) agents. While successful in empirical applications, the design of a good shaping function is less well understood in principle and thus often relies on domain expertise and manual design. To overcome this limitation, we propose a novel automated approach for designing reward functions from offline data, possibly contaminated with unobserved confounding bias. We propose to use causal state value upper bounds calculated from offline datasets as a conservative optimistic estimation of the optimal state value, which is then used as state potentials in Potential-Based Reward Shaping (PBRS). When applying our shaping function to a model-free learner based on UCB principles, we show that it enjoys a better gap-dependent regret bound than the learner without shaping. To the best of our knowledge, this is the first gap-dependent regret bound for PBRS in model-free learning with online exploration. Simulations support the theoretical findings.
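PBRS itself has a standard closed form (Ng et al., 1999), which the abstract instantiates with causal state-value upper bounds as the potential. A generic sketch, where `phi` stands in for any potential function (how the paper computes it from offline data is not reproduced here):

```python
def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Potential-based reward shaping:
        r'(s, a, s') = r + gamma * Phi(s') - Phi(s).
    This form is known to preserve the optimal policy; the choice of the
    potential Phi (here an arbitrary callable) only affects learning speed.
    """
    return r + gamma * phi(s_next) - phi(s)

# Toy usage: a two-state chain where state 1 has higher potential.
phi = lambda s: {0: 0.0, 1: 10.0}[s]
print(shaped_reward(1.0, 0, 1, phi, gamma=0.9))  # transition toward high potential is boosted
```

With a constant potential the shaping term telescopes away, which is why only relative differences in `phi` matter.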
Paperid:2410
Authors:Zijie Qiu · Jiaqi Wei · Xiang Zhang · Sheng Xu · Kai Zou · Zhi Jin · ZhiQiang Gao · Nanqing Dong · Siqi Sun
Abstract:
De novo peptide sequencing is a critical task in proteomics research. However, the performance of current deep learning-based methods is limited by the inherent complexity of mass spectrometry data and the heterogeneous distribution of noise signals, leading to data-specific biases. We present RankNovo, the first deep reranking framework that enhances de novo peptide sequencing by leveraging the complementary strengths of multiple sequencing models. RankNovo employs a list-wise reranking approach, modeling candidate peptides as multiple sequence alignments and utilizing axial attention to extract informative features across candidates. Additionally, we introduce two new metrics, PMD (Peptide Mass Deviation) and RMD (Residual Mass Deviation), which offer delicate supervision by quantifying mass differences between peptides at both the sequence and residue levels. Extensive experiments demonstrate that RankNovo not only surpasses its individual base models, which are used to generate training candidates for reranking pre-training, but also sets new state-of-the-art results on de novo sequencing benchmarks. Moreover, RankNovo exhibits strong zero-shot generalization to unseen models—those whose generations were not exposed during training—highlighting its robustness and potential as a universal reranking framework for peptide sequencing. Our work presents a novel reranking strategy that fundamentally challenges existing single-model paradigms and advances the frontier of accurate de novo peptide sequencing. Our source code is provided at an anonymous link.
Paperid:2411
Authors:Matthew Niedoba · Berend Zwartsenberg · Kevin Murphy · Frank Wood
Abstract:
We propose a simple, training-free mechanism which explains the generalization behaviour of diffusion models. By comparing pre-trained diffusion models to their theoretically optimal empirical counterparts, we identify a shared local inductive bias across a variety of network architectures. From this observation, we hypothesize that network denoisers generalize through localized denoising operations, as these operations approximate the training objective well over much of the training distribution. To validate our hypothesis, we introduce novel denoising algorithms which aggregate local empirical denoisers to replicate network behaviour. Comparing these algorithms to network denoisers across forward and reverse diffusion processes, our approach exhibits consistent visual similarity to neural network outputs, with lower mean squared error than previously proposed methods.
Paperid:2412
Authors:Junkang Liu · Yuanyuan Liu · Fanhua Shang · Hongying Liu · Jin Liu · Wei Feng
Abstract:
The generalization capability of federated learning (FL) algorithms such as FedSAM is crucial for real-world applications. In this paper, we revisit the generalization problem in FL and investigate the impact of data heterogeneity on FL generalization. We find that FedSAM usually performs worse than FedAvg in the case of highly heterogeneous data, and we thus propose a novel and effective federated learning algorithm with Stochastic Weight Averaging (called \texttt{FedSWA}), which aims to find flatter minima in the setting of highly heterogeneous data. Moreover, we introduce a new momentum-based stochastic controlled weight averaging FL algorithm (\texttt{FedMoSWA}), which is designed to better align local and global models. Theoretically, we provide both convergence analysis and generalization bounds for \texttt{FedSWA} and \texttt{FedMoSWA}. We also prove that the optimization and generalization errors of \texttt{FedMoSWA} are smaller than those of its counterparts, including FedSAM and its variants. Empirically, experimental results on CIFAR10/100 and Tiny ImageNet demonstrate the superiority of the proposed algorithms compared to their counterparts.
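Stochastic Weight Averaging maintains a running average of model weights visited along training; a minimal sketch of the classic SWA update rule follows (how FedSWA integrates this into the federated rounds is the paper's contribution and is not reproduced here):

```python
def swa_update(swa_weights, new_weights, n_averaged):
    """Classic SWA running-average update over flat lists of parameters:
        w_swa <- (w_swa * n + w_new) / (n + 1)
    Averaging weights from different points of training tends to land in
    flatter regions of the loss surface than any single iterate.
    """
    return [(sw * n_averaged + w) / (n_averaged + 1)
            for sw, w in zip(swa_weights, new_weights)]

# Toy usage with scalar "models": average of 0.0 and 2.0 is 1.0.
w = swa_update([0.0], [2.0], n_averaged=1)
```

The same update works unchanged on tensors; the flat-minima intuition is why averaging is attractive under high data heterogeneity, where local minima of clients disagree.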
Paperid:2413
Authors:Peter Holderrieth · Michael Albergo · Tommi Jaakkola
Abstract:
We propose an algorithm to sample from discrete distributions known up to normalization by learning the rate matrix of a continuous-time Markov chain (CTMC). The method can be seen as a continuous-time formulation of annealed importance sampling and sequential Monte Carlo methods, extended so that the variance of the importance weights is offset by the inclusion of the CTMC. To derive these importance weights, we introduce a set of Radon-Nikodym derivatives of CTMCs over their path measures. Because the computation of these weights is intractable with standard neural network parameterizations of rate matrices, we devise a new compact representation for rate matrices via what we call locally equivariant functions. To parameterize them, we introduce a family of locally equivariant multilayer perceptrons, attention, and convolutional networks, and provide an approach to make deep networks that preserve the local equivariance. This property allows us to propose a scalable training algorithm for the rate matrix such that the variance of the importance weights associated with the CTMC is minimal. We demonstrate the efficacy of our method on problems in statistical physics.
Paperid:2414
Authors:Linjian Meng · Youzhi Zhang · Wubing Chen · Wenbin Li · Tianpei Yang · Yang Gao
Abstract:
Nash equilibrium (NE) plays an important role in game theory. Efficiently computing an NE in normal-form games (NFGs) is challenging due to the problem's complexity and non-convexity. Machine Learning (ML), the cornerstone of modern artificial intelligence, has demonstrated remarkable empirical performance across various applications, including non-convex optimization. To leverage non-convex stochastic optimization techniques from ML for approximating an NE, various loss functions have been proposed. Among these, only one loss function is unbiased, allowing for unbiased estimation under sampled play. Unfortunately, this loss function suffers from high variance, which degrades the convergence rate. To improve the convergence rate by mitigating the high variance associated with the existing unbiased loss function, we propose a novel surrogate loss function named Nash Advantage Loss (NAL). NAL is theoretically proven to be unbiased and exhibits significantly lower variance than the existing unbiased loss function. Experimental results demonstrate that the algorithm minimizing NAL achieves a significantly faster empirical convergence rate than other algorithms, while also reducing the variance of the estimated loss value by several orders of magnitude.
Paperid:2415
Authors:Srijith Nair · Michael Lin · Peizhong Ju · Amirreza Talebi · Elizabeth Bentley · Jia (Kevin) Liu
Abstract:
Collaborative training methods like Federated Learning (FL) and Split Learning (SL) enable distributed machine learning without sharing raw data. However, FL assumes clients can train entire models, which is infeasible for large-scale models. In contrast, while SL alleviates the client memory constraint in FL by offloading most training to the server, it increases network latency due to its sequential nature. Other methods address the conundrum by using local loss functions for parallel client-side training to improve efficiency, but they lack server feedback and potentially suffer poor accuracy. We propose FSL-SAGE (Federated Split Learning via Smashed Activation Gradient Estimation), a new federated split learning algorithm that estimates server-side gradient feedback via auxiliary models. These auxiliary models periodically adapt to emulate server behavior on local datasets. We show that FSL-SAGE achieves a convergence rate of $\mathcal{O}(1/\sqrt{T})$, where $T$ is the number of communication rounds. This result matches FedAvg, while significantly reducing communication costs and client memory requirements. Our empirical results also verify that it outperforms existing state-of-the-art FSL methods, offering both communication efficiency and accuracy.
Paperid:2416
Authors:Xianhang Li · Haoqin Tu · Mude Hui · Zeyu Wang · Bingchen Zhao · Junfei Xiao · Sucheng Ren · Jieru Mei · Qing Liu · Huangjie Zheng · Yuyin Zhou · Cihang Xie
Abstract:
Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching the textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this gap with a community effort, leveraging the powerful and $\textit{open-sourced}$ LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption ~1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe an average of 3.1% enhanced zero-shot performance across four cross-modal retrieval tasks using a mixed set of the original and our captions. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially in following complex queries.
Paperid:2417
Authors:Zebin Wang · Menghan Lin · Bolin Shen · Ken Anderson · Molei Liu · Tianxi Cai · Yushun Dong
Abstract:
Graph Neural Networks (GNNs) have demonstrated remarkable utility across diverse applications, and their growing complexity has made Machine Learning as a Service (MLaaS) an essential tool for scalable GNN deployment. However, MLaaS exposes GNNs to severe security threats, particularly model extraction attacks (MEAs), where adversaries strategically query a target model to construct a high-fidelity replica. To better understand the severity of GNN-based MEAs and explore potential countermeasures, we design a node querying strategy for a highly practical yet underexplored scenario, where bulk query batches are prohibited, and the limited availability of initialization nodes prevents direct model extraction. Our approach incrementally refines the selection of nodes over multiple learning cycles, leveraging historical information to improve extraction efficiency. Extensive experiments on benchmark graph datasets demonstrate that our approach achieves superior accuracy, fidelity, and F1 score under strict query-size constraints, outperforming baselines in comparable settings. These results highlight the vulnerability of real-world GNNs to MEAs despite stringent resource constraints, emphasizing the urgent need for robust defense mechanisms against such threats. Our implementation is available at \url{https://anonymous.4open.science/r/MEA-CEGA/README.md}.
Paperid:2418
Authors:Joseph Paillard · Angel REYERO LOBO · Vitaliy Kolodyazhniy · Thirion Bertrand · Denis-Alexander Engemann
Abstract:
Causal machine learning (ML) promises to provide powerful tools for estimating individual treatment effects. While causal methods have placed some emphasis on heterogeneity in treatment response, it is of paramount importance to clarify the nature of this heterogeneity by highlighting which variables drive it. We propose PermuCATE, an algorithm based on the Conditional Permutation Importance (CPI) method, for statistically rigorous global variable importance assessment in the estimation of the Conditional Average Treatment Effect (CATE). Theoretical analysis of the finite sample regime and empirical studies show that PermuCATE has lower variance than the Leave-One-Covariate-Out (LOCO) method and provides a reliable measure of variable importance. This property increases statistical power, which is crucial for causal inference applications with finite sample sizes. We empirically demonstrate the benefits of PermuCATE in simulated and real datasets, including complex settings with high-dimensional, correlated variables.
Paperid:2419
Authors:Tan Songbai · Yao Shu · Xuerui Qiu · Gang Xu · Linrui Xu · Xiangyu Xu · HUIPING ZHUANG · Ming Li · Fei Yu
Abstract:
Invisible watermarking is widely used to protect digital images from unauthorized use. Accurate assessment of watermarking efficacy is crucial for advancing algorithmic development. However, existing statistical metrics, such as PSNR, rely on access to original images, which are often unavailable in text-driven generative watermarking, and fail to capture critical aspects of watermarking, particularly visibility. More importantly, these metrics fail to account for potential corruption of image content. To address these limitations, we propose WMarkGPT, the first multimodal large language model (MLLM) specifically designed for comprehensive watermarked image understanding without accessing original images. WMarkGPT not only predicts watermark visibility but also generates detailed textual descriptions of the watermark's location, content, and impact on image semantics, enabling a more nuanced interpretation of watermarked images. To tackle the challenges of precise location description and of understanding images with vastly different content, we construct three visual question-answering (VQA) datasets: an object location-aware dataset, a synthetic watermarking dataset, and a real watermarking dataset. We introduce a meticulously designed three-stage learning pipeline to progressively equip WMarkGPT with the necessary abilities. Extensive experiments on synthetic and real watermarking QA datasets demonstrate that WMarkGPT outperforms existing MLLMs, achieving significant improvements in visibility prediction and content description. The datasets and code will be made publicly available. More details are included in the appendix.
Paperid:2420
Authors:Wei Fan · Kejiang Chen · Chang Liu · Weiming Zhang · Nenghai Yu
Abstract:
The rapid advancement of speech generation models has heightened privacy and security concerns related to voice cloning (VC). Recent studies have investigated disrupting unauthorized voice cloning by introducing adversarial perturbations. However, determined attackers can mitigate these protective perturbations and successfully execute VC. In this study, we conduct the first systematic evaluation of these protective perturbations against VC under realistic threat models that include perturbation purification. Our findings reveal that while existing purification methods can neutralize a considerable portion of the protective perturbations, they still lead to distortions in the feature space of VC models, which degrades the performance of VC. From this perspective, we propose a novel two-stage purification method: (1) Purify the perturbed speech; (2) Refine it using phoneme guidance to align it with the clean speech distribution. Experimental results demonstrate that our method outperforms state-of-the-art purification methods in disrupting VC defenses. Our study reveals the limitations of adversarial perturbation-based VC defenses and underscores the urgent need for more robust solutions to mitigate the security and privacy risks posed by VC.
Paperid:2421
Authors:Vaishnavh Nagarajan · Chen Wu · Charles Ding · Aditi Raghunathan
Abstract:
In open-ended tasks --- such as designing word problems or discovering novel proofs --- the goal is not only correctness but also diversity and originality of thought. Often, this requires a far-sighted leap of thought that we argue is misaligned with the objective of next-token prediction (NTP). To formulate this intuition, we design a suite of minimal algorithmic tasks loosely based on real-world creative endeavors. Concretely, our tasks require an open-ended stochastic planning step that (a) discovers new connections in a knowledge graph (inspired by word-play, humor or drawing analogies) or (b) constructs new patterns (inspired by constructing word problems, puzzles or mysteries). We then conceptually and empirically argue how NTP leads to myopic learning and excessive memorization, limiting its ability to generate novel solutions. In contrast, we find that multi-token approaches, namely teacherless training and diffusion models, can overcome these limitations and comparatively excel on our algorithmic test-bed. Orthogonally, we find that creativity in our tasks is greatly improved by training with a random hash prefix (which we dub as "hash-conditioning"). Thus our work offers a principled, minimal test-bed for studying open-ended forms of intelligence and also a new angle to take a more serious interest in the paradigm of multi-token prediction.
Paperid:2422
Authors:Zhengrui Ma · Yang Feng · Min zhang
Abstract:
Streaming generation models are increasingly utilized across fields, with the Transducer architecture being very popular in industrial applications. However, its input-synchronous decoding mechanism presents challenges in tasks requiring non-monotonic alignments, such as simultaneous translation. In this research, we address this issue by tightly integrating the Transducer's decoding with the history of the input stream via a learnable monotonic attention mechanism. Our approach leverages the forward-backward algorithm to infer the posterior probability of alignments between the predictor states and input timestamps, which is then used to estimate the context representations of monotonic attention during training. This allows Transducers to adaptively adjust the scope of attention based on their predictions, avoiding the need to enumerate the exponentially large alignment space. Extensive experiments demonstrate that our MonoAttn-Transducer significantly enhances the handling of non-monotonic alignments in streaming generation, offering a robust solution to tackle complex streaming generation tasks.
Paperid:2423
Authors:Fanfei Li · Thomas Klein · Wieland Brendel · Robert Geirhos · Roland S. Zimmermann
Abstract:
Out-of-distribution (OOD) robustness is a desired property of computer vision models. Improving model robustness requires high-quality signals from robustness benchmarks to quantify progress. While various benchmark datasets such as ImageNet-C were proposed in the ImageNet era, most ImageNet-C corruption types are no longer OOD relative to today's large datasets scraped from the web, which already contain common corruptions such as blur or JPEG compression artifacts. Consequently, these standard benchmarks are no longer well-suited for evaluating OOD robustness in the era of web-scale datasets. Indeed, recent models show saturating scores on ImageNet-era OOD benchmarks, indicating that it is unclear whether models trained on web-scale datasets truly become better at OOD generalization or whether they have simply been exposed to the test distortions during training. To address this, we here introduce LAION-C as a benchmark alternative for ImageNet-C. LAION-C consists of six novel distortion types specifically designed to be OOD, even for web-scale datasets such as LAION. In a comprehensive evaluation of state-of-the-art models, we find that the LAION-C dataset poses significant challenges to contemporary models, including MLLMs such as Gemini and GPT-4o. We additionally conducted a psychophysical experiment to evaluate the difficulty of our proposed corruptions for human observers, enabling a comparison of models to lab-quality human robustness data. We observe a paradigm shift in OOD generalization: from humans outperforming models, to the best models now matching or outperforming the best human observers.
Paperid:2424
Authors:Qin Guo · Ailing Zeng · Dongxu Yue · Ceyuan Yang · Yang Cao · Hanzhong Guo · SHEN FEI · Wei Liu · Xihui Liu · Dan Xu
Abstract:
Although significant advancements have been achieved in keypoint-guided Text-to-Image diffusion models, existing mainstream keypoint-guided models encounter challenges in controlling the generation of more general non-rigid objects beyond humans (e.g., animals). Moreover, it is difficult to generate multiple overlapping humans and animals based solely on keypoint controls. These challenges arise from two main aspects: the inherent limitations of existing controllable methods and the lack of suitable datasets. First, we design a DiT-based framework, named UniMC, to explore unifying controllable multi-class image generation. UniMC integrates instance- and keypoint-level conditions into compact tokens, incorporating attributes such as class, bounding box, and keypoint coordinates. This approach overcomes the limitations of previous methods that struggled to distinguish instances and classes due to their reliance on skeleton images as conditions. Second, we propose HAIG-2.9M, a large-scale, high-quality, and diverse dataset designed for keypoint-guided human and animal image generation. HAIG-2.9M includes 786K images with 2.9M instances. This dataset features extensive annotations such as keypoints, bounding boxes, and fine-grained captions for both humans and animals, along with rigorous manual inspection to ensure annotation accuracy. Extensive experiments demonstrate the high quality of HAIG-2.9M and the effectiveness of UniMC, particularly in heavy occlusions and multi-class scenarios.
Paperid:2425
Authors:Yinpeng Chen · Dongdong Chen · Xiyang Dai · Mengchen Liu · Yinan Feng · Youzuo Lin · Lu Yuan · Zicheng Liu
Abstract:
In this paper, we empirically reveal an invariance over images: images share a set of one-way wave equations with latent speeds. Each image is uniquely associated with a solution to these wave equations, allowing for its reconstruction with high fidelity from an initial condition. We demonstrate this using an intuitive encoder-decoder framework where each image is encoded into its corresponding initial condition (a single vector). Subsequently, the initial condition undergoes a specialized decoder, transforming the one-way wave equations into a first-order norm+linear autoregressive process. This process propagates the initial condition along the x and y directions, generating a high-resolution feature map (up to the image resolution), followed by a few convolutional layers to reconstruct image pixels. The revealed invariance, rooted in the shared wave equations, offers a fresh perspective for comprehending images, establishing a promising avenue for further exploration.
Paperid:2426
Authors:Lang Pu · Jingjing Gu · Chao Lin · Xinyi Huang
Abstract:
Secure Aggregation (SA) is a cornerstone of Federated Learning (FL), ensuring that user updates remain hidden from servers. Although advancements such as the Flamingo protocol (S\&P'23) improve multi-round aggregation and communication efficiency, several key challenges remain: scalability issues with dynamic user participation, vulnerability to Model Inconsistency Attacks (MIA), and a lack of verifiability for server-side aggregation results. To address these issues, we propose $\textit{Janus}$, a generic SA scheme based on our dual-server architecture. Janus ensures security against up to $n-2$ colluding clients (where $n$ is the total client count), which prevents privacy breaches for non-colluders. Additionally, Janus is model-independent, ensuring applicability across any FL model without specific adaptations. It simplifies user participation with minimal setup, enhances privacy by mitigating MIA through task separation between two servers, and improves verifiability with a new cryptographic primitive, Separable Homomorphic Commitment, which allows clients to efficiently validate aggregation correctness. Extensive experiments show that Janus not only significantly enhances security but also reduces per-client communication and computation overhead from logarithmic to constant scale, with negligible impact on model performance.
Paperid:2427
Authors:Huafeng Liu · Yiran Fu · Hui Li · Shuyang Lin · Jingyue Shi · Deqiang Ouyang · Liping Jing · Jian Yu
Abstract:
Neural processes (NPs) are a promising paradigm for enabling skill transfer across tasks with the aid of the distribution of functions. Previous NPs employ the empirical risk minimization principle in optimization. However, the fast adaptation ability can vary widely across tasks, and the worst-case fast adaptation can be catastrophic in risk-sensitive tasks. To achieve robust neural process modeling, we consider the problem of training models in a risk-averse manner, which can control the worst fast adaptation cases at a certain probabilistic level. By transforming the risk minimization problem into a two-level finite-sum minimax optimization problem, we can solve it via a double-looped stochastic mirror prox algorithm with a task-aware variance reduction mechanism that samples across all tasks. The mirror prox technique ensures better handling of complex constraint sets and non-Euclidean geometries, making the optimization adaptable to various tasks. The final solution, obtained by aggregating prox points with adaptive learning rates, enables a stable and high-quality output. The proposed learning strategy can work flexibly with various NPs and achieves a less biased approximation with a theoretical guarantee. To illustrate the superiority of the proposed model, we perform experiments on both synthetic and real-world data, and the results demonstrate that our approach not only helps to achieve more accurate performance but also improves model robustness.
Paperid:2428
Authors:Wanjin Feng · Xingyu Gao · Wenqian Du · Hailong Shi · Peilin Zhao · Pengcheng Wu · Chunyan Miao
Abstract:
Spiking Neural Networks (SNNs) often suffer from high time complexity $O(T)$ due to the sequential processing of $T$ spikes, making training computationally expensive. In this paper, we propose a novel Fixed-point Parallel Training (FPT) method to accelerate SNN training without modifying the network architecture or introducing additional assumptions. FPT reduces the time complexity to $O(K)$, where $K$ is a small constant (usually $K=3$), by using a fixed-point iteration form of Leaky Integrate-and-Fire (LIF) neurons for all $T$ timesteps. We provide a theoretical convergence analysis of FPT and demonstrate that existing parallel spiking neurons can be viewed as special cases of our approach. Experimental results show that FPT effectively simulates the dynamics of original LIF neurons, significantly reducing computational time without sacrificing accuracy. This makes FPT a scalable and efficient solution for real-world applications, particularly for long-duration simulations.
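The contrast between sequential and fixed-point simulation can be sketched in a few lines of numpy. This is an illustrative toy, not the authors' FPT implementation: the hard-reset LIF form, the constants, and the iteration count are assumptions.

```python
import numpy as np

def lif_sequential(I, lam=0.9, theta=1.0):
    """Standard sequential LIF simulation: T dependent steps, O(T)."""
    T = len(I)
    s = np.zeros(T)
    v_prev = 0.0
    for t in range(T):
        v = lam * v_prev + I[t]          # leak + input current
        s[t] = float(v >= theta)         # spike if threshold is crossed
        v_prev = v * (1.0 - s[t])        # hard reset after a spike
    return s

def lif_fixed_point(I, lam=0.9, theta=1.0, K=5):
    """Fixed-point form: update the whole T-step trajectory in parallel,
    iterating K times until it is self-consistent."""
    T = len(I)
    v = np.zeros(T)
    for _ in range(K):
        s = (v >= theta).astype(float)
        prev = np.concatenate(([0.0], (v * (1.0 - s))[:-1]))  # shifted, reset potentials
        v = lam * prev + I               # one parallel update over all timesteps
    return (v >= theta).astype(float)

I = np.array([0.5, 0.7, 0.9, 0.2, 1.2, 0.1])
s_seq = lif_sequential(I)
s_par = lif_fixed_point(I, K=6)
```

In this naive toy the iteration needs more than $K=3$ passes to become self-consistent; the paper's specific formulation and convergence analysis determine the appropriate $K$ in practice.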
Paperid:2429
Authors:Lufei Liu · Tor Aamodt
Abstract:
Graphics rendering applications increasingly leverage neural networks to improve frame rates and the quality of rendered images within real-time latency constraints to support interactivity. Per-frame tasks such as denoising, supersampling, and frame extrapolation introduce a computational burden, creating resource contention in these latency-sensitive workloads. However, the temporal coherence inherent in these workloads presents an opportunity to reduce the overall cost of rendering operations by reusing computation results from previous frames. Recent work has shown that caching intermediate features for reuse in subsequent inferences is an effective method to reduce latency in diffusion models. We extend this idea and explore different caching policies to optimize trade-offs between quality and performance for rendering workloads. By reducing the latency of the neural network inferences, we can allocate more resources to high-quality rendering, such as ray tracing, improving both the frame rate and final image quality. Our method can be applied to a variety of encoder-decoder style networks commonly found in rendering pipelines. Experimental results show that we can achieve a 1.4$\times$ speedup on average with negligible quality loss in three real-time rendering tasks. We outperform other methods exploiting frame similarity in these tasks and can further enhance their performance when combined.
Paperid:2430
Authors:Narmeen Oozeer · Dhruv Nathawani · Nirmalendu Prakash · Michael Lan · Abir HARRASSE · Amir Abdullah
Abstract:
The study of representation universality in AI models reveals growing convergence across domains, modalities, and architectures. However, this representation universality remains largely unexploited. We bridge this gap by demonstrating that safety interventions can be transferred between models through learned mappings of their shared activation spaces. We demonstrate this approach on two well-established AI safety tasks: backdoor removal and refusal of harmful prompts, showing successful transfer of steering vectors that alter the models' outputs in a predictable way. Additionally, we propose a new task, corrupted capabilities, where models are fine-tuned to embed knowledge tied to a backdoor. This tests their ability to separate useful skills from backdoors, reflecting real-world challenges. Extensive experiments across the LLaMA, Qwen, and Gemma model families show that our method enables using smaller models to efficiently align larger ones. Furthermore, we demonstrate that autoencoder mappings between base and fine-tuned models can serve as reliable ``lightweight safety switches'', allowing dynamic toggling between model behaviors.
Paperid:2431
Authors:Enming Liang · Minghua Chen
Abstract:
Neural networks (NNs) have emerged as promising tools for solving constrained optimization problems in real time. However, ensuring constraint satisfaction for NN-generated solutions remains challenging due to prediction errors. Existing methods to ensure NN feasibility either suffer from high computational complexity or are limited to specific constraint types. We present Bisection Projection, an efficient approach to ensure NN solution feasibility for optimization over general compact sets with non-empty interiors. Our method comprises two key components: (i) a dedicated NN (called IPNN) that predicts interior points (IPs) with low eccentricity, which naturally accounts for approximation errors; (ii) a bisection algorithm that leverages these IPs to recover solution feasibility when initial NN solutions violate constraints. We establish theoretical guarantees by providing sufficient conditions for IPNN feasibility and proving bounded optimality loss of the bisection operation under IP predictions. Extensive evaluations on real-world non-convex problems demonstrate that Bisection Projection achieves superior feasibility and computational efficiency compared to existing methods, while maintaining comparable optimality gaps.
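The bisection component admits a simple sketch: given an interior point and an infeasible NN prediction, bisect along the segment between them to recover a feasible point close to the prediction. The following toy (a unit-ball constraint set, an oracle feasibility test, and all names are illustrative assumptions, not the paper's IPNN pipeline) shows the idea:

```python
import numpy as np

def bisection_projection(x_pred, z_interior, feasible, tol=1e-8, max_iter=60):
    """Recover feasibility by bisecting on alpha along the segment
    (1 - alpha) * z_interior + alpha * x_pred."""
    if feasible(x_pred):
        return x_pred                     # prediction already satisfies the constraints
    lo, hi = 0.0, 1.0                     # alpha=0: interior (feasible); alpha=1: infeasible
    while hi - lo > tol and max_iter > 0:
        mid = 0.5 * (lo + hi)
        point = (1.0 - mid) * z_interior + mid * x_pred
        if feasible(point):
            lo = mid                      # move toward the prediction
        else:
            hi = mid                      # pull back toward the interior point
        max_iter -= 1
    return (1.0 - lo) * z_interior + lo * x_pred

# Toy constraint set: the unit ball, with the origin as interior point.
feasible = lambda p: np.linalg.norm(p) <= 1.0
x = bisection_projection(np.array([2.0, 0.0]), np.zeros(2), feasible)
```

Here the infeasible prediction `[2, 0]` is pulled back to (approximately) the boundary point `[1, 0]`, the feasible point on the segment closest to the prediction.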
Paperid:2432
Authors:Alex Velez-Arce · Marinka Zitnik
Abstract:
Existing biomedical benchmarks do not provide end-to-end infrastructure for training, evaluation, and inference of models that integrate multimodal biological data and a broad range of machine learning tasks in therapeutics. We present PyTDC, an open-source machine-learning platform providing streamlined training, evaluation, and inference software for multimodal biological AI models. PyTDC unifies distributed, heterogeneous, continuously updated data sources and model weights, and standardizes benchmarking and inference endpoints. This paper discusses the components of PyTDC's architecture and presents, to our knowledge, the first-of-its-kind case study on the introduced single-cell drug-target nomination ML task. We find that state-of-the-art methods in graph representation learning and domain-specific methods from graph theory perform poorly on this task. Although we identify a context-aware geometric deep learning method that outperforms the evaluated SoTA and domain-specific baseline methods, the model is unable to generalize to unseen cell types or incorporate additional modalities, highlighting PyTDC's capacity to facilitate an exciting avenue of research: developing multimodal, context-aware foundation models for open problems in biomedical AI.
Paperid:2433
Authors:Qinsi Wang · Hancheng Ye · Ming-Yu Chung · Yudong Liu · Yueqian Lin · Martin Kuo · Mingyuan Ma · Jianyi Zhang · Yiran Chen
Abstract:
Vision-Language Models (VLMs) excel across diverse tasks but suffer from high inference costs in time and memory. Token sparsity mitigates inefficiencies in token usage, while neuron sparsity reduces high-dimensional computations, both offering promising solutions to enhance efficiency. Recently, these two sparsity paradigms have evolved largely in parallel, fostering the prevailing assumption that they function independently. However, a fundamental yet underexplored question remains: do they truly operate in isolation, or is there a deeper underlying interplay that has yet to be uncovered? In this paper, we conduct the first comprehensive investigation into this question. By introducing and analyzing the matching mechanism between Core Neurons and Core Tokens, we find that the key neurons and tokens for inference mutually influence and reinforce each other. Building on this insight, we propose CoreMatching, a co-adaptive sparse inference framework that leverages the synergy between token and neuron sparsity to enhance inference efficiency. Through theoretical analysis and efficiency evaluations, we demonstrate that the proposed method surpasses state-of-the-art baselines on ten image understanding tasks and three hardware devices. Notably, on the NVIDIA Titan Xp, it achieves a 5$\times$ FLOPs reduction and a 10$\times$ overall speedup.
Paperid:2434
Authors:Atsutoshi Kumagai · Tomoharu Iwata · Hiroshi Takahashi · Taishi Nishiyama · Kazuki Adachi · Yasuhiro Fujiwara
Abstract:
Maximizing the area under the receiver operating characteristic curve (AUC) is a standard approach to imbalanced binary classification tasks. Existing AUC maximization methods typically assume that training and test distributions are identical. However, this assumption is often violated due to {\it a covariate shift}, where the input distribution can vary but the conditional distribution of the class label given the input remains unchanged. Importance weighting is a common approach to the covariate shift, which minimizes the test risk with importance-weighted training data; however, it cannot maximize the AUC. In this paper, to achieve this, we theoretically derive two estimators of the test AUC risk under the covariate shift by using positive and unlabeled (PU) data from the training distribution and unlabeled data from the test distribution. Our first estimator is calculated from importance-weighted PU data in the training distribution, and the second from importance-weighted positive data in the training distribution and unlabeled data in the test distribution. We train classifiers by minimizing a weighted sum of the two AUC risk estimators that approximates the test AUC risk. Unlike existing importance weighting, our method requires neither negative labels nor class priors. We show the effectiveness of our method on six real-world datasets.
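For intuition on the importance-weighting side, a generic importance-weighted pairwise AUC risk over labeled positive and negative samples can be sketched as follows. This is a hedged illustration of standard importance weighting applied to the AUC, not the paper's PU-based estimators (which notably avoid negative labels):

```python
import numpy as np

def iw_auc_risk(scores_pos, scores_neg, w_pos, w_neg):
    """Importance-weighted pairwise AUC risk: the weighted fraction of
    positive-negative pairs that the scorer ranks incorrectly."""
    diff = scores_pos[:, None] - scores_neg[None, :]        # pairwise margins
    loss = (diff < 0).astype(float) + 0.5 * (diff == 0)     # 0-1 ranking loss, ties count 1/2
    weight = w_pos[:, None] * w_neg[None, :]                # product of per-sample importance weights
    return float((weight * loss).sum() / weight.sum())

# With uniform weights this reduces to 1 - AUC: one of the four pairs is misranked.
risk = iw_auc_risk(np.array([0.9, 0.4]), np.array([0.5, 0.1]),
                   np.ones(2), np.ones(2))
```

In the covariate-shift setting the weights would be density ratios $w(x) = p_{\text{test}}(x)/p_{\text{train}}(x)$ evaluated at each training sample.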
Paperid:2435
Authors:Joseph Lazzaro · Ciara Pike-Burke
Abstract:
Piecewise constant functions describe a variety of real-world phenomena in domains ranging from chemistry to manufacturing. In practice, it is often necessary to confidently identify the locations of the abrupt changes in these functions as quickly as possible. For this, we introduce a fixed-confidence piecewise constant bandit problem, in which we sequentially query points in the domain and receive noisy evaluations of the function under bandit feedback. We provide instance-dependent lower bounds for the complexity of change point identification in this problem. These lower bounds illustrate that an optimal method should focus its sampling efforts adjacent to each of the change points, and that the number of samples around each change point should be inversely proportional to the magnitude of the change. Building on this, we devise a simple and computationally efficient variant of Track-and-Stop and prove that it is asymptotically optimal in many regimes. We support our theoretical findings with experimental results in synthetic environments demonstrating the efficiency of our method.
Paperid:2436
Authors:Vladimir Kostic · Karim Lounici · Hélène Halconruy · Timothée Devergne · Pietro Novelli · Massimiliano Pontil
Abstract:
Markov processes serve as a universal model for many real-world random processes. This paper presents a data-driven approach for learning these models through the spectral decomposition of the infinitesimal generator (IG) of the Markov semigroup. The unbounded nature of IGs complicates traditional methods such as vector-valued regression and Hilbert-Schmidt operator analysis. Existing techniques, including physics-informed kernel regression, are computationally expensive and limited in scope, with no recovery guarantees for transfer operator methods when the time-lag is small. We propose a novel method that leverages the IG's resolvent, characterized by the Laplace transform of transfer operators. This approach is robust to time-lag variations, ensuring accurate eigenvalue learning even for small time-lags. Our statistical analysis applies to a broader class of Markov processes than current methods while reducing computational complexity from quadratic to linear in the state dimension. Finally, we illustrate the behaviour of our method in two experiments.
Paperid:2437
Authors:Li-Wei Chen · Takuya Higuchi · Zakaria Aldeneh · Ahmed Hussen Abdelaziz · Alexander Rudnicky
Abstract:
The success of large language models in text processing has inspired their adaptation to speech modeling. However, due to the continuous and complex nature of speech, it is often discretized into tokens. Among these, speech tokens derived from self-supervised speech models (e.g., HuBERT) typically focus on the linguistic aspects of speech and neglect prosodic information. As a result, autoregressive models trained on these tokens may generate speech with reduced naturalness. Existing approaches have attempted to address this limitation by adding pitch features to these semantic tokens prior to autoregressive modeling. However, pitch alone cannot fully represent the range of paralinguistic attributes, and selecting the right features requires careful hand-engineering. To overcome this challenge, we propose an end-to-end variational approach that automatically learns to encode these continuous speech attributes to enhance the semantic tokens. Our approach eliminates the need for manual extraction and selection of paralinguistic features. Moreover, we demonstrate that our proposed approach produces speech continuations that are much preferred by human raters.
Paperid:2438
Authors:Yazid Janati el idrissi · Badr MOUFAD · Mehdi Qassime · Alain Oliviero Durmus · Eric Moulines · Jimmy Olsson
Abstract:
Denoising diffusion models have driven significant progress in the field of Bayesian inverse problems. Recent approaches use pretrained diffusion models as priors to solve a wide range of such problems, only leveraging inference-time compute and thereby eliminating the need to retrain task-specific models on the same dataset. To approximate the posterior of a Bayesian inverse problem, a diffusion model samples from a sequence of intermediate posterior distributions, each with an intractable likelihood function. This work proposes a novel mixture approximation of these intermediate distributions. Since direct gradient-based sampling of these mixtures is infeasible due to intractable terms, we propose a practical method based on Gibbs sampling. We validate our approach through extensive experiments on image inverse problems, utilizing both pixel- and latent-space diffusion priors, as well as on source separation with an audio diffusion model.
Paperid:2439
Authors:Tim Pearce · Tabish Rashid · David Bignell · Raluca Georgescu · Sam Devlin · Katja Hofmann
Abstract:
The performance of embodied agents has been shown to improve by increasing model parameters, dataset size, and compute. This has been demonstrated in domains from robotics to video games, when generative learning objectives on offline datasets (pretraining) are used to model an agent's behavior (imitation learning) or their environment (world modeling). This paper characterizes the role of scale in these tasks more precisely. Going beyond the simple intuition that `bigger is better', we show that the same types of power laws found in language modeling also arise in world modeling and imitation learning (e.g., between loss and optimal model size). However, the coefficients of these laws are heavily influenced by the tokenizer, task \& architecture -- this has important implications for the optimal sizing of models and data.
Paperid:2440
Authors:Xiaoyu Wu · Jiaru Zhang · Steven Wu
Abstract:
Diffusion Models (DMs) have evolved into advanced image generation tools, especially for few-shot fine-tuning where a pretrained DM is fine-tuned on a small set of images to capture specific styles or objects. Many people upload these personalized checkpoints online, fostering communities such as Civitai and HuggingFace. However, model owners may overlook the potential risks of data leakage by releasing their fine-tuned checkpoints. Moreover, concerns regarding copyright violations arise when unauthorized data is used during fine-tuning. In this paper, we ask: "Can training data be extracted from these fine-tuned DMs shared online?" A successful extraction would present not only data leakage threats but also offer tangible evidence of copyright infringement. To answer this, we propose FineXtract, a framework for extracting fine-tuning data. Our method approximates fine-tuning as a gradual shift in the model's learned distribution---from the original pretrained DM toward the fine-tuning data. By extrapolating the models before and after fine-tuning, we guide the generation toward high-probability regions within the fine-tuned data distribution. We then apply a clustering algorithm to extract the most probable images from those generated using this extrapolated guidance. Experiments on DMs fine-tuned with datasets such as WikiArt, DreamBooth, and real-world checkpoints posted online validate the effectiveness of our method, extracting approximately 20\% of fine-tuning data in most cases, significantly surpassing baseline performance. The code is available anonymously.
Paperid:2441
Authors:Xiaoyuan Zhang · Peijie Li · Ying Ying YU · Yichi Zhang · Han Zhao · Qingfu Zhang
Abstract:
Distribution matching is a key technique in machine learning, with applications in generative models, domain adaptation, and algorithmic fairness. A related but less explored challenge is generating a distribution that aligns with multiple underlying distributions, often with conflicting objectives, known as a Pareto optimal distribution. In this paper, we develop a general theory based on information geometry to construct the Pareto set and front for the entire exponential family under KL and inverse KL divergences. This formulation allows explicit derivation of the Pareto set and front for multivariate normal distributions, enabling applications like multiobjective variational autoencoders (MOVAEs) to generate interpolated image distributions. Additionally, we propose a nonparametric approach using adversarial distribution matching, leading to the multiobjective generative adversarial network (MOGAN) algorithm. Experimental results on real-world images demonstrate that both algorithms can generate high-quality interpolated images across multiple distributions.
Paperid:2442
Authors:Sang Truong · Yuheng Tu · Percy Liang · Bo Li · Sanmi Koyejo
Abstract:
Current generative model evaluation procedures are costly and sensitive to test set selection, making iterative evaluation impractical. In this paper, we employ a model-based evaluation framework using Item Response Theory (IRT), which decouples model performance from the test subset selection, improving the reliability and efficiency of the evaluation. Further, we propose two innovations: amortized calibration to reduce the cost of estimating item parameters of the IRT model and a conditional item generator based on a large language model to automate diverse item generation. Our experiments on 25 common natural language benchmarking datasets and 184 language models show that this approach is more reliable and efficient compared to traditional evaluation methods, offering a scalable solution to evaluating generative models. Our implementation is available at \url{https://anonymous.4open.science/r/reeval-1187}.
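As background on the IRT component, a minimal two-parameter logistic (2PL) sketch models the probability that a model of ability theta answers an item correctly from the item's discrimination and difficulty. This is illustrative only; the exact parameterization used in the paper may differ:

```python
import math

def p_correct(theta, a, b):
    """2PL IRT: probability of a correct response for a model with
    ability theta on an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# The same model (theta = 0.5) on an easy item vs. a hard item:
easy = p_correct(0.5, a=1.0, b=-1.0)
hard = p_correct(0.5, a=1.0, b=2.0)
```

Because ability and difficulty live on a shared scale, fitted abilities are comparable across any subset of calibrated items, which is what decouples model performance from the choice of test subset.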
Paperid:2443
Authors:Changze Lv · Jingwen Xu · Yiyang Lu · Xiaohua Wang · Zhenghua Wang · Zhibo Xu · Di Yu · Xin Du · Xiaoqing Zheng · Xuanjing Huang
Abstract:
Backpropagation is the foundational algorithm for training neural networks and a key driver of deep learning's success. However, its biological plausibility has been challenged due to three primary limitations: weight symmetry, reliance on global error signals, and the dual-phase nature of training, as highlighted by the existing literature. Although various alternative learning approaches have been proposed to address these issues, most either fail to satisfy all three criteria simultaneously or yield suboptimal results. Inspired by the dynamics and plasticity of pyramidal neurons, we propose Dendritic Localized Learning (DLL), a novel learning algorithm designed to overcome these challenges. Extensive empirical experiments demonstrate that DLL satisfies all three criteria of biological plausibility while achieving state-of-the-art performance among algorithms that meet these requirements. Furthermore, DLL exhibits strong generalization across a range of architectures, including MLPs, CNNs, and RNNs. These results, benchmarked against existing biologically plausible learning algorithms, offer valuable empirical insights for future research. We hope this study inspires the development of new biologically plausible algorithms for training multilayer networks and advances progress in both neuroscience and machine learning.
Paperid:2444
Authors:Alexandre Blain · Thirion Bertrand · Pierre Neuvial
Abstract:
Split Conformal Prediction (SCP) provides a computationally efficient way to construct confidence intervals in prediction problems. Notably, most of the theory built around SCP is focused on the single test point setting. In real life, inference sets consist of multiple points, which raises the question of coverage guarantees for many points simultaneously. While *on average* the False Coverage Proportion (FCP) remains controlled, it can fluctuate strongly around its mean, the False Coverage Rate (FCR). We observe that when a dataset is split multiple times, classical SCP may not control the FCP in a majority of the splits. We propose CoJER, a novel method that achieves sharp FCP control in probability for conformal prediction, based on a recent characterization of the distribution of conformal $p$-values in a transductive setting. This procedure incorporates an aggregation scheme which provides robustness with respect to modeling choices. We show through extensive real-data experiments that CoJER provides FCP control while standard SCP does not. Furthermore, CoJER yields shorter intervals than the *state-of-the-art method* for FCP control and only slightly larger intervals than standard SCP.
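For reference, the standard single-point SCP baseline can be sketched in a few lines (an illustrative sketch with the usual absolute-residual score; variable names are assumptions):

```python
import numpy as np

def split_conformal_interval(y_cal, pred_cal, pred_test, alpha=0.1):
    """Split conformal: prediction +/- the (1 - alpha) empirical quantile
    (with finite-sample correction) of calibration residuals."""
    n = len(y_cal)
    scores = np.abs(y_cal - pred_cal)                 # conformity scores
    k = int(np.ceil((n + 1) * (1.0 - alpha)))         # finite-sample-corrected rank
    q = np.sort(scores)[min(k, n) - 1]
    return pred_test - q, pred_test + q

rng = np.random.default_rng(0)
y_cal = rng.normal(size=200)
pred_cal = np.zeros(200)                              # a trivial "model" predicting 0
lo, hi = split_conformal_interval(y_cal, pred_cal, pred_test=0.0)
```

The marginal guarantee attaches to a single test point; applied to a whole inference set, the realized FCP of such intervals fluctuates around alpha, which is precisely the failure mode the abstract describes.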
Paperid:2445
Authors:Yilong Chen · Junyuan Shang · Zhenyu Zhang · Jiawei Sheng · Tingwen Liu · Shuohuan Wang · Yu Sun · Hua Wu · Haifeng Wang
Abstract:
Transformer models encounter inefficiency when scaling hidden dimensions due to the uniform expansion of parameters. Delving into the sparsity of hidden dimensions, we observe that only a small subset of dimensions is highly activated: some dimensions are commonly activated across tokens, while others are uniquely activated for individual tokens. To leverage this, we propose MoHD (Mixture of Hidden Dimensions), a sparse architecture that combines shared sub-dimensions for common features and dynamically routes specialized sub-dimensions per token. To address the potential information loss from sparsity, we introduce activation scaling and group fusion mechanisms. MoHD efficiently expands hidden dimensions with minimal computational increase, outperforming vanilla Transformers in both parameter efficiency and task performance across 10 NLP tasks. MoHD achieves 1.7% higher performance with 50% fewer activated parameters, and 3.7% higher performance with a 3× expansion of total parameters at constant activated-parameter cost. MoHD offers a new perspective on model scaling, showcasing the potential of hidden dimension sparsity.
Paperid:2446
Authors:Alexander Lobashev · Dmitry Guskov · Maria Larchenko · Mikhail Tamm
Abstract:
This paper presents a novel method for analyzing the latent space geometry of generative models, including statistical physics models and diffusion models, by reconstructing the Fisher information metric. The method approximates the posterior distribution of latent variables given generated samples and uses this to learn the log-partition function, which defines the Fisher metric for exponential families. Theoretical convergence guarantees are provided, and the method is validated on the Ising and TASEP models, outperforming existing baselines in reconstructing thermodynamic quantities. Applied to diffusion models, the method reveals a fractal structure of phase transitions in the latent space, characterized by abrupt changes in the Fisher metric. We demonstrate that while geodesic interpolations are approximately linear within individual phases, this linearity breaks down at phase boundaries, where the diffusion model exhibits a divergent Lipschitz constant with respect to the latent space. These findings provide new insights into the complex structure of diffusion model latent spaces and their connection to phenomena like phase transitions.
Paperid:2447
Authors:Catherine Chen · Jingyan Shen · Xinyu Yang · Lihua Lei
Abstract:
Recent developments in large language models (LLMs) have led to their widespread usage for various tasks. The prevalence of LLMs in society demands assurance of the reliability of their performance. In particular, risk-sensitive applications demand meticulous attention to unexpectedly poor outcomes, i.e., tail events such as toxic answers, humiliating language, and offensive outputs. Due to the costly nature of acquiring human annotations, general-purpose scoring models have been created to automate the process of quantifying these tail events. This introduces potential human-machine misalignment between the respective scoring mechanisms. In this work, we present a lightweight calibration framework for black-box models that ensures the alignment of humans and machines with provable guarantees. Our framework provides a rigorous approach to controlling, with high confidence, any distortion risk measure that is characterized by a weighted average of quantiles of the loss incurred by the LLM. The theoretical foundation of our method rests on the connection between conformal risk control and a traditional family of statistics, L-statistics. To demonstrate the utility of our framework, we conduct comprehensive experiments that address the issue of human-machine misalignment.
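A weighted average of loss quantiles is estimated empirically by an L-statistic, i.e., a weighted average of order statistics; a minimal sketch (illustrative weights and names, not the paper's calibration procedure):

```python
import numpy as np

def l_statistic_risk(losses, weights):
    """Estimate a distortion risk measure as an L-statistic:
    a (normalized) weighted average of the sorted losses."""
    order_stats = np.sort(losses)
    w = np.asarray(weights, dtype=float)
    return float((w / w.sum()) @ order_stats)

losses = np.array([0.2, 0.9, 0.1, 0.5])
mean_risk = l_statistic_risk(losses, np.ones(4))              # uniform weights: the plain mean
tail_risk = l_statistic_risk(losses, np.array([0, 0, 1, 1]))  # upper half only: CVaR at level 1/2
```

Putting all the weight on the upper order statistics yields tail-sensitive measures such as CVaR, which is why this family is natural for controlling the tail events the abstract targets.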
Paperid:2448
Authors:Arsalan Sharifnassab · Saber Salehkaleybar · Rich Sutton
Abstract:
We address the challenge of optimizing metaparameters (hyperparameters) in machine learning, a key factor for efficient training and high model performance. Rather than relying on expensive meta-parameter search methods, we introduce MetaOptimize: a dynamic approach that adjusts meta-parameters, particularly step sizes (also known as learning rates), during training. More specifically, MetaOptimize can wrap around any first-order optimization algorithm, tuning step sizes on the fly to minimize a specific form of regret that considers the long-term impact of step sizes on training, through a discounted sum of future losses. We also introduce lower-complexity variants of MetaOptimize that, in conjunction with its adaptability to various optimization algorithms, achieve performance comparable to those of the best hand-crafted learning rate schedules across diverse machine learning tasks.
Paperid:2449
Authors:Lang Feng · Weihao Tan · Zhiyi Lyu · Longtao Zheng · Haiyang Xu · Ming Yan · Fei Huang · Bo An
Abstract:
Online fine-tuning of vision-language model (VLM) agents with reinforcement learning (RL) has shown promise for equipping agents with multi-step, goal-oriented capabilities in dynamic environments. However, their open-ended textual action space and the non-end-to-end nature of action generation present significant challenges to effective online exploration in RL, e.g., an explosion of the exploration space. We propose a novel online fine-tuning method, Counterfactual Soft Reinforcement Learning (CoSo), better suited to the textual output space of VLM agents. Compared to prior methods that assign uniform uncertainty to all tokens, CoSo leverages counterfactual reasoning to dynamically assess the causal influence of individual tokens on post-processed actions. By prioritizing the exploration of action-critical tokens while reducing the impact of semantically redundant or low-impact tokens, CoSo enables a more targeted and efficient online rollout process. We provide theoretical analysis proving CoSo's convergence and policy improvement guarantees, and extensive empirical evaluations supporting CoSo's effectiveness. Our results across a diverse set of agent tasks, including Android device control, card gaming, and embodied AI, highlight its remarkable ability to enhance exploration efficiency and deliver consistent performance gains.
Paperid:2450
Authors:Fengqiang Wan · Yang Yang
Abstract:
Incremental learning (IL) aims to sequentially learn new tasks while mitigating catastrophic forgetting. Among various IL strategies, parameter-isolation methods stand out by using mask techniques to allocate distinct parameters to each task, explicitly addressing forgetting. However, existing approaches often disregard parameter dependencies, resulting in an over-reliance on newly allocated parameters. To address this issue, we propose Probabilistic Group Mask selection (PGM), a group-wise approach that captures parameter dependencies by exploring candidate masks within each group. Specifically, PGM partitions parameters into groups with multiple candidate masks, assigning probabilities to these masks and leveraging Gumbel-Softmax for differentiable sampling, enabling efficient optimization of the discrete mask selection process. Our theoretical analysis demonstrates that incorporating parameter dependencies enhances sub-network selection. Experiments conducted on standard benchmarks confirm PGM's superior effectiveness compared to existing IL approaches. The source code is available at: https://anonymous.4open.science/r/PGM-A139.
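The Gumbel-Softmax ingredient mentioned above can be sketched in a few lines of numpy; the group size, number of candidate masks, and temperature below are illustrative assumptions, not PGM's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0):
    """Draw a relaxed one-hot sample over candidate masks.

    Adds Gumbel(0, 1) noise to the logits and applies a temperature-
    scaled softmax; as tau -> 0 the sample approaches a hard one-hot,
    while staying differentiable with respect to the logits.
    """
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    z = (logits + g) / tau
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

# One parameter group with 4 candidate binary masks over 6 parameters.
candidate_masks = rng.integers(0, 2, size=(4, 6)).astype(float)
logits = np.zeros(4)                     # learnable mask scores

probs = gumbel_softmax(logits, tau=0.5)
soft_mask = probs @ candidate_masks      # relaxed combination of masks
```

Because the sample is a smooth function of the logits, the discrete choice among candidate masks can be optimized by ordinary gradient descent.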
Paperid:2451
Authors:Keitaro Sakamoto · Issei Sato
Abstract:
The attention mechanism is a fundamental component of the transformer model and plays a significant role in its success. However, the theoretical understanding of how attention learns to select tokens is still an emerging area of research. In this work, we study the training dynamics and generalization ability of the attention mechanism, under classification problems with label noise. We show that, with the characterization of the signal-to-noise ratio (SNR), the token selection of the attention mechanism achieves ``benign overfitting'', i.e., maintaining high generalization performance despite fitting label noise. Our work also demonstrates an interesting delayed acquisition of generalization after an initial phase of overfitting. Finally, we provide experiments to support our theoretical analysis using both synthetic and real-world datasets.
Paperid:2452
Authors:Kevin Xu · Issei Sato
Abstract:
Looped Transformers provide advantages in parameter efficiency, computational capabilities, and generalization for reasoning tasks. However, their expressive power regarding function approximation remains underexplored. In this paper, we establish the approximation rate of Looped Transformers by defining the modulus of continuity for sequence-to-sequence functions. This reveals a limitation specific to the looped architecture, and the analysis motivates the incorporation of scaling parameters for each loop, conditioned on timestep encoding. Experiments validate the theoretical results, showing that increasing the number of loops enhances performance, with further gains achieved through the timestep encoding.
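A toy numpy sketch of the architectural idea: a single weight-tied block applied for a fixed number of loops, with one learnable scalar per loop standing in for the timestep-conditioned scaling (the block form and shapes are illustrative assumptions, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(0)

def looped_block(x, W, b, scales):
    """Apply one shared (weight-tied) block for len(scales) loops.

    Every loop reuses the same W and b; the only loop-specific
    parameters are the scalars scales[t], indexed by timestep t.
    """
    h = x
    for t in range(len(scales)):
        h = h + scales[t] * np.tanh(h @ W + b)  # residual, shared weights
    return h

d = 8
W = rng.standard_normal((d, d)) / np.sqrt(d)
b = np.zeros(d)
scales = np.full(12, 0.5)   # one scaling parameter per loop (T = 12)
out = looped_block(rng.standard_normal((4, d)), W, b, scales)
```

Parameter count is independent of the loop count T except for the T scalars, which is what makes per-loop scaling a cheap way to increase expressivity.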
Paperid:2453
Authors:Lucas Rosenblatt · Bin Han · Robert Wolfe · Bill Howe
Abstract:
Large language models (LLMs) can leak sensitive training data through memorization and membership inference attacks. Prior work has primarily focused on strong adversarial assumptions, including attacker access to entire samples or long, ordered prefixes, leaving open the question of how vulnerable LLMs are when adversaries have only partial, unordered sample information. For example, if an attacker knows a patient has "hypertension," under what conditions can they query a model fine-tuned on patient data to learn the patient also has "osteoarthritis"? In this paper, we introduce a more general threat model under this weaker assumption and show that fine-tuned LLMs are susceptible to these fragment-specific extraction attacks. To systematically investigate these attacks, we propose two data-blind methods: (1) a likelihood ratio attack inspired by methods from membership inference, and (2) a novel approach, PRISM, which regularizes the ratio by leveraging an external prior. Using examples from both medical and legal settings, we show that both methods are competitive with a data-aware baseline classifier that assumes access to labeled in-distribution data, underscoring their robustness.
Paperid:2454
Authors:Ekaterina Filimoshina · Dmitry Shirokov
Abstract:
We propose, implement, and compare with competitors a new architecture of equivariant neural networks based on geometric (Clifford) algebras: Generalized Lipschitz Group Equivariant Neural Networks (GLGENN). These networks are equivariant to all pseudo-orthogonal transformations, including rotations and reflections, of a vector space with any non-degenerate or degenerate symmetric bilinear form. We propose a weight-sharing parametrization technique that takes into account the fundamental structures and operations of geometric algebras. Due to this technique, the GLGENN architecture is parameter-light and less prone to overfitting than baseline equivariant models. GLGENN outperforms or matches competitors on several benchmark equivariant tasks, including estimation of an equivariant function and a convex hull experiment, while using significantly fewer optimizable parameters.
Paperid:2455
Authors:Shuo Xie · Tianhao Wang · Sashank J. Reddi · Sanjiv Kumar · Zhiyuan Li
Abstract:
In this paper, we present a novel unified analysis for a broad class of adaptive optimization algorithms with structured (e.g., layer-wise, diagonal, and Kronecker-factored) preconditioners for both online regret minimization and offline convex optimization. Our analysis not only provides matching rates for several important structured preconditioned algorithms, including diagonal AdaGrad, full-matrix AdaGrad, and AdaGrad-Norm, but also gives an improved convergence rate for a one-sided variant of Shampoo over that of the original Shampoo. Interestingly, more structured preconditioners (e.g., diagonal AdaGrad and AdaGrad-Norm, which use less space and compute) are often presented merely as computationally efficient approximations to full-matrix AdaGrad, with better approximations expected to yield better optimization performance. Our unified analysis challenges this prevailing view and reveals, perhaps surprisingly, that more structured preconditioners, despite using less space and computation per step, can outperform their less structured counterparts. To demonstrate this, we show that one-sided Shampoo, which is much cheaper than full-matrix AdaGrad, can outperform it both theoretically and experimentally.
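For concreteness, the two most structured preconditioners named above differ only in how second-moment statistics are accumulated; a minimal numpy sketch of the standard textbook updates (not the paper's analysis):

```python
import numpy as np

def adagrad_diagonal_step(w, g, state, lr=0.1, eps=1e-8):
    """Diagonal AdaGrad: one squared-gradient accumulator per coordinate."""
    state = state + g * g
    return w - lr * g / (np.sqrt(state) + eps), state

def adagrad_norm_step(w, g, state, lr=0.1, eps=1e-8):
    """AdaGrad-Norm: a single scalar accumulator for the whole vector."""
    state = state + float(g @ g)
    return w - lr * g / (np.sqrt(state) + eps), state

# one step from w = 0 with gradient (3, 4)
w0 = np.zeros(2)
g = np.array([3.0, 4.0])
w_diag, _ = adagrad_diagonal_step(w0, g, np.zeros(2))
w_norm, _ = adagrad_norm_step(w0, g, 0.0)
```

Diagonal AdaGrad stores a vector the size of the parameters; AdaGrad-Norm stores one scalar, i.e., even less space per step, which is the regime the abstract's surprising comparison concerns.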
Paperid:2456
Authors:Hanru Bai · Weiyang Ding
Abstract:
Neural ordinary differential equations (NODEs) have demonstrated strong capabilities in modeling time series. However, existing NODE-based methods often focus solely on the surface-level dynamics derived from observed states, which limits their ability to capture more complex underlying behaviors. To overcome this challenge, we propose KoNODE, a Koopman-driven NODE framework that explicitly models the evolution of ODE parameters over time to encode deep-level information. KoNODE captures the essential yet simple intrinsic linear dynamics that govern the surface dynamics by employing Koopman operators. Our framework operates at three hierarchical levels: the observed state dynamics, the parameter dynamics, and the Koopman linear dynamics, the last representing the fundamental driving rules of the state dynamics. The proposed approach offers significant improvements in two critical time series tasks: long-term prediction (enabled by the simple linear dynamics) and generalization to new data (driven by the evolving ODE parameters). We validate KoNODE through experiments on synthetic data from complex dynamical systems and on real-world datasets, demonstrating its effectiveness in practical scenarios.
Paperid:2457
Authors:Fengchun Qiao · Yanlin Chen · Xi Peng
Abstract:
Ensemble learning is a powerful approach for improving generalization under distribution shifts, but its effectiveness heavily depends on how individual models are combined. Existing methods often optimize ensemble weights based on validation data, which may not represent unseen test distributions, leading to suboptimal performance in out-of-distribution (OoD) settings. Inspired by Distributionally Robust Optimization (DRO), we propose Structure-informed Risk Minimization (SRM), a principled framework that learns robust ensemble weights without access to test data. Unlike standard DRO, which defines uncertainty sets based on divergence metrics alone, SRM incorporates structural information of training distributions, ensuring that the uncertainty set aligns with plausible real-world shifts. This approach mitigates the over-pessimism of traditional worst-case optimization while maintaining robustness. We introduce a computationally efficient optimization algorithm with theoretical guarantees and demonstrate that SRM achieves superior OoD generalization compared to existing ensemble combination strategies across diverse benchmarks.
Paperid:2458
Authors:Kevin Han Huang · Ni Zhan · Elif Ertekin · Peter Orbanz · Ryan P. Adams
Abstract:
Incorporating group symmetries into neural networks has been a cornerstone of success in many AI-for-science applications. Diagonal groups of isometries, which describe invariance under a simultaneous movement of multiple objects, arise naturally in many-body quantum problems. Despite their importance, diagonal groups have received relatively little attention, as they lack a natural choice of invariant maps except in special cases. We study different ways of incorporating diagonal invariance in neural network ansätze trained via variational Monte Carlo methods, considering specifically data augmentation, group averaging, and canonicalization. We show that, contrary to standard ML setups, in-training symmetrization destabilizes training and can lead to worse performance. Our theoretical and numerical results indicate that this unexpected behavior may arise from a unique computational-statistical tradeoff not found in standard ML analyses of symmetrization. Meanwhile, we demonstrate that post hoc averaging is less sensitive to such tradeoffs and emerges as a simple, flexible and effective method for improving neural network solvers.
Paperid:2459
Authors:Aditya Gangrade · Aldo Pacchiano · Clay Scott · Venkatesh Saligrama
Abstract:
We study the 'feasible action search' (FAS) problem for linear bandits, wherein a learner attempts to discover a feasible point for a set of linear constraints $\Phi_* a \ge 0,$ without knowledge of the matrix $\Phi_* \in \mathbb{R}^{m \times d}$. A FAS learner selects a sequence of actions $a_t,$ and uses observations of the form $\Phi_* a_t + \mathrm{noise}$ to either find a point with nearly optimal 'safety margin', or detect that the constraints are infeasible, where the safety margin of an action measures its (signed) distance from the constraint boundary. While of interest in its own right, the FAS problem also directly addresses a key deficiency in the extant theory of 'safe linear bandits' (SLBs), by discovering a safe initialisation for low-regret SLB methods. We propose and analyse a novel efficient FAS-learner. Our method, FAST, is based on Thompson Sampling. It applies a _coupled_ random perturbation to an estimate of $\Phi_*,$ and plays a maximin point of a game induced by this perturbed matrix. We prove that FAST stops in $\tilde{O}(d^3/\varepsilon^2 M_*^2)$ steps, and incurs $\tilde{O}(d^3/|M_*|)$ safety costs, to either correctly detect infeasibility, or output a point that is at least $(1-\varepsilon) M_*$-safe, where $M_*$ is the _optimal safety margin_ of $\Phi_*$. Further, instantiating prior SLB methods with the output of FAST yields the first SLB methods that incur $\tilde{O}(\sqrt{d^3 T/M_*^2})$ regret and $O(1)$ risk without a priori knowledge of a safe action. The main technical novelty lies in the extension of Thompson Sampling to this multi-objective setting, for which we both propose a coupled noise design and provide an analysis that avoids convexity considerations.
Paperid:2460
Authors:Xingyu Wu · Jibin Wu · Yu Zhou · Liang Feng · KC Tan
Abstract:
Algorithm selection aims to identify the best-performing algorithm before execution. Existing techniques typically focus on the observed correlations between algorithm performance and meta-features. However, little research has explored the underlying mechanisms of algorithm selection, specifically what characteristics an algorithm must possess to effectively tackle problems with certain feature values. This gap not only limits explainability but also makes existing models vulnerable to data bias and distribution shift. This paper introduces a directed acyclic graph (DAG) to describe this mechanism, proposing a novel modeling paradigm that aligns more closely with the fundamental logic of algorithm selection. By leveraging the DAG to characterize the algorithm feature distribution conditioned on problem features, our approach enhances robustness against marginal distribution changes and allows for finer-grained predictions through the reconstruction of optimal algorithm features, with the final decision relying on differences between reconstructed and rejected algorithm features. Furthermore, we demonstrate that the learned DAG and the proposed counterfactual calculations offer our approach both model-level and instance-level explainability. Extensive experiments on the ASlib benchmark validate the advantages of the proposed model in terms of robustness and explainability. The code will be made publicly available after the review process.
Paperid:2461
Authors:Xiang Fang · Arvind Easwaran · Blaise Genest
Abstract:
Out-of-distribution (OOD) detection attempts to distinguish outlier samples to prevent models trained on the in-distribution (ID) dataset from producing unreliable outputs. Most OOD detection methods require many ID samples for training, which seriously limits their real-world applications. To this end, we target a challenging setting: few-shot OOD detection, where only a few labeled ID samples are available. Few-shot OOD detection is therefore much more challenging than the traditional OOD detection setting. Previous few-shot OOD detection works ignore the distinct diversity between different classes. In this paper, we propose a novel network, the Adaptive Multi-prompt Contrastive Network (AMCN), which adapts the ID-OOD separation boundary by learning inter- and intra-class distributions. To compensate for the absence of OOD image samples and the scarcity of ID image samples, we leverage CLIP, connecting text with images, and engineer learnable ID and OOD textual prompts. Specifically, we first generate adaptive prompts (learnable ID prompts, label-fixed OOD prompts and label-adaptive OOD prompts). Then, we generate an adaptive class boundary for each class by introducing a class-wise threshold. Finally, we propose a prompt-guided ID-OOD separation module to control the margin between ID and OOD prompts. Experimental results show that our proposed method outperforms other state-of-the-art works.
Paperid:2462
Authors:Zhenghai Xue · Lang Feng · Jiacheng Xu · Kang Kang · xiang wen · Bo An · Shuicheng YAN
Abstract:
To learn from data collected under diverse dynamics, Imitation from Observation (IfO) methods leverage expert state trajectories based on the premise that recovering expert state distributions in other dynamics facilitates policy learning in the current one. However, Imitation Learning inherently imposes a performance upper bound on learned policies. Additionally, as the environment dynamics change, certain expert states may become inaccessible, rendering their distributions less valuable for imitation. To address this, we propose a novel framework that integrates reward maximization with IfO, employing F-distance regularized policy optimization. This framework enforces constraints on globally accessible states—those with nonzero visitation frequency across all considered dynamics—mitigating the challenge posed by inaccessible states. By instantiating the F-distance in different ways, we derive two theoretical analyses and develop a practical algorithm called Accessible State Oriented Policy Regularization (ASOR). ASOR serves as a general-purpose module that can be incorporated into various RL approaches, including offline RL and off-policy RL. Extensive experiments across multiple benchmarks demonstrate ASOR's effectiveness in enhancing state-of-the-art cross-domain policy transfer algorithms, significantly improving their performance.
Paperid:2463
Authors:Xingyu Zhu · Abhishek Panigrahi · Sanjeev Arora
Abstract:
We formalize a new concept for LLMs, context-enhanced learning. It involves standard gradient-based learning on text, except that the context is enhanced with additional data on which no auto-regressive gradients are computed. This setting is a gradient-based analog of usual in-context learning (ICL) and appears in some recent works. Using a multi-step reasoning task, we prove in a simplified setting that context-enhanced learning can be exponentially more sample-efficient than standard learning when the model is capable of ICL. At a mechanistic level, we find that the benefit of context-enhancement arises from a more accurate gradient learning signal. We also experimentally demonstrate that it appears hard to detect or recover learning materials that were used in the context during training. This may have implications for data security as well as copyright.
Paperid:2464
Authors:Hanlin Yu · Arto Klami · Aapo Hyvarinen · Anna Korba · L'Emir Omar Chehab
Abstract:
Density ratio estimation in high dimensions can be reframed as integrating a certain quantity, the time score, over probability paths which interpolate between the two densities. In practice, the time score has to be estimated based on samples from the two densities. However, existing methods for this problem remain computationally expensive and can yield inaccurate estimates. Inspired by recent advances in generative modeling, we introduce a novel framework for time score estimation, based on a conditioning variable. Choosing the conditioning variable judiciously enables a closed-form objective function. We demonstrate that, compared to previous approaches, our approach results in faster learning of the time score and competitive or better estimation accuracy of the density ratio on challenging tasks. Furthermore, we establish theoretical guarantees on the error of the estimated density ratio.
Paperid:2465
Authors:Aditi Ashutosh Mavalankar · Hassan Mansoor · Zita Marinho · Mariia Samsikova · Tom Schaul
Abstract:
Scaling up inference-time compute has proven to be a valuable strategy in improving the performance of Large Language Models (LLMs) without fine-tuning. An important task that can benefit from additional inference-time compute is self-repair; given an initial flawed response or guess, the LLM corrects its own mistake and produces an improved response or fix. We leverage the in-context learning ability of LLMs to perform self-repair in the coding domain. The key contribution of our paper is an approach that synthesises and selects an ordered set of golden example pairs, or AuPairs, of these initial guesses and subsequent fixes for the corresponding problems. Each such AuPair is provided as a single in-context example at inference time to generate a repaired solution. For an inference-time compute budget of $N$ LLM calls per problem, $N$ AuPairs are used to generate $N$ repaired solutions, out of which the highest-scoring solution is the final answer. The underlying intuition is that if the LLM is given a different example of fixing an incorrect guess each time, it can subsequently generate a diverse set of repaired solutions. Our algorithm selects these AuPairs in a manner that maximises complementarity and usefulness. We demonstrate the results of our algorithm on 5 LLMs across 7 competitive programming datasets for the code repair task. Our algorithm yields a significant boost in performance compared to best-of-$N$ and self-repair, and also exhibits strong generalisation across datasets and models. Moreover, our approach shows stronger scaling with inference-time compute budget compared to baselines.
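Complementarity-driven selection of example pairs can be illustrated with a generic greedy-coverage rule: repeatedly add the candidate pair with the largest marginal gain over the best score already achieved on each problem. The score matrix and criterion below are assumptions for illustration, not the paper's exact algorithm:

```python
import numpy as np

def greedy_select_pairs(scores, k):
    """Greedily pick k example pairs maximising marginal coverage.

    scores[i, j] = repair quality of candidate pair i on problem j.
    Each step adds the pair that most improves the best-so-far score
    per problem (rewarding complementarity, not raw average quality).
    """
    n_pairs, n_problems = scores.shape
    best = np.zeros(n_problems)
    chosen = []
    for _ in range(k):
        gains = np.maximum(scores, best).sum(axis=1) - best.sum()
        gains[chosen] = -np.inf           # do not reselect a pair
        i = int(np.argmax(gains))
        chosen.append(i)
        best = np.maximum(best, scores[i])
    return chosen

# pair 1 is broadly useful; pair 2 covers a problem the others miss
scores = np.array([[1.0, 0.0, 0.0],
                   [0.9, 0.9, 0.0],
                   [0.0, 0.0, 0.8]])
picked = greedy_select_pairs(scores, 2)
```

Note the greedy rule prefers pair 2 over pair 0 on the second pick even though pair 0 has a higher single-problem score, because pair 2 covers a problem not yet handled.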
Paperid:2466
Authors:Honglin Yuan · Xingfeng Li · Jian Dai · Xiaojian You · Yuan Sun · Zhenwen Ren
Abstract:
Existing deep multi-view clustering methods have demonstrated excellent performance, addressing issues such as missing views and view noise. However, almost all existing methods operate within a static framework, which assumes that all views have already been collected. In practical scenarios, new views are continuously collected over time, forming a stream of views. Additionally, there exists an imbalance in quality and distribution between different view streams, i.e., the concept drift problem. To this end, we propose a novel Deep Streaming View Clustering (DSVC) method, which mitigates the impact of concept drift on streaming view clustering. Specifically, DSVC consists of a knowledge base and three core modules. Through the knowledge aggregation learning module, DSVC extracts representative features and prototype knowledge from the new view. Subsequently, the distribution consistency learning module aligns the prototype knowledge from the current view with the historical knowledge distribution to mitigate the impact of concept drift. Then, the knowledge guidance learning module leverages the prototype knowledge to guide the data distribution and enhance the clustering structure. Finally, the prototype knowledge from the current view is updated in the knowledge base to guide the learning of subsequent views. Extensive experiments demonstrate that, even in dynamic environments, DSVC outperforms 12 state-of-the-art DMVC methods built on static frameworks.
Paperid:2467
Authors:Qihe Huang · Zhengyang Zhou · Kuo Yang · Zhongchao Yi · Xu Wang · Yang Wang
Abstract:
Long-term time series forecasting (LTSF) has traditionally relied on large parameter counts to capture extended temporal dependencies, resulting in substantial computational costs and inefficiencies in both memory usage and processing time. However, time series data, unlike high-dimensional images or text, often exhibit temporal pattern similarity and low-rank structure, especially over long horizons. By leveraging this structure, models can be guided to focus on more essential, concise temporal features, improving both accuracy and computational efficiency. In this paper, we introduce TimeBase, an ultra-lightweight network with fewer than 0.39k parameters, designed to harness the power of minimalism in LTSF. TimeBase 1) extracts core basis temporal components and 2) transforms traditional point-level forecasting into efficient segment-level forecasting, achieving optimal utilization of both data and parameters. Extensive experiments conducted on 21 real-world datasets of varying scales and domains demonstrate that TimeBase not only achieves minimalism in both model size and computational cost, reducing MACs by 15.2x, parameter counts by 342.5x, and inference time by 3.3x compared to the lightest prediction models, but also delivers state-of-the-art forecasting performance, ranking in the top 2 across 158 out of 168 prediction settings. Additionally, TimeBase can serve as a very effective plug-and-play complexity reducer for any patch-based forecasting model, enabling extreme complexity reduction (i.e., 77.74%~93.00% MACs reduction for PatchTST) while improving prediction accuracy. Code is available at https://anonymous.4open.science/r/TimeBase_ICML.
Paperid:2468
Authors:Zequn Yang · Hongfa Wang · Di Hu
Abstract:
Interactions between modalities—redundancy, uniqueness, and synergy—collectively define the structure of multimodal information. These interactions are essential for understanding the underlying dynamics of multimodal data and determining the performance of multimodal systems. While significant progress has been made in quantifying interactions at the distribution level, estimating sample-level interactions in real-world datasets remains a significant challenge. To address this, we propose a Lightweight Sample-wise Multimodal Interaction (LSMI) estimation approach, rooted in pointwise information theory. Starting with redundancy—the most intuitive interaction to measure—we develop a redundancy estimation framework using a suitable pointwise information measure. Building on this framework, we design an interaction estimation paradigm based on an efficient entropy estimation method, well-suited for sample-wise analysis in continuous distributions. We validate our approach through extensive experiments on meticulously designed synthetic datasets and various real-world datasets, demonstrating precision and efficiency. Crucially, our sample-wise estimation reveals fine-grained sample/category dynamics in multimodal data, enhancing interpretability by distilling knowledge from interactions and capturing sub-model specificity in ensemble learning.
Paperid:2469
Authors:Ruchika Chavhan · Abhinav Mehrotra · Malcolm Chadwick · Alberto Gil Couto Pimentel Ramos · Luca Morreale · Mehdi Noroozi · Sourav Bhattacharya
Abstract:
Text-to-image synthesis has witnessed remarkable advancements in recent years. Many attempts have been made to adapt text-to-image models to support multiple tasks. However, existing approaches typically require resource-intensive re-training or additional parameters to accommodate the new tasks, which makes the model inefficient for on-device deployment. We propose Multi-Task Upcycling (MTU), a simple yet effective recipe that extends the capabilities of a pre-trained text-to-image diffusion model to support a variety of image-to-image generation tasks. MTU replaces Feed-Forward Network (FFN) layers in the diffusion model with smaller FFNs, referred to as experts, and combines them with a dynamic routing mechanism. To the best of our knowledge, MTU is the first multi-task diffusion modeling approach that seamlessly blends multi-tasking with on-device compatibility, by mitigating the issue of parameter inflation. We show that the performance of MTU is on par with single-task fine-tuned diffusion models across several tasks, including image editing, super-resolution, and inpainting, while maintaining similar latency and computational load (GFLOPs) as the single-task fine-tuned models.
Paperid:2470
Authors:Alexander Capstick · Rahul G. Krishnan · Payam Barnaghi
Abstract:
Large language models (LLMs) acquire a breadth of information across various domains. However, their computational complexity, cost, and lack of transparency often hinder their direct application for predictive tasks where privacy and interpretability are paramount. In fields such as healthcare, biology, and finance, specialised and interpretable linear models still hold considerable value. In such domains, labelled data may be scarce or expensive to obtain. Well-specified prior distributions over model parameters can reduce the sample complexity of learning through Bayesian inference; however, eliciting expert priors can be time-consuming. We therefore introduce AutoElicit to extract knowledge from LLMs and construct priors for predictive models. We show these priors are informative and can be refined using natural language. We perform a careful study contrasting AutoElicit with in-context learning and demonstrate how to perform model selection between the two methods. We find that AutoElicit yields priors that can substantially reduce error over uninformative priors, using fewer labels, and consistently outperform in-context learning. We show that AutoElicit saves over 6 months of labelling effort when building a new predictive model for urinary tract infections from sensor recordings of people living with dementia.
Paperid:2471
Authors:Alireza Kazemipour · Matthew Taylor · Michael Bowling
Abstract:
A tenet of reinforcement learning is that rewards are always observed by the agent. However, this is not true in many realistic settings, e.g., a human observer may not always be able to provide rewards, a sensor to observe rewards may be limited or broken, or rewards may be unavailable during deployment. Monitored Markov decision processes (Mon-MDPs) have recently been proposed as a model of such settings. Yet, Mon-MDP algorithms developed thus far do not fully exploit the problem structure, cannot take advantage of a known monitor, have no worst-case guarantees for "unsolvable" Mon-MDPs without specific initialization, and only have asymptotic proofs of convergence. This paper makes three contributions. First, we introduce a model-based algorithm for Mon-MDPs that addresses all of these shortcomings. The algorithm uses two instances of model-based interval estimation, one to guarantee that observable rewards are indeed observed, and another to learn the optimal policy. Second, empirical results demonstrate these advantages, showing faster convergence than prior algorithms in over two dozen benchmark settings, and even more dramatic improvements when the monitor process is known. Third, we present the first finite-sample bound on performance and show convergence to an optimal worst-case policy when some rewards are never observable.
Paperid:2472
Authors:Chao Li · Jiawei Fan · Anbang Yao
Abstract:
In this paper, we present Morse, a simple and universal framework for accelerating diffusion models. The key insight of Morse is to reformulate the iterative generation (from noise to data) process via taking advantage of fast jump sampling and adaptive residual feedback strategies. Specifically, Morse involves two models called Dash and Dot that interact with each other. The Dash model is just the pretrained diffusion model of any type, but operates in a jump sampling regime, creating sufficient space for sampling efficiency improvement. The Dot model is significantly faster than the Dash model, which is learnt to generate residual feedback conditioned on the observations at the current jump sampling point on the trajectory of the Dash model, lifting the noise estimate to easily match the next-step estimate of the Dash model without jump sampling. By chaining the outputs of the Dash and Dot models run in a time-interleaved fashion, Morse exhibits the merit of flexibly attaining desired image generation performance while improving overall runtime efficiency. With our proposed weight sharing strategy between the Dash and Dot models, Morse is efficient for training and inference. We validate the efficacy of our method under a variety of experimental setups. Our method shows an average speedup of 1.78× to 3.31× over a wide range of sampling step budgets relative to baseline diffusion models. Furthermore, we show that our method can be also generalized to improve the Latent Consistency Model (LCM-SDXL, which is already accelerated with consistency distillation technique) tailored for few-step text-to-image synthesis. The code will be made publicly available.
Paperid:2473
Authors:Elizabeth M Fons Etcheverry · Alejandro Sztrajman · Yousef El-Laham · Luciana Ferrer · Svitlana Vyetrenko · Manuela Veloso
Abstract:
Time series imputation with missing or irregularly sampled data is a persistent challenge in machine learning. Most frequency-domain methods rely on the Fast Fourier Transform (FFT), which assumes uniform sampling, therefore requiring interpolation or imputation prior to frequency estimation. We propose a novel diffusion-based imputation approach (LSCD) that leverages Lomb--Scargle periodograms to robustly handle missing and irregular samples without requiring interpolation or imputation in the frequency domain. Our method trains a score-based diffusion model conditioned on the entire signal spectrum, enabling direct usage of irregularly spaced observations. Experiments on synthetic and real-world benchmarks demonstrate that our method recovers missing data more accurately than purely time-domain baselines, while simultaneously producing consistent frequency estimates. Crucially, our framework paves the way for broader adoption of Lomb--Scargle methods in machine learning tasks involving incomplete or irregular data.
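The Lomb--Scargle periodogram at the heart of this approach is computed directly from irregular sample times, with no interpolation onto a regular grid; a self-contained numpy sketch of the classical estimator (the demo signal and frequency grid are illustrative):

```python
import numpy as np

def lomb_scargle(t, y, freqs):
    """Classical Lomb-Scargle periodogram for irregularly sampled y(t).

    For each trial frequency, a phase offset tau decouples the sine and
    cosine least-squares fits, so the power is a sum of two independent
    projections of the mean-centred signal.
    """
    y = y - y.mean()
    power = np.empty_like(freqs)
    for i, f in enumerate(freqs):
        w = 2 * np.pi * f
        tau = np.arctan2(np.sum(np.sin(2 * w * t)),
                         np.sum(np.cos(2 * w * t))) / (2 * w)
        c = np.cos(w * (t - tau))
        s = np.sin(w * (t - tau))
        power[i] = 0.5 * ((y @ c) ** 2 / (c @ c) + (y @ s) ** 2 / (s @ s))
    return power

rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0, 20, 300))       # irregular sample times
y = np.sin(2 * np.pi * 0.7 * t) + 0.1 * rng.standard_normal(300)
freqs = np.linspace(0.05, 2.0, 400)
peak = freqs[np.argmax(lomb_scargle(t, y, freqs))]  # should recover ~0.7
```

Because the estimator never needs uniformly spaced samples, the spectrum it produces can condition a model on signals with arbitrary missingness patterns.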
Paperid:2474
Authors:Marc Jourdan · Gizem Yüce · Nicolas Flammarion
Abstract:
Recent advances in language modeling have underscored the role of preference feedback in enhancing model performance. This paper investigates the conditions under which preference feedback improves parameter estimation in classes of continuous parametric distributions. In our framework, the learner observes pairs of samples from an unknown distribution along with their relative preferences depending on the same unknown parameter. We show that preference-based M-estimators achieve a better asymptotic variance than sample-only M-estimators, further improved by deterministic preferences. Leveraging the hard constraints revealed by deterministic preferences, we propose an estimator achieving an estimation error scaling of $\mathcal{O}(1/n)$---a significant improvement over the $\Theta(1/\sqrt{n})$ rate attainable with samples alone. Next, we establish a lower bound that matches this accelerated rate, up to problem-dependent constants. While the assumptions underpinning our analysis are restrictive, they are satisfied by notable cases such as Gaussian or Laplace distributions for preferences based on the log-probability reward.
Paperid:2475
Authors:Shaokun Zhang · Ming Yin · Jieyu Zhang · Jiale Liu · Zhiguang Han · Jingyang Zhang · Beibin Li · Chi Wang · Huazheng Wang · Yiran Chen · Qingyun Wu
Abstract:
Failure attribution in LLM multi-agent systems—identifying the agent and step responsible for task failures—provides crucial clues for system debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. To support this initiative, we introduce the Who\&When dataset, comprising extensive failure logs from 127 LLM multi-agent systems with fine-grained annotations linking failures to specific agents and decisive error steps. Using Who\&When, we develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5\% accuracy in identifying failure-responsible agents but only 14.2\% in pinpointing failure steps, with some methods performing below random. Even SOTA reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task's complexity and the need for further research in this area.
Paperid:2476
Authors:Paris Giampouras · HanQin Cai · Rene Vidal
Abstract:
In this paper, we focus on a matrix factorization-based approach for robust recovery of low-rank asymmetric matrices from corrupted measurements. We propose an Overparameterized Preconditioned Subgradient Algorithm (OPSA) and provide, for the first time in the literature, linear convergence rates independent of the rank of the sought asymmetric matrix in the presence of gross corruptions. Our work goes beyond existing results in preconditioned-type approaches addressing their current limitation, i.e., the lack of convergence guarantees in the case of asymmetric matrices of unknown rank. By applying our approach to (robust) matrix sensing, we highlight its merits when the measurement operator satisfies a mixed-norm restricted isometry property. Lastly, we present extensive numerical experiments that validate our theoretical results and demonstrate the effectiveness of our approach for different levels of overparameterization and corruption from outliers.
Paperid:2477
Authors:Ruben Weitzman · Peter Mørch Groth · Lood van Niekerk · Aoi Otani · Yarin Gal · Debora Marks · Pascal Notin
Abstract:
Retrieving homologous protein sequences is essential for a broad range of protein modeling tasks such as fitness prediction, protein design, structure modeling, and protein-protein interactions. Traditional workflows have relied on a two-step process: first retrieving homologs via Multiple Sequence Alignments (MSA), then training models on one or more of these alignments. However, MSA-based retrieval is computationally expensive, struggles with highly divergent sequences and complex insertions/deletions, and operates independently of downstream modeling. We introduce Protriever, an end-to-end differentiable framework that unifies retrieval and task modeling. Focusing on protein fitness prediction, we show that Protriever achieves performance on par with the most sensitive MSA-based tools while being orders of magnitude faster at retrieval, as it relies on efficient vector search. Protriever is both architecture- and task-agnostic, and can flexibly adapt to different retrieval strategies and protein databases at inference—offering a scalable alternative to alignment-centric approaches.
Paperid:2478
Authors:Sungwon Han · Seungeon Lee · MEEYOUNG CHA · Sercan Arik · Jinsung Yoon
Abstract:
Time series forecasting uses historical data to predict future trends, leveraging the relationships between past observations and available features. In this paper, we propose RAFT, a retrieval-augmented time series forecasting method to provide sufficient inductive biases and complement the model's learning capacity. When forecasting the subsequent time frames, we directly retrieve historical data candidates from the training dataset with patterns most similar to the input, and utilize the future values of these candidates alongside the inputs to obtain predictions. This simple approach augments the model's capacity by externally providing information about past patterns via retrieval modules. Our empirical evaluations on ten benchmark datasets show that RAFT consistently outperforms contemporary baselines with an average win ratio of 86%.
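The retrieval step admits a small illustration. The following toy sketch (the function name and the plain nearest-neighbor averaging are ours; RAFT itself feeds the retrieved futures into a learned forecasting model rather than averaging them directly) shows the idea on a periodic series:

```python
import numpy as np

def retrieve_forecast(history, window, horizon, k=3):
    """Forecast `horizon` steps by averaging the futures of the k training
    windows most similar (Euclidean distance) to the last `window` points."""
    query = history[-window:]
    # candidate windows must leave room for a future of length `horizon`
    starts = np.arange(len(history) - window - horizon)
    cands = np.stack([history[s:s + window] for s in starts])
    dists = np.linalg.norm(cands - query, axis=1)
    top = starts[np.argsort(dists)[:k]]          # k most similar past windows
    futures = np.stack([history[s + window:s + window + horizon] for s in top])
    return futures.mean(axis=0)

t = np.arange(400)
series = np.sin(2 * np.pi * t / 50)              # periodic toy series
pred = retrieve_forecast(series, window=25, horizon=10)
truth = np.sin(2 * np.pi * np.arange(400, 410) / 50)
```

On this periodic toy signal the retrieved futures coincide with the true continuation, which is the inductive bias the abstract describes.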
Paperid:2479
Authors:Yunyi Shen · Hao Sun · Jean-Francois Ton
Abstract:
Building neural reward models from human preferences is a pivotal component in reinforcement learning from human feedback (RLHF) and large language model alignment research. Given the scarcity and high cost of human annotation, how to select the most informative pairs to annotate is an essential yet challenging open problem. In this work, we highlight the insight that an ideal comparison dataset for reward modeling should balance exploration of the representation space and make informative comparisons between pairs with moderate reward differences. Technically, challenges arise in quantifying the two objectives and automatically selecting comparisons to be annotated. To address this, we propose Fisher information-based selection strategies, adapting theories from the classical experimental design literature and applying them to the final linear layer of deep neural network-based reward models. Empirically, our method demonstrates remarkable performance, high computational efficiency, and stability compared to various other selection methods from both deep learning and classical statistical literature across open-source LLMs and datasets. Further ablation studies reveal that incorporating cross-prompt comparisons in active reward modeling significantly enhances labeling efficiency, shedding light on the potential for improved annotation strategies in RLHF.
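For intuition, a D-optimal-style selection rule on a linear last layer with Bradley-Terry preference probabilities can be sketched as below; the function names and the greedy single-pair rule are illustrative assumptions, not the paper's exact procedure. Note how a pair with a moderate predicted reward gap beats both a near-tie with tiny features and a near-certain comparison:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def select_pair(A, diffs, theta):
    """Pick the candidate feature difference z maximizing the D-optimal gain
    log det(A + w z z^T), where w = p(1-p) is the Bradley-Terry Fisher weight
    under current weights theta. A is the accumulated information matrix."""
    best, best_gain = None, -np.inf
    for i, z in enumerate(diffs):
        p = sigmoid(z @ theta)
        w = p * (1 - p)
        # matrix determinant lemma: det(A + w zz^T) = det(A) (1 + w z^T A^-1 z)
        gain = np.log1p(w * (z @ np.linalg.solve(A, z)))
        if gain > best_gain:
            best, best_gain = i, gain
    return best

A = np.eye(2)                       # information collected so far
theta = np.array([1.0, 0.0])        # current reward-model weights
diffs = np.array([[0.01, 0.0],      # near-tie: almost no feature difference
                  [2.0, 0.0],       # moderate reward gap
                  [10.0, 0.0],      # near-certain outcome: w ~ 0
                  [0.0, 2.0]])      # unexplored direction, p = 0.5
chosen = select_pair(A, diffs, theta)
```

The unexplored, moderate-uncertainty pair (index 3) wins, mirroring the exploration-vs-informativeness balance the abstract highlights.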
Paperid:2480
Authors:Aymeric Capitaine · Etienne Boursier · Eric Moulines · Michael Jordan · Alain Oliviero Durmus
Abstract:
The framework of uncoupled online learning in multiplayer games has made significant progress in recent years. In particular, the development of time-varying games has considerably expanded its modeling capabilities. However, current regret bounds quickly become vacuous when the game undergoes significant variations over time, even when these variations are easy to predict. Intuitively, the ability of players to forecast future payoffs should lead to tighter guarantees, yet existing approaches fail to incorporate this aspect. This work aims to fill this gap by introducing a novel prediction-aware framework for time-varying games, where agents can forecast future payoffs and adapt their strategies accordingly. In this framework, payoffs depend on an underlying state of nature that agents predict in an online manner. To leverage these predictions, we propose the POWMU algorithm, a contextual extension of the optimistic Multiplicative Weight Update algorithm, for which we establish theoretical guarantees on social welfare and convergence to equilibrium. Our results demonstrate that, under bounded prediction errors, the proposed framework achieves performance comparable to the static setting. Finally, we empirically demonstrate the effectiveness of POWMU in a traffic routing experiment.
Paperid:2481
Authors:Yuan Feng · Yukun Cao · Hairu Wang · Xike Xie · S Kevin Zhou
Abstract:
Sketches, probabilistic structures for estimating item frequencies in infinite data streams with limited space, are widely used across various domains. Recent studies have shifted the focus from handcrafted sketches to neural sketches, leveraging memory-augmented neural networks (MANNs) to enhance the streaming compression capabilities and achieve better space-accuracy trade-offs. However, existing neural sketches struggle to scale across different data domains and space budgets due to inflexible MANN configurations. In this paper, we introduce a scalable MANN architecture that brings to life the Lego sketch, a novel sketch with superior scalability and accuracy. Much like assembling creations with modular Lego bricks, the Lego sketch dynamically coordinates multiple memory bricks to adapt to various space budgets and diverse data domains. Theoretical analysis and empirical studies demonstrate its scalability and superior space-accuracy trade-offs, outperforming existing handcrafted and neural sketches.
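For readers unfamiliar with the handcrafted baseline being improved on, here is a minimal count-min sketch (our own illustrative implementation, not the Lego sketch): d hash rows of w counters each, whose frequency estimates can overcount due to collisions but never undercount:

```python
import numpy as np

class CountMin:
    """Minimal count-min sketch: add() increments one counter per row,
    query() returns the minimum counter across rows (an upper bound)."""
    def __init__(self, d=4, w=256, seed=0):
        rng = np.random.default_rng(seed)
        self.table = np.zeros((d, w), dtype=np.int64)
        self.salts = [int(s) for s in rng.integers(1, 2**31, size=d)]
        self.w = w

    def _cols(self, item):
        # one hashed column per row, salted so rows are (pseudo-)independent
        return [hash((s, item)) % self.w for s in self.salts]

    def add(self, item):
        for r, c in enumerate(self._cols(item)):
            self.table[r, c] += 1

    def query(self, item):
        return min(self.table[r, c] for r, c in enumerate(self._cols(item)))

cm = CountMin()
for _ in range(100):
    cm.add("a")
for _ in range(3):
    cm.add("b")
```

The space-accuracy trade-off is fixed by (d, w) at construction time; the inflexibility of such fixed budgets is what the modular memory bricks above address.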
Paperid:2482
Authors:Zhuofan Zong · Dongzhi Jiang · Bingqi Ma · Guanglu Song · Hao Shao · Dazhong Shen · Yu Liu · Hongsheng Li
Abstract:
Significant achievements in personalization of diffusion models have been witnessed. Conventional tuningfree methods mostly encode multiple reference images by averaging or concatenating their image embeddings as the injection condition, but such an image-independent operation cannot perform interaction among images to capture consistent visual elements within multiple references. Although tuning-based approaches can effectively extract consistent elements within multiple images through the training process, it necessitates test-time finetuning for each distinct image group. This paper introduces EasyRef, a plug-and-play adaption method that empowers diffusion models to condition consistent visual elements (e.g., style and human facial identity, etc.) across multiple reference images under instruction controls. To effectively exploit consistent visual elements within multiple images, we leverage the multi-image comprehension and instruction-following capabilities of the multimodal large language model (MLLM), prompting it to capture consistent visual elements based on the instruction. Besides, injecting the MLLM's representations into the diffusion process through adapters can easily generalize to unseen domains. To mitigate computational costs and enhance fine-grained detail preservation, we introduce an efficient reference aggregation strategy and a progressive training scheme. Finally, we introduce MRBench, a new multi-reference image generation benchmark. Experimental results demonstrate EasyRef surpasses both tuning-free and tuning-based methods, achieving superior aesthetic quality and robust zero-shot generalization across diverse domains.
Paperid:2483
Authors:Alexander Atanasov · Jacob A Zavatone-Veth · Cengiz Pehlevan
Abstract:
Recent years have seen substantial advances in our understanding of high-dimensional ridge regression, but existing theories assume that training examples are independent. By leveraging techniques from random matrix theory and free probability, we provide sharp asymptotics for the in- and out-of-sample risks of ridge regression when the data points have arbitrary correlations. We demonstrate that in this setting, the generalized cross-validation (GCV) estimator fails to correctly predict the out-of-sample risk. However, in the case where the noise residuals have the same correlations as the data points, one can modify the GCV to yield an efficiently-computable unbiased estimator that concentrates in the high-dimensional limit, which we dub CorrGCV. We further extend our asymptotic analysis to the case where the test point has nontrivial correlations with the training set, a setting often encountered in time series forecasting. Assuming knowledge of the correlation structure of the time series, this again yields an extension of the GCV estimator, and sharply characterizes the degree to which such test points yield an overly optimistic prediction of long-time risk. We validate the predictions of our theory across a variety of high-dimensional datasets.
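The estimator being corrected is easy to write down. A sketch of the standard GCV score for ridge regression under i.i.d. rows (the setting where it is valid; the sizes and seed below are toy choices, and CorrGCV's correction for correlated rows is not shown):

```python
import numpy as np

def gcv(X, y, lam):
    """Standard generalized cross-validation score for ridge regression:
    GCV(lam) = (||(I - S) y||^2 / n) / (1 - tr(S)/n)^2,
    with hat matrix S = X (X^T X + lam I)^-1 X^T."""
    n, d = X.shape
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
    resid = y - S @ y
    return (resid @ resid / n) / (1 - np.trace(S) / n) ** 2

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.normal(size=(n, d))            # independent rows: GCV's home turf
beta = rng.normal(size=d)
y = X @ beta + rng.normal(size=n)      # unit-variance noise
lams = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = [gcv(X, y, lam) for lam in lams]
```

With a strong signal, heavy shrinkage (large lam) inflates the score, so minimizing GCV over lam tracks the out-of-sample risk here; the abstract's point is that this proxy breaks when rows are correlated.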
Paperid:2484
Authors:Drew Prinster · Xing Han · Anqi Liu · Suchi Saria
Abstract:
Responsibly deploying artificial intelligence (AI) / machine learning (ML) systems in high-stakes settings arguably requires not only proof of system reliability, but moreover continual, post-deployment monitoring to quickly detect and address any unsafe behavior. Statistical methods for nonparametric change-point detection---especially the tools of conformal test martingales (CTMs) and anytime-valid inference---offer promising approaches to this monitoring task. However, existing methods are restricted to monitoring limited hypothesis classes or ``alarm criteria,'' such as data shifts that violate certain exchangeability assumptions, or do not allow for online adaptation in response to shifts. In this paper, we expand the scope of these monitoring methods by proposing a weighted generalization of conformal test martingales (WCTMs), which lay a theoretical foundation for online monitoring for any unexpected change-points in the data distribution while controlling false alarms. For practical applications, we propose specific WCTM algorithms that accommodate online adaptation to mild covariate shifts (in the marginal input distribution) while raising alarms in response to more severe shifts, such as concept shifts (in the conditional label distribution) or extreme (out-of-support) covariate shifts that cannot be easily adapted to. On real-world datasets, we demonstrate improved performance relative to state-of-the-art baselines.
Paperid:2485
Authors:Thibaut Issenhuth · Sangchul Lee · Ludovic Dos Santos · Jean-Yves Franceschi · Chansoo Kim · alain rakotomamonjy
Abstract:
Consistency models imitate the multi-step sampling of score-based diffusion in a single forward pass of a neural network. They can be learned in two ways: consistency distillation and consistency training. The former relies on the true velocity field of the corresponding differential equation, approximated by a pre-trained neural network. In contrast, the latter uses a single-sample Monte Carlo estimate of this velocity field. The related estimation error induces a discrepancy between consistency distillation and training that, we show, still holds in the continuous-time limit. To alleviate this issue, we propose a novel flow that transports noisy data towards their corresponding outputs derived from a consistency model. We prove that this flow reduces the previously identified discrepancy and the noise-data transport cost. Consequently, our method not only accelerates consistency training convergence but also enhances its overall performance.
Paperid:2486
Authors:Weilin Cai · Juyong Jiang · Le Qin · junweicui · Sunghun Kim · Jiayi Huang
Abstract:
Expert parallelism has emerged as a key strategy for distributing the computational workload of sparsely-gated mixture-of-experts (MoE) models across multiple devices, enabling the processing of increasingly large-scale models. However, the All-to-All communication inherent to expert parallelism poses a significant bottleneck, limiting the efficiency of MoE models. While existing optimization methods partially mitigate this issue, they remain constrained by the sequential dependency between communication and computation operations. To address this challenge, we propose ScMoE, a novel shortcut-connected MoE architecture integrated with an overlapping parallelization strategy. ScMoE decouples communication from its conventional sequential ordering, enabling up to 100% overlap with computation. Compared to the prevalent top-2 MoE baseline, ScMoE achieves speedups of 1.49$\times$ in training and 1.82$\times$ in inference. Moreover, our experiments and analyses indicate that ScMoE not only achieves model quality comparable to existing approaches but in some instances surpasses them.
Paperid:2487
Authors:Hongrui Peng · Haolang Lu · Kun Wang · Guoshun Nan · Yuanlong Yu · WeiYe Fu
Abstract:
Knowledge graphs (KGs) are ubiquitous in numerous real-world applications, and watermarking facilitates protecting intellectual property and preventing potential harm from AI-generated content. Existing watermarking methods mainly focus on static plain text or image data, while they can hardly be applied to dynamic graphs due to spatial and temporal variations of structured data. This motivates us to propose KGMark, the first graph watermarking framework that aims to generate robust, detectable, and transparent diffusion fingerprints for dynamic KG data. Specifically, we propose a novel clustering-based alignment method to adapt the watermark to spatial variations. Meanwhile, we present a redundant embedding strategy to harden the diffusion watermark against various attacks, facilitating the robustness of the watermark to the temporal variations. Additionally, we introduce a novel learnable mask matrix to improve the transparency of diffusion fingerprints. By doing so, our KGMark properly tackles the variation challenges of structured data. Experiments on various public benchmarks show the effectiveness of our proposed KGMark.
Paperid:2488
Authors:Alex Kokot · Octavian-Vlad Murad · Marina Meila
Abstract:
In this paper, we clarify the effect of noise on common spectrally-motivated algorithms such as Diffusion Maps (DM) for dimension reduction. Empirically, these methods are much more robust to noise than current work suggests. Specifically, existing consistency results require that either the noise amplitude or dimensionality must vary with the sample size $n$. We provide new theoretical results demonstrating that low-frequency eigenpairs reliably capture the geometry of the underlying manifold under a constant noise level, up to a dimension-independent threshold $O(r^{-2})$, where $r$ is the noise amplitude. Our results rely on a decomposition of the manifold Laplacian in the Sasaki metric, a technique not used before in this area, to our knowledge. We experimentally validate our theoretical predictions. Additionally, we observe similar robust behavior for other manifold learning algorithms which are not based on computing the Laplacian, namely LTSA and VAE.
Paperid:2489
Authors:Ali Behrouz · Ali Parviz · Mahdi Karami · Clayton Sanford · Bryan Perozzi · Vahab Mirrokni
Abstract:
Modern sequence models (e.g., Transformers and linear RNNs) emerged as dominant backbones of recent deep learning frameworks, mainly due to their efficiency, representational power, and/or ability to capture long-range dependencies. Recently, adopting these sequence models for graph-structured data has gained popularity as the alternative to Message Passing Neural Networks (MPNNs). There is, however, a lack of a common foundation about what constitutes a good graph sequence model, and a mathematical description of the benefits and deficiencies in adopting different sequence models for learning on graphs. To this end, we introduce the Graph Sequence Model (GSM), a unifying framework for applying sequence models to graph data. The GSM framework allows us to understand, evaluate, and compare the power of different sequence model backbones in graph tasks. Building on this insight, we propose GSM++, a fast hybrid model that hierarchically tokenizes the graph using Hierarchical Affinity Clustering (HAC) and then encodes these sequences via a hybrid architecture. The theoretical and experimental findings confirm that GSM++ outperforms baseline models on most benchmarks.
Paperid:2490
Authors:Wanli Hong · Yuliang Shi · Jonathan Niles-Weed
Abstract:
Motivated by applications in trajectory inference and particle tracking, we introduce Smooth Schrödinger Bridges. Our proposal generalizes prior work by allowing the reference process in the Schrödinger Bridge problem to be a smooth Gaussian process, leading to more regular and interpretable trajectories in applications. Though naïvely smoothing the reference process leads to a computationally intractable problem, we identify a class of processes for which the resulting Smooth Schrödinger Bridge problem can be lifted to a simpler problem on phase space, which can be solved in polynomial time. We develop a practical approximation of this algorithm that outperforms existing methods by 2-10x on numerous simulated and real single-cell RNA-seq datasets.
Paperid:2491
Authors:Sudeepta Mondal · Zhuolin Jiang · Ganesh Sundaramoorthi
Abstract:
We present a theory for the construction of out-of-distribution (OOD) detection features in neural networks. We introduce random features for OOD detection and formulate a novel information-theoretic loss functional consisting of two terms: the first, based on the KL divergence, separates the resulting feature distributions under in- and out-of-distribution data, and the second is the information bottleneck, which favors compressed features that retain the OOD information. We formulate a variational optimization procedure to derive OOD features. We show that, based on underlying assumptions on the OOD distributions and the random features, one can recover properties of shaping functions present in existing state-of-the-art methods for OOD detection. Furthermore, we show that our theory can predict new shaping functions that lead to state-of-the-art results on OOD benchmarks. The theory provides a general framework for constructing a variety of new features with clear interpretability.
Paperid:2492
Authors:Leo Brunswic · Mateo Clémente · Rui Heng Yang · Adam Sigal · Amir Rasouli · Yinchuan Li
Abstract:
Generative Flow Networks (GFNs) were initially introduced on directed acyclic graphs to sample from an unnormalized distribution density. Recent works have extended the theoretical framework for generative methods allowing more flexibility and enhancing application range. However, many challenges remain in training GFNs in continuous settings and for imitation learning (IL), including intractability of flow-matching loss, limited tests of non-acyclic training, and the need for a separate reward model in imitation learning. The present work proposes a family of generative flows called Ergodic Generative Flows (EGFs) which are used to address the aforementioned issues. First, we leverage ergodicity to build simple generative flows with finitely many globally defined transformations (diffeomorphisms) with universality guarantees and tractable flow-matching loss (FM loss). Second, we introduce a new loss involving cross-entropy coupled to weak flow-matching control, coined KL-weakFM loss. It is designed for IL training without a separate reward model. We evaluate IL-EGFs on toy 2D tasks and real-world datasets from NASA on the sphere, using the KL-weakFM loss. Additionally, we conduct toy 2D reinforcement learning experiments with a target reward, using the FM loss.
Paperid:2493
Authors:Kulin Shah · Alkis Kalavasis · Adam Klivans · Giannis Daras
Abstract:
There is strong empirical evidence that the state-of-the-art diffusion modeling paradigm leads to models that memorize the training set, especially when the training set is small. Prior methods to mitigate the memorization problem often lead to a decrease in image quality. Is it possible to obtain strong and creative generative models, i.e., models that achieve high generation quality and low memorization? Despite the current pessimistic landscape of results, we make significant progress in pushing the trade-off between fidelity and memorization. We first provide theoretical evidence that memorization in diffusion models is only necessary for denoising problems at low noise scales (usually used in generating high-frequency details). Using this theoretical insight, we propose a simple, principled method to train the diffusion models using noisy data at large noise scales. We show that our method significantly reduces memorization without decreasing the image quality, for both text-conditional and unconditional models and for a variety of data availability settings.
Paperid:2494
Authors:Zihan Guan · Mengxuan Hu · Ronghang Zhu · Sheng Li · Anil Vullikanti
Abstract:
Recent studies have uncovered a troubling vulnerability in the fine-tuning stage of large language models (LLMs): even fine-tuning on entirely benign datasets can lead to a significant increase in the harmfulness of LLM outputs. Building on this finding, our red teaming study takes this threat one step further by developing a more effective attack. Specifically, we analyze and identify samples within benign datasets that contribute most to safety degradation, then fine-tune LLMs exclusively on these samples. We approach this problem from an outlier detection perspective and propose Self-Inf-N to detect and extract outliers for fine-tuning. Our findings reveal that fine-tuning LLMs on 100 outlier samples selected by Self-Inf-N in the benign datasets severely compromises LLM safety alignment. Extensive experiments across \textbf{seven} mainstream LLMs demonstrate that our attack exhibits high transferability across different architectures and remains effective in practical scenarios. Alarmingly, our results indicate that most existing mitigation strategies fail to defend against this attack, underscoring the urgent need for more robust alignment safeguards.
Paperid:2495
Authors:Zhiyuan Yan · Jiangming Wang · Peng Jin · Ke-Yue Zhang · Chengchun Liu · Shen Chen · Taiping Yao · Shouhong Ding · Baoyuan Wu · Li Yuan
Abstract:
AI-generated images (AIGIs), such as natural or face images, have become increasingly realistic and indistinguishable, making their detection a critical and pressing challenge. In this paper, we start from a new perspective to excavate the reason behind the generalization failure in AIGI detection, named the asymmetry phenomenon, where a naively trained detector tends to overfit to the limited and monotonous fake patterns, causing the feature space to become highly constrained and low-ranked, which we prove seriously limits expressivity and generalization. One potential remedy is incorporating the pre-trained knowledge within vision foundation models (higher-ranked) to expand the feature space, alleviating the model's overfitting to fake patterns. To this end, we employ Singular Value Decomposition (SVD) to decompose the original feature space into two orthogonal subspaces. By freezing the principal components and adapting only the remaining components, we preserve the pre-trained knowledge while learning forgery-related patterns. Compared to existing full-parameter and LoRA-based tuning methods, we explicitly ensure orthogonality, enabling a higher rank of the whole feature space and effectively minimizing overfitting while enhancing generalization. Extensive experiments, together with our deep analysis, on both deepfake and synthetic image detection benchmarks demonstrate superior generalization performance in detection.
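The freeze-principal/adapt-residual idea can be illustrated in a few lines. This numpy sketch (toy sizes and a random stand-in "update", not the paper's training code) checks that an update confined to the minor singular directions stays orthogonal to the frozen principal subspace:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))                   # pre-trained weight (toy)
U, s, Vt = np.linalg.svd(W)
r = 48                                          # principal components to freeze
W_frozen = U[:, :r] @ np.diag(s[:r]) @ Vt[:r]   # preserved pre-trained part

# hypothetical "learned" update, confined to the remaining components
delta = rng.normal(size=(64 - r, 64 - r)) * 0.1
W_adapt = U[:, r:] @ delta @ Vt[r:]
W_new = W_frozen + W_adapt

# the update is exactly orthogonal to the frozen principal subspace
ok = np.abs(U[:, :r].T @ W_adapt).max() < 1e-10
```

Because `U` has orthonormal columns, the adapted part cannot leak into (and overwrite) the principal directions, which is the mechanism behind preserving pre-trained knowledge here.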
Paperid:2496
Authors:Dominik Fuchsgruber · Tom Wollschläger · Johannes Bordne · Stephan Günnemann
Abstract:
While uncertainty estimation for graphs recently gained traction, most methods rely on homophily and deteriorate in heterophilic settings. We address this by analyzing message passing neural networks from an information-theoretic perspective and developing a suitable analog of the data processing inequality to quantify information throughout the model's layers. In contrast to non-graph domains, information about the node-level prediction target can increase with model depth if a node's features are semantically different from its neighbors. Therefore, on heterophilic graphs, the latent embeddings of an MPNN each provide different information about the data distribution, unlike in homophilic settings. This reveals that considering all node representations simultaneously is a key design principle for epistemic uncertainty estimation on graphs beyond homophily. We empirically confirm this with a simple post-hoc density estimator on the joint node embedding space that provides state-of-the-art uncertainty on heterophilic graphs. At the same time, it matches prior work on homophilic graphs without explicitly exploiting homophily through post-processing.
Paperid:2497
Authors:Erik Jones · Anca Dragan · Jacob Steinhardt
Abstract:
Developers try to evaluate whether an AI system can accomplish malicious tasks before releasing it; for example, they might test whether a model enables cyberoffense, user manipulation, or bioterrorism. In this work, we show that individually testing models for such misuse is inadequate; adversaries can misuse combinations of models even when each individual model is safe. The adversary accomplishes this by first decomposing tasks into subtasks, then solving each subtask with the best-suited model. For example, an adversary might solve challenging-but-benign subtasks with an aligned frontier model, and easy-but-malicious subtasks with a weaker misaligned model. We study two decomposition methods: manual decomposition where a human identifies a natural decomposition of a task, and automated decomposition where a weak model generates benign tasks for a frontier model to solve, then uses the solutions in-context to solve the original task. Using these decompositions, we empirically show that adversaries can create vulnerable code, explicit images, python scripts for hacking, and manipulative tweets at much higher rates with combinations of models than either individual model. Our work suggests that even perfectly-aligned frontier systems enable misuse without ever producing malicious outputs, and that red-teaming efforts should extend beyond single models in isolation.
Paperid:2498
Authors:Haoxiong Liu · Jiacheng Sun · Zhenguo Li · Andrew Yao
Abstract:
The synergy between deep learning models and traditional automation tools plays a pivotal role in developing robust neural theorem provers (NTPs). However, for proof synthesis with LLMs, previous work applies automation tools either only when the model explicitly calls the method, or only at a single granularity level, failing to fully exploit the power of built-in tactics and off-the-shelf automated theorem provers. In this work, we propose ProofAug, a novel theorem proving method that enjoys superior sample efficiency through equipping proof-generation LLMs with automation methods at different granularities via fine-grained structure analysis of model-generated proof proposals. Furthermore, ProofAug serves as a versatile plug-and-play module that seamlessly integrates with any tree-search algorithm, enabling our construction of an efficient recursive proving (ERP) module to further enhance performance. The superiority of our method is validated on the miniF2F-test benchmark using the open-source deepseek-math-7b-base model and the Isabelle proof assistant. Notably, by additionally employing a mixed prompting strategy, we achieve a cumulative pass rate of 66.0% after curation of the dataset (61.9% for the original version), setting a new SOTA across all proof languages with a total sample budget of only 2100.
Paperid:2499
Authors:Yilin Wang · Peixuan Lei · chen tao · Jie Song · Haoyuzhe · Yuxuan Zhang · LEI JIA · Yuanxiang Li · Zhongyu Wei
Abstract:
Time-series data are critical in diverse applications, such as industrial monitoring, medical diagnostics, and climate research. However, effectively integrating these high-dimensional temporal signals with natural language for dynamic, interactive tasks remains a significant challenge. To address this, we introduce the Time-Series Question Answering (Time-Series QA) task and release EngineMT-QA, the first large-scale, multi-task, temporal-textual QA dataset designed to capture complex interactions between time-series signals and natural language. Building on this resource, we propose the Instruct Time Transformer (ITFormer), a novel framework that bridges time-series encoders with frozen large language models (LLMs). ITFormer effectively extracts, aligns, and fuses temporal and textual features, achieving a strong improvement in QA accuracy over strong baselines with fewer than 1\% additional trainable parameters. By combining computational efficiency with robust cross-modal modeling, our work establishes an adaptable paradigm for integrating temporal data with natural language, paving the way for new research and applications in multi-modal AI.
Paperid:2500
Authors:Zeen Song · Siyu Zhao · Xingyu Zhang · Jiangmeng Li · Changwen Zheng · Wenwen Qiang
Abstract:
Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success, but its performance can degrade when fine-tuned in out-of-distribution (OOD) scenarios. We model the prediction process using a Structural Causal Model (SCM) and show that the causal mechanism involving both invariant and variant factors in training environments differs from that in test environments. In contrast, the causal mechanism with solely invariant factors remains consistent across environments. We theoretically prove the existence of a linear mapping from CLIP embeddings to invariant factors, which can be estimated using interventional data. Additionally, we provide a condition to guarantee low OOD risk of the invariant predictor. Based on these insights, we propose the Invariant Causal Mechanism of CLIP (CLIP-ICM) framework. CLIP-ICM involves collecting interventional data, estimating a linear projection matrix, and making predictions within the invariant subspace. Experiments on several OOD datasets show that CLIP-ICM significantly improves the performance of CLIP. Our method offers a simple but powerful enhancement, boosting the reliability of CLIP in real-world applications.
Paperid:2501
Authors:Julius Von Rohrscheidt · Bastian Rieck
Abstract:
The Euler Characteristic Transform (ECT) is an efficiently computable geometrical-topological invariant that characterizes the global shape of data. In this paper, we introduce the local Euler Characteristic Transform (l-ECT), a novel extension of the ECT designed to enhance expressivity and interpretability in graph representation learning. Unlike traditional Graph Neural Networks (GNNs), which may lose critical local details through aggregation, the l-ECT provides a lossless representation of local neighborhoods. This approach addresses key limitations in GNNs by preserving nuanced local structures while maintaining global interpretability. Moreover, we construct a rotation-invariant metric based on l-ECTs for spatial alignment of data spaces. Our method demonstrates superior performance compared to standard GNNs on various benchmarking node classification tasks, while also offering theoretical guarantees of its effectiveness.
Paperid:2502
Authors:François Bachoc · Tommaso Cesari · Roberto Colomboni
Abstract:
We study the role of contextual information in the online learning problem of brokerage between traders. In this sequential problem, at each time step, two traders arrive with secret valuations about an asset they wish to trade. The learner (a broker) suggests a trading (or brokerage) price based on contextual data about the asset and the market conditions. Then, the traders reveal their willingness to buy or sell based on whether their valuations are higher or lower than the brokerage price. A trade occurs if one of the two traders decides to buy and the other to sell, i.e., if the broker's proposed price falls between the smallest and the largest of their two valuations. We design algorithms for this problem and prove optimal theoretical regret guarantees under various standard assumptions.
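The trade rule in this abstract can be sketched in a few lines (an illustration of the described mechanism only, not the authors' learning algorithm; the function name is ours):

```python
def trade_occurs(price: float, valuation_a: float, valuation_b: float) -> bool:
    """A trade happens iff the broker's posted price falls between the two
    secret valuations, so one trader is willing to buy and the other to sell."""
    lo, hi = sorted((valuation_a, valuation_b))
    return lo <= price <= hi
```

Note that the broker never observes the valuations themselves, only each trader's buy/sell indicator, which is what makes the feedback partial.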
Paperid:2503
Authors:Yonatan Sommer · Ivri Hikri · lotan amit · Nir Rosenfeld
Abstract:
When learning is used to inform decisions about humans, such as for loans, hiring, or admissions, this can incentivize users to strategically modify their features to obtain positive predictions. A key assumption is that modifications are costly, and are governed by a cost function that is exogenous and predetermined. We challenge this assumption, and assert that the deployment of a classifier is what creates costs. Our idea is simple: when users seek positive predictions, this creates demand for important features; and if features are available for purchase, then a market will form, and competition will give rise to prices. We extend the strategic classification framework to support this notion, and study learning in a setting where a classifier can induce a market for features. We present an analysis of the learning task, devise an algorithm for computing market prices, propose a differentiable learning framework, and conduct experiments to explore our novel setting and approach.
Paperid:2504
Authors:Victor Dheur · Matteo Fontana · Yorick Estievenart · Naomi Desobry · Souhaib Ben Taieb
Abstract:
Conformal prediction provides a powerful framework for constructing distribution-free prediction regions with finite-sample coverage guarantees. While extensively studied in univariate settings, its extension to multi-output problems presents additional challenges, including complex output dependencies and high computational costs, and remains relatively underexplored. In this work, we present a unified comparative study of nine conformal methods with different multivariate base models for constructing multivariate prediction regions within the same framework. This study highlights their key properties while also exploring the connections between them. Additionally, we introduce two novel classes of conformity scores for multi-output regression that generalize their univariate counterparts. These scores ensure asymptotic conditional coverage while maintaining exact finite-sample marginal coverage. One class is compatible with any generative model, offering broad applicability, while the other is computationally efficient, leveraging the properties of invertible generative models. Finally, we conduct a comprehensive empirical evaluation across 13 tabular datasets, comparing all the multi-output conformal methods explored in this work. To ensure a fair and consistent comparison, all methods are implemented within a unified code base.
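For context, the univariate split-conformal recipe that such multi-output scores generalize can be sketched as follows (a standard textbook construction, not the paper's method; parameter names are illustrative):

```python
import numpy as np

def split_conformal_interval(residuals_cal, y_pred_test, alpha=0.1):
    """Split conformal prediction for univariate regression: calibrate on
    absolute residuals from a held-out set, then return an interval whose
    finite-sample marginal coverage is at least 1 - alpha under
    exchangeability of calibration and test points."""
    n = len(residuals_cal)
    # finite-sample corrected quantile level: ceil((n + 1) * (1 - alpha)) / n
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(np.abs(residuals_cal), level, method="higher")
    return y_pred_test - q, y_pred_test + q
```

The conformity score here is the absolute residual; the paper's contribution lies in multivariate generalizations of such scores.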
Paperid:2505
Authors:Li gengluo · Huawen Shen · Yu ZHOU
Abstract:
Chinese scene text retrieval is a practical task that aims to search for images containing visual instances of a Chinese query text. This task is extremely challenging because Chinese text often features complex and diverse layouts in real-world scenes. Current efforts tend to inherit the solution for English scene text retrieval, failing to achieve satisfactory performance. In this paper, we establish a Diversified Layout benchmark for Chinese Street View Text Retrieval (DL-CSVTR), which is specifically designed to evaluate retrieval performance across various text layouts, including vertical, cross-line, and partial alignments. To address the limitations in existing methods, we propose Chinese Scene Text Retrieval CLIP (CSTR-CLIP), a novel model that integrates global visual information with multi-granularity alignment training. CSTR-CLIP applies a two-stage training process to overcome previous limitations, such as the exclusion of visual features outside the text region and reliance on single-granularity alignment, thereby enabling the model to effectively handle diverse text layouts. Experiments on the existing benchmark show that CSTR-CLIP outperforms the previous state-of-the-art model by 18.82% in accuracy while also providing faster inference speed. Further analysis on DL-CSVTR confirms the superior performance of CSTR-CLIP in handling various text layouts. The dataset and code will be publicly available to facilitate research in Chinese scene text retrieval.
Paperid:2506
Authors:Yuanchen Wu · Ke Yan · Shouhong Ding · Ziyin Zhou · Xiaoqiang Li
Abstract:
Large Vision-Language Models (LVLMs) have manifested strong visual question answering capability. However, they still struggle to align the rationale and the generated answer, leading to inconsistent reasoning and incorrect responses. To this end, this paper introduces the Self-Rationale Calibration (SRC) framework to iteratively calibrate the alignment between rationales and answers. SRC begins by employing a lightweight “rationale fine-tuning” approach, which modifies the model’s response format to require a rationale before deriving the answer, without explicit prompts. Next, SRC searches a diverse set of candidate responses from the fine-tuned LVLMs for each sample, followed by a proposed pairwise scoring strategy using a tailored scoring model, R-Scorer, to evaluate both the rationale quality and the factual consistency of candidates. Based on a confidence-weighted preference curation process, SRC decouples the alignment calibration into a preference fine-tuning manner, leading to significant improvements of LVLMs in perception, reasoning, and generalization across multiple benchmarks. Our results emphasize the rationale-oriented alignment in exploring the potential of LVLMs.
Paperid:2507
Authors:Beom Jin Kang · NamJoon Kim · Hyun Kim
Abstract:
Recently, transformer-based models have demonstrated state-of-the-art performance across various computer vision tasks, including image classification, detection, and segmentation. However, their substantial parameter count poses significant challenges for deployment in resource-constrained environments such as edge or mobile devices. Low-rank approximation (LRA) has emerged as a promising model compression technique, effectively reducing the number of parameters in transformer models by decomposing high-dimensional weight matrices into low-rank representations. Nevertheless, matrix decomposition inherently introduces information loss, often leading to a decline in model accuracy. Furthermore, existing studies on LRA largely overlook the quantization process, which is a critical step in deploying practical vision transformer (ViT) models. To address these challenges, we propose a robust LRA framework that preserves weight information after matrix decomposition and incorporates quantization tailored to LRA characteristics. First, we introduce a reparameterizable branch-based low-rank approximation (RB-LRA) method coupled with weight reconstruction to minimize information loss during matrix decomposition. Subsequently, we enhance model accuracy by integrating RB-LRA with knowledge distillation techniques. Lastly, we present an LRA-aware quantization method designed to mitigate the large outliers generated by LRA, thereby improving the robustness of the quantized model. To validate the effectiveness of our approach, we conducted extensive experiments on the ImageNet dataset using various ViT-based models. Notably, the Swin-B model with RB-LRA achieved a 31.8\% reduction in parameters and a 30.4\% reduction in GFLOPs, with only a 0.03\% drop in accuracy. Furthermore, incorporating the proposed LRA-aware quantization method reduced accuracy loss by an additional 0.83\% compared to naive quantization.
Paperid:2508
Authors:Canzhe Zhao · Yutian Cheng · Jing Dong · Baoxiang Wang · Shuai Li
Abstract:
We investigate learning approximate Nash equilibrium (NE) policy profiles in two-player zero-sum imperfect information extensive-form games (IIEFGs) with last-iterate convergence guarantees. Existing algorithms either rely on full-information feedback or provide only asymptotic convergence rates. In contrast, we focus on the bandit feedback setting, where players receive feedback solely from the rewards associated with the experienced information set and action pairs in each episode. Our proposed algorithm employs a negentropy regularizer weighted by a "virtual transition" over the information set-action space. Through a carefully designed virtual transition and leveraging the entropy regularization technique, we demonstrate finite-time last-iterate convergence to the NE at a rate of $\widetilde{\mathcal{O}}(k^{-\frac{1}{8}})$ under bandit feedback, where $k$ is the episode index. Additionally, our algorithm admits an efficient closed-form solution for approximate policy updates. Empirical evaluations across various IIEFG instances show its competitive performance compared to baseline methods.
Paperid:2509
Authors:Timo Kaiser · Thomas Norrenbrock · Bodo Rosenhahn
Abstract:
The introduction of the Segment Anything Model (SAM) has paved the way for numerous semantic segmentation applications. For several tasks, quantifying the uncertainty of SAM is of particular interest. However, the ambiguous nature of the class-agnostic foundation model SAM challenges current uncertainty quantification (UQ) approaches. This paper presents a theoretically motivated uncertainty quantification model based on a Bayesian entropy formulation jointly respecting aleatoric, epistemic, and the newly introduced task uncertainty. We use this formulation to train USAM, a lightweight post-hoc UQ method. Our model traces the root of uncertainty back to under-parameterised models, insufficient prompts or image ambiguities. Our proposed deterministic USAM demonstrates superior predictive capabilities on the SA-V, MOSE, ADE20k, DAVIS, and COCO datasets, offering a computationally cheap and easy-to-use UQ alternative that can support user-prompting, enhance semi-supervised pipelines, or balance the tradeoff between accuracy and cost efficiency.
Paperid:2510
Authors:Yunhong Lu · Qichao Wang · Hengyuan Cao · Xiaoyin Xu · Min Zhang
Abstract:
Direct Preference Optimization (DPO) aligns text-to-image (T2I) generation models with human preferences using pairwise preference data. Although substantial resources are expended in collecting and labeling datasets, a critical aspect is often neglected: preferences vary across individuals and should be represented with more granularity. To address this, we propose SmPO-Diffusion, a novel method for modeling preference distributions to improve the DPO objective, along with a numerical upper bound estimation for the diffusion optimization objective. First, we introduce a smoothed preference distribution to replace the original binary distribution. We employ a reward model to simulate human preferences and apply preference likelihood averaging to improve the DPO loss, such that the loss function approaches zero when preferences are similar. Furthermore, we utilize an inversion technique to simulate the trajectory preference distribution of the diffusion model, enabling more accurate alignment with the optimization objective. Our approach effectively mitigates issues of excessive optimization and objective misalignment present in existing methods through straightforward modifications. Experimental results demonstrate that our method achieves state-of-the-art performance in offline preference alignment, surpassing baselines in human preference evaluation tasks across various metrics, while significantly reducing the consumption of training resources.
Paperid:2511
Authors:Hongyi Ling · Haiyang Yu · Zhimeng Jiang · Na Zou · Shuiwang Ji
Abstract:
We consider explainability in equivariant graph neural networks for 3D geometric graphs. While many XAI methods have been developed for analyzing graph neural networks, they predominantly target 2D graph structures. The complex nature of 3D data and the sophisticated architectures of equivariant GNNs present unique challenges. Current XAI techniques either struggle to adapt to equivariant GNNs or fail to effectively handle positional data and evaluate the significance of geometric features adequately. To address these challenges, we introduce a novel method, known as EquiGX, which uses the Deep Taylor decomposition framework to extend the layer-wise relevance propagation rules tailored for spherical equivariant GNNs. Our approach decomposes prediction scores and back-propagates the relevance scores through each layer to the input space. Our decomposition rules provide a detailed explanation of each layer’s contribution to the network’s predictions, thereby enhancing our understanding of how geometric and positional data influence the model’s outputs. Through experiments on both synthetic and real-world datasets, our method demonstrates its capability to identify critical geometric structures and outperform alternative baselines. These results indicate that our method provides significantly enhanced explanations for equivariant GNNs.
Paperid:2512
Authors:Wenhao Zhao · Qiushui Xu · Linjie Xu · Lei Song · Jinyu Wang · Chunlai Zhou · Jiang Bian
Abstract:
Recently, incorporating knowledge from pre-trained language models (PLMs) into decision transformers (DTs) has attracted significant attention in offline reinforcement learning (RL). These PLMs perform well in RL tasks, raising an intriguing question: what kind of knowledge from PLMs has been transferred to RL to achieve such good results? This work first dives into this problem by analyzing each attention head quantitatively and identifies the Markov head, a crucial component that exists in the attention heads of PLMs. It leads to extreme attention on the last input token and performs well only in short-term environments. Furthermore, we prove that this extreme attention cannot be changed by re-training the embedding layer or by fine-tuning. Inspired by our analysis, we propose a general method, GPT-DTMA, which equips a pretrained DT with a Mixture of Attention (MoA) to enable adaptive learning and accommodate diverse attention requirements during fine-tuning. Extensive experiments demonstrate the effectiveness of GPT-DTMA: it achieves superior performance in short-term environments compared to baselines, significantly reduces the performance gap of PLMs in long-term scenarios, and the experimental results also validate our theorems.
Paperid:2513
Authors:Mustafa Burak Gurbuz · Xingyu Zheng · Constantine Dovrolis
Abstract:
As deep learning continues to be driven by ever-larger datasets, understanding which examples are most important for generalization has become a critical question. While progress in data selection continues, emerging applications require studying this problem in dynamic contexts. To bridge this gap, we pose the Incremental Data Selection (IDS) problem, where examples arrive as a continuous stream, and need to be selected without access to the full data source. In this setting, the learner must incrementally build a training dataset of predefined size while simultaneously learning the underlying task. We find that in IDS, the impact of a new sample on the model state depends fundamentally on both its geometric relationship in the feature space and its prediction error. Leveraging this insight, we propose PEAKS (Prediction Error Anchored by Kernel Similarity), an efficient data selection method tailored for IDS. Our comprehensive evaluations demonstrate that PEAKS consistently outperforms existing selection strategies. Furthermore, PEAKS yields increasingly better performance returns than random selection as training data size grows on real-world datasets.
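The insight that a sample's value depends on both its prediction error and its geometric relationship to already-selected data could be scored, for intuition, along these lines (a hypothetical sketch; the exact PEAKS formula is not given in the abstract and this combination is ours):

```python
import numpy as np

def peaks_style_score(feat_new, err_new, feats_selected, gamma=1.0):
    """Illustrative scoring in the spirit of PEAKS: favor stream samples
    with high prediction error that are dissimilar (low RBF kernel
    similarity) to already-kept samples, so redundant points score low."""
    if len(feats_selected) == 0:
        return err_new
    d2 = ((feats_selected - feat_new) ** 2).sum(axis=1)
    sim = np.exp(-gamma * d2).max()   # similarity to the nearest kept sample
    return err_new * (1.0 - sim)      # anchor the error by redundancy
```

Under this toy score, a novel high-error point outranks an exact duplicate of a kept point, matching the qualitative behavior the abstract describes.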
Paperid:2514
Authors:Carlos Stein Naves de Brito
Abstract:
Model regularization requires extensive manual tuning to balance complexity against overfitting. Cross-regularization resolves this tradeoff by computing validation gradients that directly adapt regularization parameters during training. The method splits parameter optimization: training data guides feature learning while validation data shapes complexity controls, converging provably to cross-validation optima with computational cost scaling only in the regularization dimension. When implemented through noise injection in neural networks, this approach reveals striking patterns: unexpectedly high noise tolerance and architecture-specific regularization that emerges organically during training. Beyond complexity control, the framework integrates seamlessly with data augmentation and uncertainty calibration while maintaining single-run efficiency through a simple gradient-based approach.
Paperid:2515
Authors:Cheng Tang · Zhishuai Liu · Pan Xu
Abstract:
The Robust Regularized Markov Decision Process (RRMDP) is a recently proposed framework for learning policies robust to dynamics shifts by adding a regularization term on the transition dynamics in the value function. Nevertheless, existing RRMDP methods often rely on unstructured regularization, which can lead to overly conservative policies that consider unrealistic transitions. To address these limitations, we propose a novel framework, the $d$-rectangular linear robust regularized Markov decision process ($d$-RRMDP), which introduces a linear latent structure into both transition kernels and regularization. For the offline reinforcement learning setting, where an agent learns robust policies from a pre-collected dataset in the nominal environment, we develop a family of algorithms, Robust Regularized Pessimistic Value Iteration (R2PVI), that employ linear function approximation and $f$-divergence-based regularization terms on transition kernels. Our design enables the agent to learn robust policies while maintaining computational efficiency. We provide instance-dependent upper bounds on the suboptimality gap of R2PVI policies, demonstrating that these bounds are influenced by how well the dataset covers state-action spaces visited by the optimal robust policy under robustly admissible transitions. Notably, we establish information-theoretic lower bounds for $d$-RRMDPs, which demonstrate that our approach is fundamental to learning robust policies. Finally, numerical experiments validate that R2PVI learns robust policies and exhibits superior computational efficiency compared to methods designed for $d$-DRMDPs.
Paperid:2516
Authors:Jieting Wang · ZhangZelong Zhang · Feijiang Li · Yuhua Qian · Xinyan Liang
Abstract:
Deep learning has been widely applied due to its powerful representation ability. Intuitively, the quality of this representation ability can be measured by sample similarity. Recently, an unsupervised metric was proposed to quantify the information content of similarity matrices, focusing on the discriminative power between sample similarities. However, in classification tasks, class-level discrimination is more essential than pairwise similarity. In this paper, we propose a novel loss function that measures the discriminative ability of representations by computing the Euclidean distance between the similarity matrix and the true adjacency matrix. We prove that random consistency affects the category bias and value bias of evaluation results. To mitigate the random consistency in the Euclidean distance, we derive its expectation when the permutation labels follow a uniform distribution and, based on this, provide an analytical solution for a pure Euclidean distance. Theoretical analysis shows that the pure Euclidean distance exhibits heterogeneity and unbiasedness. In addition, we provide a generalization performance bound for the pure Euclidean distance based on the exponential Orlicz norm, verifying its learnability. Experimental results show that our method significantly outperforms traditional loss functions, enhancing accuracy, $F_1$ score, and the model's ability to differentiate class structures. (The code will be released.)
Paperid:2517
Authors:Shikai Qiu · Lechao Xiao · Andrew Wilson · Jeffrey Pennington · Atish Agarwala
Abstract:
The recent trend towards large models was sparked by the observation that the compute-loss Pareto frontier follows an empirically predictable power law. We demonstrate that the full loss curves across model sizes under compute-optimal training exhibit an even more remarkable universality: they collapse to a single universal curve after a simple affine rescaling. We dub this phenomenon supercollapse, after the observation that the deviations from the universal curve are smaller than fluctuations due to stochastic optimization. We show that supercollapse occurs across a large variety of learning rate schedules, and requires accurate estimates of the compute-optimal data exponent and the irreducible loss, suggesting that supercollapse itself can be used to improve scaling predictions. We prove that training to a power-law Pareto frontier is in fact necessary for the emergence of a universal curve, and becomes sufficient for loss curves consisting of certain sums of power laws. By studying the effect of optimization noise, we show that learning rate schedules deform the loss curves in a predictable and scale-invariant way by annealing the optimization noise while preserving the collapse. This noise annealing process also explains the high degree of collapse at late times. Taken together, our findings suggest that studying loss curve universality is a valuable path to improving our quantitative understanding of the behavior of large ML systems.
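The affine-rescaling collapse can be illustrated on synthetic power-law loss curves (our toy example; the paper's actual rescaling is built from estimates of the compute-optimal data exponent and irreducible loss):

```python
import numpy as np

def rescale_curve(steps, losses, loss_inf):
    """Affine rescaling sketch: subtract the irreducible loss and
    normalize both axes, so curves of the form L(t) = L_inf + c * t^{-p}
    with the same exponent p collapse onto one universal curve."""
    excess = losses - loss_inf
    return steps / steps[-1], excess / excess[-1]

# two synthetic "model sizes" sharing exponent p = 0.5 but different prefactors
t = np.arange(1, 101, dtype=float)
x1, y1 = rescale_curve(t, 2.0 + 3.0 * t ** -0.5, loss_inf=2.0)
x2, y2 = rescale_curve(t, 2.0 + 7.0 * t ** -0.5, loss_inf=2.0)
```

After rescaling, `(x1, y1)` and `(x2, y2)` coincide exactly in this idealized case; in practice the deviations are what the supercollapse observation bounds.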
Paperid:2518
Authors:Zeming Wei · Yiwen Guo · Yisen Wang
Abstract:
Adversarial training (AT) has been considered one of the most effective methods for making deep neural networks robust against adversarial attacks, while the training mechanisms and dynamics of AT remain open research problems. In this paper, we present a novel perspective of studying AT based on class-wise feature attribution. Specifically, we identify the impact of a key family of features on AT that are shared by multiple classes, which we call cross-class features. These features are typically useful for robust classification, which we offer theoretical evidence to illustrate through a synthetic data model. We find that during the initial stage of AT, the model tends to learn more cross-class features until the best robustness checkpoint. As AT further squeezes the training robust loss and causes robust overfitting, we find that the model tends to make decisions based on more class-specific features. This hypothesis is verified over various model architectures and settings. Furthermore, our discovery can provide a unified view of two existing properties of AT: the advantage of soft-label training and robust overfitting. Overall, this discovery refines the current understanding of AT mechanisms and provides new perspectives to study them.
Paperid:2519
Authors:Tin Dizdarevic · Tobias Gessler · Ravi Hammond · Anisoara Calinescu · Jonathan Cook · Matteo Gallici · Andrei Lupu · Jakob Foerster
Abstract:
Achieving seamless coordination between AI agents and humans is crucial for real-world applications, yet it remains a significant open challenge. Hanabi is a cooperative card game featuring imperfect information, constrained communication, theory of mind requirements, and coordinated action -- making it an ideal testbed for human-AI coordination. However, its use for human-AI interaction has been limited by the challenges of human evaluation. In this work, we introduce the Ad-Hoc Human-AI Coordination Challenge (AH2AC2) to overcome the constraints of costly and difficult-to-reproduce human evaluations. We develop \textit{human proxy agents} trained on a large-scale human dataset, which serve as robust, cheap, and reproducible human-like evaluation partners in AH2AC2. To encourage the development of data-efficient methods, we open-source a dataset of 3,079 games, deliberately limiting the amount of available human gameplay data. We present baseline results for both two- and three-player Hanabi scenarios. To ensure fair evaluation, we host the proxy agents through a controlled evaluation system rather than releasing them publicly. Code and data are available as an \href{https://anonymous.4open.science/r/ah2ac2-2FDA}{anonymous repository}.
Paperid:2520
Authors:Linhao Luo · Zicheng Zhao · Reza Haffari · Yuan-Fang Li · Chen Gong · Shirui Pan
Abstract:
Large language models (LLMs) have demonstrated impressive reasoning abilities, but they still struggle with faithful reasoning due to knowledge gaps and hallucinations. To address these issues, knowledge graphs (KGs) have been utilized to enhance LLM reasoning through their structured knowledge. However, existing KG-enhanced methods, either retrieval-based or agent-based, encounter difficulties in accurately retrieving knowledge and efficiently traversing KGs at scale. In this work, we introduce graph-constrained reasoning (GCR), a novel framework that bridges structured knowledge in KGs with unstructured reasoning in LLMs. To eliminate hallucinations, GCR ensures faithful KG-grounded reasoning by integrating KG structure into the LLM decoding process through KG-Trie, a trie-based index that encodes KG reasoning paths. KG-Trie constrains the decoding process, allowing LLMs to directly reason on graphs and generate faithful reasoning paths grounded in KGs. Additionally, GCR leverages a lightweight KG-specialized LLM for graph-constrained reasoning alongside a powerful general LLM for inductive reasoning over multiple reasoning paths, resulting in accurate reasoning with zero reasoning hallucination. Extensive experiments on several KGQA benchmarks demonstrate that GCR achieves state-of-the-art performance and exhibits strong zero-shot generalizability to unseen KGs without additional training.
Paperid:2521
Authors:Tal Lancewicki · Yishay Mansour
Abstract:
We study online finite-horizon Markov Decision Processes with adversarially changing loss and aggregate bandit feedback (a.k.a. full-bandit). Under this type of feedback, the agent observes only the total loss incurred over the entire trajectory, rather than the individual losses at each intermediate step within the trajectory. We introduce the first Policy Optimization algorithms for this setting. In the known-dynamics case, we achieve the first *optimal* regret bound of $\tilde \Theta(H^2\sqrt{SAK})$, where $K$ is the number of episodes, $H$ is the episode horizon, $S$ is the number of states, and $A$ is the number of actions. In the unknown-dynamics case, we establish a regret bound of $\tilde O(H^3 S \sqrt{AK})$, significantly improving the best known result by a factor of $H^2 S^5 A^2$.
Paperid:2522
Authors:Xiaohui Chen · Yinkai Wang · JIAXING HE · Yuanqi Du · Soha Hassoun · Xiaolin Xu · Liping Liu
Abstract:
Graph generation is a critical task in numerous domains, including molecular design and social network analysis, due to its ability to model complex relationships and structured data. While most modern graph generative models utilize adjacency matrix representations, this work revisits an alternative approach that represents graphs as sequences of node and edge sets. We advocate for this approach due to its efficient encoding of graphs and propose a novel representation. Based on this representation, we introduce the Graph Generative Pretrained Transformer (G2PT), an auto-regressive model that learns graph structures via next-token prediction. To further exploit G2PT's capabilities as a general-purpose foundation model, we explore fine-tuning strategies for two downstream applications: goal-oriented generation and graph property prediction. We conduct extensive experiments across multiple datasets. Results indicate that G2PT achieves superior generative performance on both generic graph and molecule datasets. Furthermore, G2PT exhibits strong adaptability and versatility in downstream tasks, from molecular design to property prediction.
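The node-set-then-edge-set serialization that makes next-token prediction applicable to graphs can be sketched as follows (an illustration in the spirit of the described representation; the token vocabulary here is hypothetical, not G2PT's actual one):

```python
def graph_to_sequence(num_nodes, edges):
    """Serialize a graph as a flat token sequence: first the node set,
    then the edge set, so an autoregressive transformer can be trained
    with a plain next-token-prediction objective."""
    tokens = ["<bos>"]
    tokens += [f"n{i}" for i in range(num_nodes)]  # node set
    tokens.append("<sep>")
    for u, v in edges:                             # edge set, as node pairs
        tokens += [f"n{u}", f"n{v}"]
    tokens.append("<eos>")
    return tokens
```

For a 3-node path graph, `graph_to_sequence(3, [(0, 1), (1, 2)])` yields the sequence `<bos> n0 n1 n2 <sep> n0 n1 n1 n2 <eos>`, which is typically more compact than a flattened adjacency matrix for sparse graphs.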
Paperid:2523
Authors:Tobias Pielok · Bernd Bischl · David Rügamer
Abstract:
Recent years have witnessed growing interest in semi-implicit variational inference (SIVI) methods due to their ability to rapidly generate samples from highly complicated distributions. However, since the likelihood of these samples is non-trivial to estimate in high dimensions, current research focuses on finding effective SIVI training routines. While unbiased implicit variational inference (UIVI) has largely been dismissed as imprecise and computationally prohibitive because of its inner MCMC loop, we revisit this method and identify key shortcomings. In particular, we show that UIVI's MCMC loop can be effectively replaced via importance sampling and that the optimal proposal distribution can be learned stably by minimizing an expected forward Kullback–Leibler divergence without bias. Our refined approach demonstrates performance superior to or on par with state-of-the-art methods on established SIVI benchmarks.
Paperid:2524
Authors:Zikai Zhou · Qizheng Zhang · Hermann Kumbong · Kunle Olukotun
Abstract:
Fine-tuning large language models (LLMs) is increasingly costly as models scale to hundreds of billions of parameters, and even parameter-efficient fine-tuning (PEFT) methods like LoRA remain resource-intensive. We introduce LowRA, the first framework to enable LoRA fine-tuning below 2 bits per parameter with minimal performance loss. LowRA optimizes fine-grained quantization (mapping, threshold selection, and precision assignment) while leveraging efficient CUDA kernels for scalable deployment. Extensive evaluations across 4 LLMs and 4 datasets show that LowRA achieves a superior performance-precision trade-off above 2 bits and remains accurate down to 1.15 bits, reducing memory usage by up to 50\%. Our results highlight the potential of ultra-low-bit LoRA fine-tuning for resource-constrained environments.
Paperid:2525
Authors:Walid Bendada · Guillaume Salha-Galvan · Romain Hennequin · Théo Bontempelli · Thomas Bouabca · Tristan Cazenave
Abstract:
This paper introduces von Mises-Fisher exploration (vMF-exp), a scalable method for exploring large action sets in reinforcement learning problems where hyperspherical embedding vectors represent these actions. vMF-exp involves initially sampling a state embedding representation using a von Mises-Fisher distribution, then exploring this representation's nearest neighbors, which scales to virtually unlimited numbers of candidate actions. We show that, under theoretical assumptions, vMF-exp asymptotically maintains the same probability of exploring each action as Boltzmann Exploration (B-exp), a popular alternative that, nonetheless, suffers from scalability issues as it requires computing softmax values for each action. Consequently, vMF-exp serves as a scalable alternative to B-exp for exploring large action sets with hyperspherical embeddings. Experiments on simulated data, real-world public data, and the successful large-scale deployment of vMF-exp on the recommender system of a global music streaming service empirically validate the key properties of the proposed method.
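The sampling-then-nearest-neighbor scheme this abstract describes can be sketched in a few lines of NumPy. This is a hedged illustration only: the function names are invented, and the use of Wood's (1994) rejection algorithm for von Mises-Fisher sampling is an assumption, not a detail taken from the paper.

```python
import numpy as np

def sample_vmf(mu, kappa, rng):
    """Draw one sample from a von Mises-Fisher distribution on the unit
    sphere with mean direction `mu` and concentration `kappa`, using
    Wood's rejection algorithm."""
    d = mu.shape[0]
    # Rejection-sample the component w of the result along mu.
    b = (-2.0 * kappa + np.sqrt(4.0 * kappa**2 + (d - 1) ** 2)) / (d - 1)
    x0 = (1.0 - b) / (1.0 + b)
    c = kappa * x0 + (d - 1) * np.log(1.0 - x0**2)
    while True:
        z = rng.beta((d - 1) / 2.0, (d - 1) / 2.0)
        w = (1.0 - (1.0 + b) * z) / (1.0 - (1.0 - b) * z)
        if kappa * w + (d - 1) * np.log(1.0 - x0 * w) - c >= np.log(rng.uniform()):
            break
    # Uniform direction in the tangent space orthogonal to mu.
    v = rng.normal(size=d)
    v -= v.dot(mu) * mu
    v /= np.linalg.norm(v)
    return w * mu + np.sqrt(max(1.0 - w**2, 0.0)) * v

def vmf_explore(state_emb, action_embs, kappa, rng):
    """Sample a perturbed state direction, then return the index of the
    nearest action embedding (the scalable nearest-neighbor step)."""
    mu = state_emb / np.linalg.norm(state_emb)
    direction = sample_vmf(mu, kappa, rng)
    return int(np.argmax(action_embs @ direction))
```

In practice the `argmax` over all actions would itself be replaced by an approximate nearest-neighbor index, which is what makes the approach scale to very large action sets.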
Paperid:2526
Authors:Jiachen Yang · Yusong Wang · Yanmei Fang · Yunshu Dai · Fangjun Huang
Abstract:
Latent Diffusion Models (LDMs) enable fine-tuning with only a few images and have become widely used on the Internet. However, they can also be misused to generate fake images, leading to privacy violations and social risks. Existing adversarial attack methods primarily introduce noise distortions to generated images but fail to completely erase identity semantics. In this work, we identify the variance of VAE latent code as a key factor that influences image distortion. Specifically, larger variances result in stronger distortions and ultimately erase semantic information. Based on this finding, we propose a Laplace-based (LA) loss function that optimizes along the fastest variance growth direction, ensuring each optimization step is locally optimal. Additionally, we analyze the limitations of existing methods and reveal that their loss functions often fail to align gradient signs with the direction of variance growth. They also struggle to ensure efficient optimization under different variance distributions. To address these issues, we further propose a novel Lagrange Entropy-based (LE) loss function. Experimental results demonstrate that our methods achieve state-of-the-art performance on CelebA-HQ and VGGFace2. Both proposed loss functions effectively lead diffusion models to generate pure-noise images with identity semantics completely erased. Furthermore, our methods exhibit strong transferability across diverse models and efficiently complete attacks with minimal computational resources. Our work provides a practical and efficient solution for privacy protection.
Paperid:2527
Authors:Nora Schneider · Lars Lorch · Niki Kilbertus · Bernhard Schölkopf · Andreas Krause
Abstract:
We consider the problem of predicting perturbation effects via causal models. In many applications, it is a priori unknown which mechanisms of a system are modified by an external perturbation, even though the features of the perturbation are available. For example, in genomics, some properties of a drug may be known, but not their causal effects on the regulatory pathways of cells. We propose a generative intervention model (GIM) that learns to map these perturbation features to distributions over atomic interventions in a jointly estimated causal model. Contrary to prior approaches, this enables us to predict the distribution shifts of unseen perturbation features while gaining insights about their mechanistic effects in the underlying data-generating process. On synthetic data and scRNA-seq drug perturbation data, GIMs achieve robust out-of-distribution predictions on par with unstructured approaches, while effectively inferring the underlying perturbation mechanisms, often better than other causal inference methods.
Paperid:2528
Authors:Xin Chen · Yarden As · Andreas Krause
Abstract:
Large language models (LLMs) have emerged as powerful tools but pose significant safety risks through harmful outputs and vulnerability to adversarial attacks. We propose SaP (short for Safety Polytope), a geometric approach to LLM safety that learns and enforces multiple linear safety constraints directly in the model’s representation space. Our framework identifies safe and unsafe regions via the polytope’s facets, enabling both detection and correction of unsafe outputs through geometric steering. Unlike existing approaches that modify model weights, SaP operates post hoc in the representation space, preserving model capabilities while enforcing safety constraints. Experiments across multiple LLMs demonstrate that our method reduces adversarial attack success rates from 30-60% to below 3% while maintaining performance on standard tasks. Analysis of the learned polytope facets reveals natural specialization in detecting different safety concepts, providing interpretable insights into its safety mechanisms.
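The core geometric idea (a safe region defined by linear facets, with detection and steering) can be illustrated with a minimal NumPy sketch. The function names, the half-space form `A h <= b`, and the one-pass projection rule below are illustrative assumptions; the paper learns the polytope and its correction mechanism rather than using this exact rule.

```python
import numpy as np

def polytope_violations(h, A, b):
    """Indices of violated facets for a hidden state h, where the safe
    region is the polytope {h : A h <= b}."""
    return np.flatnonzero(A @ h > b)

def steer_to_polytope(h, A, b):
    """Greedy geometric steering: for each violated half-space, project h
    onto the facet's boundary hyperplane. A single pass is shown; in
    general, repeated passes may be needed since projections can
    re-violate earlier facets."""
    h = h.copy()
    for i in polytope_violations(h, A, b):
        a = A[i]
        # Orthogonal projection onto the hyperplane a . h = b_i.
        h -= (a @ h - b[i]) / (a @ a) * a
    return h
```

Detection is then just `polytope_violations(h, A, b).size > 0`, and correction moves the representation back toward the safe region without touching model weights.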
Paperid:2529
Authors:Benjamin Moseley · Helia Niaparast · Karan Singh
Abstract:
Global minimum cut is a fundamental combinatorial optimization problem with wide-ranging applications. Often in practice, these problems are solved repeatedly on families of similar or related instances. However, the de facto algorithmic approach is to solve each instance of the problem from scratch, discarding information from prior instances. In this paper, we consider how predictions informed by prior instances can be used to warm-start practical minimum cut algorithms. The paper considers the widely used Karger's algorithm and its counterpart, the Karger-Stein algorithm. Given good predictions, we show these algorithms become near-linear time and have robust performance to erroneous predictions. Both of these algorithms are randomized edge-contraction algorithms. Our natural idea is to probabilistically prioritize the contraction of edges that are unlikely to be in the minimum cut.
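The closing idea, probabilistically prioritizing the contraction of edges that are unlikely to be in the minimum cut, can be sketched as a prediction-biased variant of Karger's contraction. The `1 - pred` weighting below is a hypothetical choice for illustration; the paper's actual warm-starting scheme may weight edges differently.

```python
import random

def predicted_karger(n, edges, pred, rng):
    """One trial of Karger's edge-contraction algorithm on vertices
    0..n-1, biased by predictions: pred[i] is the predicted probability
    that edges[i] crosses the minimum cut, so low-pred ("safe") edges
    are contracted first. Returns the surviving cut edges."""
    parent = list(range(n))

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = n
    weights = [1.0 - p for p in pred]  # contract likely-safe edges first
    while components > 2:
        i = rng.choices(range(len(edges)), weights=weights)[0]
        weights[i] = 0.0  # never pick the same edge twice
        ru, rv = find(edges[i][0]), find(edges[i][1])
        if ru != rv:  # contract: merge the two super-vertices
            parent[ru] = rv
            components -= 1
    # Edges still crossing the two super-vertices form the candidate cut.
    return [e for e in edges if find(e[0]) != find(e[1])]
```

As in plain Karger, the trial is repeated many times and the smallest cut found is kept; good predictions make each trial far more likely to preserve the true minimum cut.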
Paperid:2530
Authors:Yuhao Huang · Taos Transue · Shih-Hsin Wang · William Feldman · Hong Zhang · Bao Wang
Abstract:
Conditional flow matching (CFM) stands out as an efficient, simulation-free approach for training flow-based generative models, achieving remarkable performance for data generation. However, CFM is insufficient to ensure accuracy in learning probability paths. In this paper, we introduce a new partial differential equation characterization for the error between the learned and exact probability paths, along with its solution. We show that the total variation between probability paths is bounded above by a combination of the CFM loss and an associated divergence loss. This theoretical insight leads to the design of a new objective function that simultaneously matches the flow and its divergence. Our new approach improves the performance of the flow-based generative model by a noticeable margin without significantly raising the computational cost. We showcase the advantages of this enhanced training approach over CFM on several important benchmark tasks, including generative modeling for dynamical systems, DNA sequences, and videos.
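For readers unfamiliar with the baseline being extended, here is a minimal sketch of the standard CFM regression target for the linear (optimal-transport) conditional path. This shows only the vanilla CFM loss that the abstract says is insufficient on its own; the paper's contribution is the additional divergence-matching term, which is not reproduced here.

```python
import numpy as np

def cfm_pair(x0, x1, t):
    """Linear conditional path used in optimal-transport CFM: returns the
    interpolant x_t and the conditional velocity u_t that the vector-field
    network regresses onto. x0 is noise, x1 is data, t has shape (batch,)."""
    xt = (1.0 - t)[:, None] * x0 + t[:, None] * x1
    ut = x1 - x0  # constant along the straight-line path
    return xt, ut

def cfm_loss(v_pred, ut):
    """Per-batch CFM objective: mean squared error to the target velocity.
    The paper's divergence-matching loss would be added on top of this."""
    return float(np.mean((v_pred - ut) ** 2))
```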
Paperid:2531
Authors:Ayush Jain · Alexander Swerdlow · Yuzhou Wang · Sergio Arnaud · Ada Martin · Alexander Sax · Franziska Meier · Katerina Fragkiadaki
Abstract:
Progress in 3D vision-language learning has been hindered by the scarcity of large-scale 3D datasets. We introduce UniVLG, a unified architecture for 2D and 3D vision-language understanding that bridges the gap between existing 2D-centric models and the rich 3D sensory data available in embodied systems. Our approach initializes most model weights from pre-trained 2D models and trains on both 2D and 3D vision-language data. We propose a novel language-conditioned mask decoder shared across 2D and 3D modalities to ground objects effectively in both RGB and RGB-D images, outperforming box-based approaches. To further reduce the domain gap between 2D and 3D, we incorporate 2D-to-3D lifting strategies, enabling UniVLG to utilize 2D data to enhance 3D performance. With these innovations, our model achieves state-of-the-art performance across multiple 3D vision-language grounding tasks, demonstrating the potential of transferring advances from 2D vision-language learning to the data-constrained 3D domain. Furthermore, co-training on both 2D and 3D data enhances performance across modalities without sacrificing 2D capabilities. By removing the reliance on 3D mesh reconstruction and ground-truth object proposals, UniVLG sets a new standard for realistic, embodied-aligned evaluation.
Paperid:2532
Authors:Moritz Vandenhirtz · Julia Vogt
Abstract:
Understanding the decision-making process of machine learning models provides valuable insights into the task, the data, and the reasons behind a model's failures. In this work, we propose a method that performs inherently interpretable predictions through the instance-wise sparsification of input images. To align the sparsification with human perception, we learn the masking in the space of semantically meaningful pixel regions rather than at the pixel level. Additionally, we introduce an explicit way to dynamically determine the required level of sparsity for each instance. We show empirically on semi-synthetic and natural image datasets that our inherently interpretable classifier produces more meaningful, human-understandable predictions than state-of-the-art benchmarks.
Paperid:2533
Authors:The Viet Bui · Tien Mai · Thanh Nguyen
Abstract:
Inferring reward functions from demonstrations is a key challenge in reinforcement learning (RL), particularly in multi-agent RL (MARL). The large joint state-action spaces and intricate inter-agent interactions in MARL make inferring the joint reward function especially challenging. While prior studies in single-agent settings have explored ways to recover reward functions and expert policies from human preference feedback, such studies in MARL remain limited. Existing methods typically combine two separate stages, supervised reward learning and standard MARL algorithms, leading to unstable training processes. In this work, we exploit the inherent connection between reward functions and Q functions in cooperative MARL to introduce a novel end-to-end preference-based learning framework. Our framework is supported by a carefully designed multi-agent value decomposition strategy that enhances training efficiency. Extensive experiments on two state-of-the-art benchmarks, SMAC and MAMuJoCo, using preference data generated by both rule-based and large language model approaches demonstrate that our algorithm consistently outperforms existing methods across various tasks.
Paperid:2534
Authors:Joongkyu Lee · Min-hwan Oh
Abstract:
In this paper, we propose an improved online confidence bound for multinomial logistic (MNL) models and apply this result to MNL bandits, achieving variance-dependent optimal regret. Recently, Lee & Oh (2024) established an online confidence bound for MNL models and achieved nearly minimax-optimal regret in MNL bandits. However, their results still depend on the norm-boundedness of the unknown parameter $B$ and the maximum size of possible outcomes $K$. To address this, we first derive an online confidence bound of $\mathcal{O} (\sqrt{d \log t} + B )$, which is a significant improvement over the previous bound of $\mathcal{O} (B \sqrt{d} \log t \log K )$ (Lee & Oh, 2024). This is mainly achieved by establishing tighter self-concordant properties of the MNL loss and introducing a novel intermediary term to bound the estimation error. Using this new online confidence bound, we propose a constant-time algorithm, **OFU-MNL++**, which achieves a variance-dependent regret bound of $\mathcal{O} \Big( d \log T \sqrt{ \smash[b]{\sum_{t=1}^T} \sigma_t^2 } \Big) $ for sufficiently large $T$, where $\sigma_t^2$ denotes the variance of the rewards at round $t$, $d$ is the dimension of the contexts, and $T$ is the total number of rounds. Furthermore, we introduce a Maximum Likelihood Estimation (MLE)-based algorithm, **OFU-M$^2$NL**, which achieves an anytime, $\operatorname{poly}(B)$-free regret of $\mathcal{O} \Big( d \log (BT) \sqrt{ \smash[b]{\sum_{t=1}^T} \sigma_t^2 } \Big) $.
Paperid:2535
Authors:Vladimir Braverman · Jon C. Ergun · Chen Wang · Samson Zhou
Abstract:
Hierarchical clustering (HC) is an important data analysis technique in which the goal is to recursively partition a dataset into a tree-like structure while grouping together similar data points at each level of granularity. Unfortunately, for many of the proposed HC objectives, hardness-of-approximation results pose strong barriers to efficient approximation algorithms. We consider the problem of hierarchical clustering given auxiliary information from natural oracles in the learning-augmented framework. Our main results are algorithms that, given learning-augmented oracles, compute efficient approximate HC trees for the celebrated Dasgupta's and Moseley-Wang objectives that overcome known hardness barriers.
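As background for the objective being approximated, Dasgupta's cost of a binary HC tree is the sum, over similarity-weighted pairs, of the number of leaves under the pair's least common ancestor. A small sketch (the tree encoding as nested tuples and the function name are our own conventions, not the paper's):

```python
def dasgupta_cost(tree, w):
    """Dasgupta's HC objective for a binary tree given as nested 2-tuples
    with hashable leaf ids, and w mapping frozenset({i, j}) -> similarity.
    Each pair (i, j) contributes w(i, j) * |leaves(lca(i, j))|."""
    total = 0.0

    def rec(t):
        nonlocal total
        if not isinstance(t, tuple):
            return [t]  # a leaf
        left, right = rec(t[0]), rec(t[1])
        merged = left + right
        # Pairs split at this node have this node as their LCA.
        for i in left:
            for j in right:
                total += w.get(frozenset((i, j)), 0.0) * len(merged)
        return merged

    rec(tree)
    return total
```

A good HC tree separates dissimilar points high in the tree and keeps similar points together near the leaves, which is exactly what minimizing this cost rewards.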
Paperid:2536
Authors:Ho Young Jhoo · Chung-Kil Hur · Nuno P. Lopes
Abstract:
The memory requirements of machine learning (ML) models have been growing quickly. However, the memory capacity of GPUs has not kept pace. Despite significant research on reducing the memory usage of ML models, larger models do not fit in a single device. A popular solution to the memory capacity issue is to use multiple devices in parallel. In this paper, we focus on a particular form of parallelism called pipelining, as it offers a good balance between cost and performance for many ML models. We present Pfeife, the first tool that integrates with PyTorch to provide automatic pipelining of ML models. Pfeife intercepts the execution of models and parallelizes them transparently with no manual work required. We show that Pfeife is capable of running large models that would otherwise not run due to not fitting in a single device. Moreover, Pfeife can pipeline non-sequential models such as StableDiffusion that are not supported by previous pipelining parallelism tools. Pfeife outperforms state-of-the-art tools by up to 22%.
Paperid:2537
Authors:Thibaud Southiratn · Bonil Koo · Yijingxiu Lu · Sun Kim
Abstract:
Dual-target molecule generation, which focuses on discovering compounds capable of interacting with two target proteins, has garnered significant attention due to its potential for improving therapeutic efficiency, safety and resistance mitigation. Existing approaches face two critical challenges. First, they simplify the complex dual-target optimization problem to a linear combination of individual objectives, failing to capture important trade-offs between target engagement and molecular properties. Second, they typically do not integrate synthetic planning into the generative process. This highlights a need for more appropriate objective function design and synthesis-aware methodologies tailored to the dual-target molecule generation task. In this work, we propose CombiMOTS, a Pareto Monte Carlo Tree Search (PMCTS) framework that generates dual-target molecules. CombiMOTS is designed to explore a synthesizable fragment space while employing vectorized optimization constraints to encapsulate target affinity and physicochemical properties. Extensive experiments on real-world databases demonstrate that CombiMOTS produces novel dual-target molecules with high docking scores, enhanced diversity and balanced pharmacological characteristics, showcasing its potential as a powerful tool for dual-target drug discovery.
Paperid:2538
Authors:Jeonghye Kim · Yongjae Shin · Whiyoung Jung · Sunghoon Hong · Deunsol Yoon · Youngchul Sung · Kanghoon Lee · Woohyung Lim
Abstract:
Reinforcement learning with offline data suffers from Q-value extrapolation errors. To address this issue, we first demonstrate that linear extrapolation of the Q-function beyond the data range is particularly problematic. To mitigate this, we propose guiding the gradual decrease of Q-values outside the data range, which is achieved through reward scaling with layer normalization (RS-LN) and a penalization mechanism for infeasible actions (PA). By combining RS-LN and PA, we develop a new algorithm called PARS. We evaluate PARS across a range of tasks, demonstrating superior performance compared to state-of-the-art algorithms in both offline training and online fine-tuning on the D4RL benchmark, with notable success in the challenging AntMaze Ultra task.
Paperid:2539
Authors:Pascal Junior Tikeng Notsawo · Guillaume Dumas · Guillaume Rabusseau
Abstract:
Grokking refers to a delayed generalization following overfitting when optimizing artificial neural networks with gradient-based methods. In this work, we demonstrate that grokking can be induced by regularization, either explicit or implicit. More precisely, we show that when there exists a model with a property $P$ (e.g., sparse or low-rank weights) that generalizes on the problem of interest, gradient descent with a small but non-zero regularization of $P$ (e.g., $\ell_1$ or nuclear norm regularization) results in grokking. This extends previous work showing that small non-zero weight decay induces grokking. Moreover, our analysis shows that over-parameterization by adding depth makes it possible to grok or ungrok without explicitly using regularization, which is impossible in shallow cases. We further show that the $\ell_2$ norm of the model parameters cannot be used as an indicator of grokking in a general setting in place of the regularized property $P$: the $\ell_2$ norm grows in many cases where no weight decay is used, but the model generalizes anyway. We also show that grokking can be amplified through only data selection (with any other hyperparameter fixed).
Paperid:2540
Authors:Ruipeng Zhang · Ya-Chien Chang · Sicun Gao
Abstract:
The Maximum Entropy Reinforcement Learning (MaxEnt RL) framework is a leading approach for achieving efficient learning and robust performance across many RL tasks. However, MaxEnt methods have also been shown to struggle with performance-critical control problems in practice, where non-MaxEnt algorithms can successfully learn. In this work, we analyze how the trade-off between robustness and optimality affects the performance of MaxEnt algorithms in complex control tasks: while entropy maximization enhances exploration and robustness, it can also mislead policy optimization, leading to failure in tasks that require precise, low-entropy policies. Through experiments on a variety of control problems, we concretely demonstrate this misleading effect. Our analysis leads to better understanding of how to balance reward design and entropy maximization in challenging control problems.
Paperid:2541
Authors:Sophia Tang · Yinuo Zhang · Pranam Chatterjee
Abstract:
We present PepTune, a multi-objective discrete diffusion model for the simultaneous generation and optimization of therapeutic peptide SMILES. Built on the Masked Discrete Language Model (MDLM) framework, PepTune ensures valid peptide structures with bond-dependent masking schedules and penalty-based objectives. To guide the diffusion process, we propose a Monte Carlo Tree Search (MCTS)-based strategy that balances exploration and exploitation to iteratively refine Pareto-optimal sequences. MCTS integrates classifier-based rewards with search-tree expansion, overcoming gradient estimation challenges and data sparsity. Using PepTune, we generate diverse, chemically modified peptides simultaneously optimized for multiple therapeutic properties, including target binding affinity, membrane permeability, solubility, hemolysis, and non-fouling for various disease-relevant targets. Overall, our results demonstrate that MCTS-guided masked discrete diffusion is a powerful and modular approach for multi-objective sequence design in discrete state spaces.
Paperid:2542
Authors:Yang Li · Jiale Ma · Yebin Yang · Qitian Wu · Hongyuan Zha · Junchi Yan
Abstract:
Directly predicting labels from data inputs has been a longstanding default mechanism for label learning tasks, e.g., supervised learning. Its trade-off between feature compression and label prediction is studied under the information theory framework, which typically assumes that the information content of labels is significantly less than that of data inputs, leading to model designs that prioritize compressing and extracting features from data inputs. However, recent prediction tasks often involve complex labels, exacerbating the challenge of learning mappings from learned features to high-fidelity label representations. To this end, we draw inspiration from the consistency training concept in generative consistency models and propose predictive consistency learning (PCL), a novel learning paradigm that decomposes the full label information into a progressive learning procedure, mitigating the label capture challenge. Unlike traditional approaches, PCL not only predicts labels directly from data inputs but also receives input from noise-perturbed labels as an additional reference, pursuing predictive consistency across different noise levels. It simultaneously learns the relationship between latent features and a spectrum of label information from zero to complete, which enables progressive learning for complex predictions and allows multi-step inference analogous to gradual denoising, thereby enhancing the prediction quality. Experiments on vision, text, and graph tasks show the superiority of PCL over conventional supervised training in complex label prediction tasks.
Paperid:2543
Authors:Junwei Su · shan Wu
Abstract:
Graph diffusion generative models (GDGMs) have emerged as powerful tools for generating high-quality graphs. However, their broader adoption faces challenges in \emph{scalability and size generalization}. GDGMs struggle to scale to large graphs due to their high memory requirements, as they typically operate in the full graph space, requiring the entire graph to be stored in memory during training and inference. This constraint limits their feasibility for large-scale real-world graphs. GDGMs also exhibit poor size generalization, with limited ability to generate graphs of sizes different from those in the training data, restricting their adaptability across diverse applications. To address these challenges, we propose the stochastic block graph diffusion (SBGD) model, which refines graph representations into a block graph space. This space incorporates structural priors based on real-world graph patterns, significantly reducing memory complexity and enabling scalability to large graphs. The block representation also improves size generalization by capturing fundamental graph structures. Empirical results show that SBGD achieves significant memory improvements (up to 6$\times$) while maintaining comparable or even superior graph generation performance relative to state-of-the-art methods. Furthermore, experiments demonstrate that SBGD better generalizes to unseen graph sizes. The significance of SBGD extends beyond being a scalable and effective GDGM; \emph{it also exemplifies the principle of modularization in generative modelling, offering a new avenue for exploring generative models by decomposing complex tasks into more manageable components.}
Paperid:2544
Authors:Zijun Liao · Jinbiao Chen · Debing Wang · Zizhen Zhang · Jiahai Wang
Abstract:
Neural Combinatorial Optimization (NCO) has emerged as a promising approach for NP-hard problems. However, prevailing RL-based methods suffer from low sample efficiency due to sparse rewards and underused solutions. We propose Preference Optimization for Combinatorial Optimization (POCO), a training paradigm that leverages solution preferences via objective values. It introduces: (1) an efficient preference-pair construction that better explores and exploits solutions, and (2) a novel loss function that adaptively scales gradients via objective differences, removing reliance on reward models or reference policies. Experiments on Job-Shop Scheduling (JSP), Traveling Salesman (TSP), and Flexible Job-Shop Scheduling (FJSP) show POCO outperforms state-of-the-art neural methods, substantially reducing optimality gaps with efficient inference. POCO is architecture-agnostic, enabling seamless integration with existing NCO models, and establishes preference optimization as a principled framework for combinatorial optimization.
Paperid:2545
Authors:Oleh Rybkin · Michal Nauman · Preston Fu · Charlie Snell · Pieter Abbeel · Sergey Levine · Aviral Kumar
Abstract:
Scaling data and compute is critical in modern machine learning. However, scaling also demands predictability: we want methods to not only perform well with more compute or data, but also have their performance be predictable from low compute or low data runs, without ever running the large-scale experiment. In this paper, we show the predictability of value-based off-policy deep RL. First, we show that data and compute requirements to reach a given performance level lie on a Pareto frontier, controlled by the updates-to-data (UTD) ratio. By estimating this frontier, we can extrapolate data requirements into a higher compute regime, and compute requirements into a higher data regime. Second, we determine the optimal allocation of total budget across data and compute to obtain given performance and use it to determine hyperparameters that maximize performance for a given budget. Third, this scaling behavior is enabled by first estimating predictable relationships between different hyperparameters, which is used to counteract effects of overfitting and plasticity loss unique to RL. We validate our approach using three algorithms: SAC, BRO, and PQL on DeepMind Control, OpenAI Gym, and IsaacGym, when extrapolating to higher levels of data, compute, budget, or performance.
Paperid:2546
Authors:Wenke Huang · Jian Liang · Zekun Shi · Didi Zhu · Guancheng Wan · He Li · Bo Du · Dacheng Tao · Mang Ye
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated strong generalization capabilities across diverse distributions and tasks, largely due to extensive pre-training datasets. Fine-tuning MLLMs has become a common practice to improve performance on specific downstream tasks. However, during fine-tuning, MLLMs often face the risk of forgetting knowledge acquired during pre-training, which can result in a decline in generalization abilities. To balance the trade-off between generalization and specialization, we propose measuring the parameter importance for both pre-trained and fine-tuning distributions, based on frozen pre-trained weight magnitude and accumulated fine-tuning gradient values. We further apply an importance-aware weight allocation strategy, selectively updating relatively important parameters for downstream tasks. We conduct empirical evaluations on both image captioning and visual question-answering tasks using various MLLM architectures. The comprehensive experimental analysis demonstrates the effectiveness of the proposed solution, highlighting the efficiency of the crucial modules in enhancing downstream specialization performance while mitigating generalization degradation in MLLM fine-tuning.
Paperid:2547
Authors:Aravind Rajeswaran · Arjun Majumdar · Mikael Henaff · Oleksandr Maksymets · Krishna Murthy Jatavallabhula · Mrinal Kalakrishnan · Sergio Arnaud · Paul McVay · Alexander Sax · Ada Martin · Ruslan Partsey · Franziska Meier · Nicolas Ballas · Michael Rabbat · Mahmoud Assran · Ayush Jain · Vincent-Pierre Berges · Ang Cao · Abha Gejji · Daniel Dugas · Phillip Thomas · Ishita Prasad
Abstract:
We present Locate 3D, a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp". Locate 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, Locate 3D operates directly on sensor observation streams (posed RGB-D images), enabling real-world deployment on robots and AR devices. Key to our approach is 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds. The input to 3D-JEPA is a 3D pointcloud, featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized pointcloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. Additionally, we introduce LX3D, a new dataset for 3D referential grounding, spanning multiple capture setups with over 130K annotations. This enables a systematic study of generalization capabilities as well as a stronger model.
Paperid:2548
Authors:Nicholas Carlini · Edoardo Debenedetti · Javier Rando · Milad Nasr · Florian Tramer
Abstract:
We introduce AutoAdvExBench, a benchmark to evaluate if large language models (LLMs) can autonomously exploit defenses to adversarial examples. Unlike existing (security) benchmarks that often serve as proxies for real-world tasks, AutoAdvExBench directly measures LLMs' success on tasks regularly performed by machine learning security experts. This approach offers a significant advantage: if an LLM could solve the challenges presented in AutoAdvExBench, it would immediately present practical utility for adversarial machine learning researchers. We then design a strong agent that is capable of breaking 75% of CTF-like ("homework exercise") adversarial example defenses. However, we show that this agent is only able to succeed on 13% of the real-world defenses in our benchmark, indicating the large gap between difficulty in attacking "real" code, and CTF-like code. This benchmark is publicly accessible at [redacted for review].
Paperid:2549
Authors:Roy Betser · Meir Yossef Levi · Guy Gilboa
Abstract:
Likelihood approximations for images are not trivial to compute and can be useful in many applications. We examine the use of Contrastive Language-Image Pre-training (CLIP) to assess the likelihood of images and captions. We introduce Whitened CLIP, a novel transformation of the CLIP latent space via an invertible linear operation. This transformation ensures that each feature in the embedding space has zero mean, unit standard deviation, and no correlation with all other features, resulting in an identity covariance matrix. We show that the whitened embedding statistics can be well approximated as a standard normal distribution, so the log-likelihood is estimated simply by the squared Euclidean norm in the whitened embedding space. The whitening procedure is completely training-free and performed using a pre-computed whitening matrix, hence, is very fast. We present several preliminary experiments demonstrating the properties and applicability of these likelihood scores to images and captions.
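The whitening-plus-squared-norm recipe the abstract describes can be sketched directly. This is an illustrative NumPy version under assumptions: the function names are ours, and ZCA whitening via an eigendecomposition of the sample covariance is one standard way to build the pre-computed whitening matrix; the paper may construct it differently.

```python
import numpy as np

def fit_whitening(E):
    """From embeddings E (n x d), compute the mean and a whitening matrix W
    so that W @ (e - mu) has zero mean and identity covariance (ZCA
    whitening from the eigendecomposition of the sample covariance)."""
    mu = E.mean(axis=0)
    cov = np.cov(E - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    return mu, W

def log_likelihood(e, mu, W):
    """Log-likelihood under the standard-normal approximation in whitened
    space: up to an additive constant, minus half the squared Euclidean
    norm of the whitened embedding."""
    z = W @ (e - mu)
    return -0.5 * float(z @ z)
```

Once `mu` and `W` are pre-computed from a reference set of embeddings, scoring a new image or caption embedding is a single matrix-vector product and a norm, which is what makes the procedure fast.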
Paperid:2550
Authors:Stas Syrota · Yevgen Zainchkovskyy · Johnny Xi · Benjamin Bloem-Reddy · Søren Hauberg
Abstract:
Deep latent variable models learn condensed representations of data that, hopefully, reflect the inner workings of the studied phenomena. Unfortunately, these latent representations are not statistically identifiable, meaning they cannot be uniquely determined. Domain experts, therefore, need to tread carefully when interpreting them. Current solutions limit the lack of identifiability through additional constraints on the latent variable model, e.g. by requiring labeled training data, or by restricting the expressivity of the model. We change the goal: instead of identifying the latent variables, we identify relationships between them such as meaningful distances, angles, and volumes. We prove this is feasible under very mild model conditions and without additional labeled data. We empirically demonstrate that our theory results in more reliable latent distances, offering a principled path forward in extracting trustworthy conclusions from deep latent variable models.
Paperid:2551
Authors:Benedikt Böck · Andreas Oeldemann · Timo Mayer · Francesco Rossetto · Wolfgang Utschick
Abstract:
Learning the distribution of the wireless channel within a specific environment of interest is essential to exploit the full potential of machine learning (ML) for wireless communication and radar applications. Generative modeling offers a promising framework to address this problem. However, existing approaches pose unresolved challenges, including the need for high-quality training data, limited generalizability, and a lack of physical interpretability. To address these issues, we propose a model that combines the physics-related compressibility of wireless channels with sparse Bayesian generative modeling (SBGM) to learn the distribution of the underlying physical channel parameters. By leveraging the sparsity-inducing characteristics of SBGM, our method can learn from compressed observations received by an access point (AP) during default online operation. Moreover, it is physically interpretable and generalizes to arbitrary system configurations without requiring retraining.
Paperid:2552
Authors:Daniel Gomm · Zhan Qu · Michael Färber
Abstract:
Temporal Graph Neural Networks (TGNNs) are widely used to model dynamic systems where relationships and features evolve over time. Although TGNNs demonstrate strong predictive capabilities in these domains, their complex architectures pose significant challenges for explainability. Counterfactual explanation methods provide a promising solution by illustrating how modifications to input graphs can influence model predictions. To address this challenge, we present **CoDy**—**Co**unterfactual Explainer for **Dy**namic Graphs—a model-agnostic, instance-level explanation approach that identifies counterfactual subgraphs to interpret TGNN predictions. CoDy employs a search algorithm that combines Monte Carlo Tree Search with heuristic selection policies, efficiently exploring a vast search space of potential explanatory subgraphs by leveraging spatial, temporal, and local gradient information. Extensive experiments against state-of-the-art factual and counterfactual baselines demonstrate CoDy's effectiveness, achieving average $fidelity_+$ scores up to 6.11 times higher than the baselines. Our code is available at: https://anonymous.4open.science/r/CoDy-F2CF
Paperid:2553
Authors:Bowen Tan · Zheng Xu · Eric Xing · Zhiting Hu · Shanshan Wu
Abstract:
Synthetic data offers a promising path to train models while preserving data privacy. Differentially private (DP) finetuning of large language models (LLMs) as a data generator is effective, but is impractical when computation resources are limited. Meanwhile, prompt-based methods like private evolution depend heavily on manual prompts and use private information ineffectively in their filtering-based process. To overcome these limitations, we propose CTCL (Data Synthesis with Controllability and Clustering), a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale LLM finetuning. CTCL pretrains a lightweight 140M conditional generator and a clustering-based topic model on large-scale public data. To adapt to the private domain, the generator is DP-finetuned on private data for fine-grained textual information, while the topic model extracts a DP histogram representing distributional information. The DP generator then samples according to the DP histogram to synthesize a desired number of examples. Evaluation across five diverse domains demonstrates the effectiveness of our framework, particularly in the strong privacy regime. Further analysis validates the design of each framework component and highlights the scalability of our approach.
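The DP histogram component can be illustrated generically with the Laplace mechanism (a standard DP construction; the paper's exact procedure may differ, and `dp_histogram` is our hypothetical helper):

```python
import numpy as np

def dp_histogram(topic_ids, num_topics, epsilon, rng):
    """Release a differentially private histogram of topic counts via the
    Laplace mechanism. If each record contributes one count, the L1
    sensitivity is 1 and the noise scale is 1/epsilon."""
    counts = np.bincount(topic_ids, minlength=num_topics).astype(float)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=num_topics)
    noisy = np.clip(noisy, 0.0, None)   # counts cannot be negative
    return noisy / noisy.sum()          # normalize to sampling weights

rng = np.random.default_rng(0)
private_topics = rng.integers(0, 5, size=2000)  # toy "private" topic labels
weights = dp_histogram(private_topics, num_topics=5, epsilon=1.0, rng=rng)
samples = rng.choice(5, size=10, p=weights)     # sample topics to condition generation
```

The released weights can then drive the conditional generator without further touching the private data.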
Paperid:2554
Authors:Mingjie Sun · Yida Yin · Zhiqiu (Oscar) Xu · Zico Kolter · Zhuang Liu
Abstract:
In this work, we unveil and study the idiosyncrasies in Large Language Models (LLMs): unique patterns in the outputs of particular LLMs that can be used to distinguish between them. To do so, we consider a simple classification task -- given a particular output sequence, the goal is to predict the source LLM. We consider this synthetic task on several groups of LLMs, and find that simply fine-tuning existing text embedding models on the LLM outputs results in excellent classification accuracy. Notably, we achieve 97.8% accuracy on held-out validation data for a four-way classification task involving GPT-4o, Claude-3.5-Sonnet, Grok-2, and Gemini-1.5. Our further investigation reveals that these idiosyncrasies are heavily rooted in their word-level distributions. Intriguingly, these patterns persist even when the responses are rewritten, translated, or summarized via an external LLM, suggesting they are also encoded in their semantic content. We further leverage LLMs to generate detailed, open-ended descriptions of each model's idiosyncrasy. Finally, we explore the broader implications of our findings, particularly for synthetic data generation and inferring model similarity.
Paperid:2555
Authors:Guangzhi Sun · Xiao Zhan · Shutong Feng · Phil Woodland · Jose Such
Abstract:
Aligning large language models (LLMs) with human values is essential for their safe deployment and widespread adoption. Current LLM safety benchmarks often focus solely on the refusal of individual problematic queries, which overlooks the importance of the context in which the query occurs and may cause undesired refusal of queries under safe contexts, diminishing user experience. Addressing this gap, we introduce CASE-Bench, a Context-Aware SafEty Benchmark that integrates context into safety assessments of LLMs. CASE-Bench assigns distinct, formally described contexts to categorized queries based on Contextual Integrity theory. Additionally, in contrast to previous studies which mainly rely on majority voting from just a few annotators, we recruited enough annotators to ensure the detection of statistically significant differences among the experimental conditions based on power analysis. Our extensive analysis using CASE-Bench on various open-source and commercial LLMs reveals a substantial and significant influence of context on human judgments ($p<0.0001$ from a z-test), underscoring the necessity of context in safety evaluations. We also identify notable mismatches between human judgments and LLM responses, particularly in commercial models within safe contexts. Code and data used in the paper are available at https://anonymous.4open.science/r/CASEBench-D5DB.
Paperid:2556
Authors:Xuying Ning · Dongqi Fu · Tianxin Wei · Wujiang Xu · Jingrui He
Abstract:
Real-world multimodal data, such as e-commerce pages and scientific documents, usually exhibit complex structural relationships beyond traditional one-to-one mappings like image-caption pairs. Entities across modalities interact in intricate ways, with images and text forming diverse interconnections through contextual dependencies and co-references. Graphs provide powerful structural information for modeling intra-modal and inter-modal relationships. However, previous works fail to distinguish multi-hop neighbors and treat the graph as a standalone modality, which fragments the overall understanding. This limitation presents two key challenges in multimodal learning: (1) integrating structural information from multi-hop neighbors into foundational models, and (2) fusing modality-specific information in a principled manner. To address these challenges, we revisit the role of graphs in multimodal learning within the era of foundation models and propose Graph4MM, a graph-based multimodal learning framework. To be specific, we introduce Hop-Diffused Attention, which integrates multi-hop structural information into self-attention through causal masking and hop diffusion, enabling fine-grained relational modeling. Furthermore, we design MM-QFormer, a multi-mapping querying transformer for cross-modal fusion. Through theoretical and empirical analysis, we show that leveraging structures to integrate both intra- and inter-modal interactions improves multimodal understanding beyond treating them as a standalone modality. Experiments on both generative and discriminative tasks show that Graph4MM outperforms larger VLMs, LLMs, and multimodal graph baselines, achieving a 6.93% average improvement, demonstrating the effectiveness of graph integration.
Paperid:2557
Authors:Cosimo Gregucci · Bo Xiong · Daniel Hernández · Lorenzo Loconte · Pasquale Minervini · Steffen Staab · Antonio Vergari
Abstract:
Complex query answering (CQA) on knowledge graphs (KGs) is gaining momentum as a challenging reasoning task. In this paper, we show that the current benchmarks for CQA might not be as complex as we think, as the way they are built distorts our perception of progress in this field. For example, we find that in these benchmarks most queries (up to 98% for some query types) can be reduced to simpler problems, e.g., link prediction, where only one link needs to be predicted. The performance of state-of-the-art CQA models drops significantly when such models are evaluated on queries that cannot be reduced to easier types. Thus, we propose a set of more challenging benchmarks, composed of queries that require models to reason over multiple hops and better reflect the construction of real-world KGs. In a systematic empirical investigation, the new benchmarks show that current CQA methods leave much to be desired.
Paperid:2558
Authors:Yining Pan · Qiongjie Cui · Xulei Yang · Na Zhao
Abstract:
LiDAR-based 3D panoptic segmentation often struggles with the inherent sparsity of data from LiDAR sensors, which makes it challenging to accurately recognize distant or small objects. Recently, a few studies have sought to overcome this challenge by integrating LiDAR inputs with camera images, leveraging the rich and dense texture information provided by the latter. While these approaches have shown promising results, they still face challenges, such as misalignment during data augmentation and the reliance on post-processing steps. To address these issues, we propose Image-Assists-LiDAR (IAL), a novel multi-modal 3D panoptic segmentation framework. In IAL, we first introduce a modality-synchronized data augmentation strategy, PieAug, to ensure alignment between LiDAR and image inputs from the start. Next, we adopt a transformer decoder to directly predict panoptic segmentation results. To effectively fuse LiDAR and image features into tokens for the decoder, we design a Geometric-guided Token Fusion (GTF) module. Additionally, we leverage the complementary strengths of each modality as priors for query initialization through a Prior-based Query Generation (PQG) module, enhancing the decoder’s ability to generate accurate instance masks. Our IAL framework achieves state-of-the-art performance compared to previous multi-modal 3D panoptic segmentation methods on two widely used benchmarks. We will release our code upon paper acceptance.
Paperid:2559
Authors:Xiao Han · Xiangchao Li · Qing Yang
Abstract:
In this paper, we investigate deep feedforward neural networks with random weights. The input data matrix $\boldsymbol{X}$ is drawn from a Gaussian mixture model. We demonstrate that certain eigenvalues of the conjugate kernel and neural tangent kernel may lie outside the support of their limiting spectral measures in the high-dimensional regime. The existence and asymptotic positions of such isolated eigenvalues are rigorously analyzed. Furthermore, we provide a precise characterization of the entrywise limit of the projection matrix onto the eigenspace associated with these isolated eigenvalues. Our findings reveal that the eigenspace captures inherent group features present in $\boldsymbol{X}$. This study offers a quantitative analysis of how group features from the input data evolve through hidden layers in randomly weighted neural networks.
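A toy numerical illustration of the phenomenon this abstract describes, namely an isolated conjugate-kernel eigenvalue whose eigenvector aligns with the mixture's group structure (the parameters and the ReLU/centering choices are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 400, 50, 200
y = np.repeat([1.0, -1.0], n // 2)                    # two hidden groups
X = 5.0 * y[:, None] * np.eye(d)[0] + rng.normal(size=(n, d))  # Gaussian mixture rows

W = rng.normal(size=(m, d)) / np.sqrt(d)              # random first-layer weights
F = np.maximum(X @ W.T, 0.0)                          # ReLU hidden features
F = F - F.mean(axis=0)                                # center the features
K = F @ F.T / m                                       # conjugate kernel (n x n)

eigvals, eigvecs = np.linalg.eigh(K)
top_vec = eigvecs[:, -1]
corr = abs(top_vec @ y) / (np.linalg.norm(top_vec) * np.linalg.norm(y))
gap = eigvals[-1] / eigvals[-2]
# The top eigenvalue separates from the bulk, and its eigenvector
# correlates strongly with the group labels y.
print(round(gap, 1), round(corr, 2))
```

With strong enough class separation the spike is well outside the bulk, matching the qualitative claim about group features being captured by the eigenspace.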
Paperid:2560
Authors:Minqi Yu · Jinduo Liu · Junzhong Ji
Abstract:
Deep models are increasingly used to analyze brain graphs for the diagnosis and understanding of brain diseases. However, due to multi-site data aggregation and individual differences, brain graph datasets exhibit widespread distribution shifts, which impair the model’s generalization ability to the test set, thereby limiting the performance of existing methods. To address these issues, we propose a Causally Invariance-aware Augmentation for brain Graph Contrastive Learning, called CIA-GCL. This method first generates a brain graph by extracting node features based on the topological structure. Then, a learnable brain invariant subgraph is identified based on a causal decoupling approach to capture the maximum label-related invariant information with invariant learning. Around this invariant subgraph, we design a novel invariance-aware augmentation strategy to generate meaningful augmented samples for graph contrastive learning. Finally, the extracted invariant subgraph is utilized for brain disease classification, effectively mitigating distribution shifts while also identifying critical local graph structures, enhancing the model’s interpretability. Experiments on three real-world brain disease datasets demonstrate that our method achieves state-of-the-art performance, effectively generalizes to multi-site brain datasets, and provides a degree of interpretability.
Paperid:2561
Authors:Gaspard Lambrechts · Damien Ernst · Aditya Mahajan
Abstract:
In reinforcement learning for partially observable environments, many successful algorithms were developed within the asymmetric learning paradigm. This paradigm leverages additional state information available at training time for faster learning. Although the proposed learning objectives are usually theoretically sound, these methods still lack a theoretical justification for their potential benefits. We propose such a justification for asymmetric actor-critic algorithms with linear function approximators by adapting a finite-time convergence analysis to this setting. The resulting finite-time bound reveals that the asymmetric critic eliminates an error term arising from aliasing in the agent state.
Paperid:2562
Authors:Saketh Bachu · Erfan Shayegani · Rohit Lal · Trishna Chakraborty · Arindam Dutta · Chengyu Song · Yue Dong · Nael Abu-Ghazaleh · Amit Roy-Chowdhury
Abstract:
Vision-language models (VLMs) have improved significantly in their capabilities, but their complex architecture makes their safety alignment challenging. In this paper, we reveal an uneven distribution of harmful information across the intermediate layers of the image encoder and show that skipping a certain set of layers and exiting early can increase the chance of the VLM generating harmful responses. We call this the “Image enCoder Early-exiT” (ICET) vulnerability. Our experiments across three VLMs: LLaVA-1.5, LLaVA-NeXT, and Llama 3.2 show that performing early exits from the image encoder significantly increases the likelihood of generating harmful outputs. To tackle this, we propose a simple yet effective modification of the Clipped-Proximal Policy Optimization (Clip-PPO) algorithm for performing layer-wise multi-modal RLHF for VLMs. We term this Layer-Wise PPO (L-PPO). We evaluate our L-PPO algorithm across three multi-modal datasets and show that it consistently reduces the harmfulness caused by early exits.
Paperid:2563
Authors:Weijie Tu · Weijian Deng · Dylan Campbell · Yu Yao · Jiyang Zheng · Tom Gedeon · Tongliang Liu
Abstract:
Can the relative performance of a pretrained large multimodal model (LMM) be predicted without access to labels? As LMMs proliferate, it becomes increasingly important to develop efficient ways to choose between them when faced with new data or tasks. The usual approach does the equivalent of giving the models an exam and marking them. We opt to avoid marking and the associated labor of determining the ground-truth answers. Instead, we explore other signals elicited and ascertain how well the models know their own limits, evaluating the effectiveness of these signals at unsupervised model ranking. We evaluate $45$ state-of-the-art LMMs (e.g., LLaVA) across $8$ visual question answering benchmarks, analyzing how well uncertainty-based metrics can predict relative model performance. Our findings show that uncertainty scores derived from softmax distributions provide a robust and consistent basis for ranking models across various tasks. This facilitates the ranking of LMMs on unlabeled data, providing a practical approach for selecting models for diverse target domains without requiring manual annotation.
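A hedged sketch of ranking models by a softmax-derived uncertainty score, with random logits standing in for real LMM outputs (`avg_max_softmax` is our illustrative choice of score, not necessarily the metric used in the paper):

```python
import numpy as np

def avg_max_softmax(logits):
    """Average maximum softmax probability over a batch: a simple
    uncertainty-based confidence score for one model."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return p.max(axis=1).mean()

rng = np.random.default_rng(0)
# Toy stand-ins for two models' answer logits on the same unlabeled questions:
# model A produces sharp (confident) logits, model B diffuse (hesitant) ones.
logits_a = rng.normal(size=(500, 4)) * 4.0
logits_b = rng.normal(size=(500, 4)) * 0.5
scores = {"A": avg_max_softmax(logits_a), "B": avg_max_softmax(logits_b)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['A', 'B'] -- the sharper-logit model ranks first
```

No labels are used anywhere: the ranking comes entirely from each model's own output distribution.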
Paperid:2564
Authors:Quang-Duy Tran · Bao Duong · Phuoc Nguyen · Thin Nguyen
Abstract:
Telling apart the cause and effect between two random variables with purely observational data is a challenging problem that finds applications in various scientific disciplines. A key principle utilized in this task is the algorithmic Markov condition, which postulates that the joint distribution, when factorized according to the causal direction, yields a more succinct codelength compared to the anticausal direction. Previous approaches approximate these codelengths by relying on simple functions or Gaussian processes (GPs) with easily evaluable complexity, trading off model fitness against computational complexity. To overcome these limitations, we propose leveraging the variational Bayesian learning of neural networks as an interpretation of the codelengths. Consequently, we can enhance model fitness while promoting the succinctness of the codelengths, avoiding the significant computational complexity of the GP-based approaches. Extensive experiments on both synthetic and real-world benchmarks in cause-effect identification demonstrate the effectiveness of our proposed method, surpassing the overall performance of related complexity-based and structural causal model regression-based approaches.
Paperid:2565
Authors:Matthew Smart · Alberto Bietti · Anirvan Sengupta
Abstract:
We introduce in-context denoising, a task that refines the connection between attention-based architectures and dense associative memory (DAM) networks, also known as modern Hopfield networks. Using a Bayesian framework, we show theoretically and empirically that certain restricted denoising problems can be solved optimally even by a single-layer transformer. We demonstrate that a trained attention layer processes each denoising prompt by performing a single gradient descent update on a context-aware DAM energy landscape, where context tokens serve as associative memories and the query token acts as an initial state. This one-step update yields better solutions than exact retrieval of either a context token or a spurious local minimum, providing a concrete example of DAM networks extending beyond the standard retrieval paradigm. Overall, this work solidifies the link between associative memory and attention mechanisms first identified by Ramsauer et al., and demonstrates the relevance of associative memory models in the study of in-context learning.
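The one-step-update connection can be checked directly for the standard modern-Hopfield energy of Ramsauer et al.: a single unit-step gradient descent update on the DAM energy, with context tokens as memories and the query as the initial state, reproduces softmax attention over the context tokens (the toy dimensions and inverse temperature below are ours):

```python
import numpy as np

def softmax(v):
    v = v - v.max()
    return np.exp(v) / np.exp(v).sum()

def energy(q, memories, beta):
    """Modern Hopfield / DAM energy with the rows of `memories` as patterns."""
    return -np.log(np.exp(beta * memories @ q).sum()) / beta + 0.5 * q @ q

def energy_grad(q, memories, beta):
    return q - memories.T @ softmax(beta * memories @ q)

rng = np.random.default_rng(0)
memories = rng.normal(size=(6, 4))   # context tokens as stored patterns
q = rng.normal(size=4)               # query token as the initial state
beta = 2.0

# One gradient-descent step with unit step size is exactly softmax attention
# over the context tokens (the gradient's +q term cancels the current state).
one_step = q - energy_grad(q, memories, beta)
attention = memories.T @ softmax(beta * memories @ q)
print(np.allclose(one_step, attention))  # True
```

The update is the concave-convex (CCCP) step for this energy, so it also never increases the energy.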
Paperid:2566
Authors:Yifan Fang · Yifei Fang · Ruizhe Chen · Haote Xu · Xinghao Ding · Yue Huang
Abstract:
Reconstruction-based anomaly detection requires an interpretable theoretical framework to ensure the validity of the detection process. We propose a novel test to detect anomalies in structural images via a discrete Fourier transform (DFT) under a factor model framework. Our test is constructed by comparing the DFT of weighted residuals to the zero spectrum expected under the null hypothesis of no anomaly. We also briefly present the asymptotic theory of our test. Furthermore, our test can detect a class of local anomalies at a rate of $H^{-1/2}W^{-1/2}$, where $H$ and $W$ are the height and width of the image, respectively. Based on our test, we derive a framework for effective reconstruction in unsupervised anomaly detection tasks. Specifically, by applying the proposed factor-model-based framework for anomaly detection, we propose a straightforward approach called the Demeaned Fourier Sparse (DFS) module to construct masks in the Fourier domain and utilize a distribution-free sampling method similar to the bootstrap. This enables both interpretability and effectiveness in our detection approach. Our evaluation demonstrates that the factor-model-based anomaly detection approach successfully generates valid masks for reconstruction-based anomaly detection tasks using the proposed distribution-free sampling technique, and performs anomaly detection efficiently.
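As a generic illustration of testing residual spectra against a white-noise null (not the authors' exact weighted-residual statistic), one can compare the maximum normalized periodogram ordinate of a residual image against a tail quantile for the maximum of approximately exponential ordinates:

```python
import numpy as np

def max_periodogram(residual):
    """Largest normalized periodogram ordinate of a residual image. Under a
    white-noise null each non-DC ordinate is roughly Exp(1) after
    normalization, so a very large maximum signals structured anomaly energy."""
    r = residual - residual.mean()
    spec = np.abs(np.fft.fft2(r)) ** 2 / (r.size * r.var())
    return spec.ravel()[1:].max()        # drop the DC term

rng = np.random.default_rng(0)
clean = rng.normal(size=(64, 64))        # residuals with no anomaly
anomalous = clean.copy()
anomalous[20:32, 20:32] += 3.0           # a small local structural defect

n = clean.size - 1
threshold = -np.log(1.0 - 0.99 ** (1.0 / n))  # ~1% level for the max of n Exp(1)
print(max_periodogram(clean) < threshold, max_periodogram(anomalous) > threshold)
```

The localized defect concentrates energy in a few low-frequency ordinates, pushing the maximum far above the null threshold.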
Paperid:2567
Authors:Nico Meyer · Christian Ufrecht · George Yammine · Georgios Kontes · Christopher Mutschler · Daniel Scherer
Abstract:
Benchmarking and establishing proper statistical validation metrics for reinforcement learning (RL) remain ongoing challenges, with no consensus established yet. The emergence of quantum computing and its potential applications in quantum reinforcement learning (QRL) further complicate benchmarking efforts. To enable valid performance comparisons and to streamline current research in this area, we propose a novel benchmarking methodology, based on a statistical estimator for sample complexity and a definition of statistical outperformance. Furthermore, our methodology casts doubt on some previous claims regarding the superiority of QRL. We conducted experiments on a novel benchmarking environment with flexible levels of complexity. While we still identify possible advantages, our findings are more nuanced overall. We discuss the potential limitations of these results and explore their implications for empirical research on quantum advantage in QRL.
Paperid:2568
Authors:Lan Feng · Fan Nie · Yuejiang Liu · Alexandre Alahi
Abstract:
We propose TAROT, a targeted data selection framework grounded in Optimal Transport theory. Previous targeted data selection methods primarily rely on influence-based greedy heuristics to enhance domain-specific performance. While effective on limited, unimodal data (i.e., data following a single pattern), these methods struggle as target data complexity increases. Specifically, in multimodal distributions, such heuristics fail to account for multiple inherent patterns, leading to suboptimal data selection. This work identifies two primary limitations: (i) the disproportionate impact of dominant feature components in high-dimensional influence estimation, and (ii) the restrictive linear additive assumptions in greedy selection strategies. To address these challenges, TAROT incorporates whitened feature distance to mitigate dominant feature bias, offering a more reliable measure of data influence. Building on this, TAROT leverages whitened feature distance to quantify and minimize the optimal transport distance between selected data and target domains. Notably, this minimization also facilitates the estimation of optimal selection ratios. We evaluate TAROT across multiple tasks, including semantic segmentation, motion prediction, and instruction tuning. Results consistently show that TAROT outperforms state-of-the-art methods, demonstrating its versatility across various deep learning tasks. Code is available in the supplementary.
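A rough sketch of combining a whitened feature distance with an exact OT cost between equal-size point clouds (brute-force matching over tiny sets here; the paper's estimator and scale surely differ, and both helper functions are our own):

```python
import numpy as np
from itertools import permutations

def fit_whitener(F, eps=1e-6):
    """Fit a ZCA whitening transform so dominant feature components do not
    swamp the distance computation. Returns a function applying it."""
    mu = F.mean(axis=0)
    cov = np.cov(F - mu, rowvar=False) + eps * np.eye(F.shape[1])
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return lambda G: (G - mu) @ W

def ot_distance(A, B):
    """Exact OT cost between two equal-size uniform point clouds: a
    minimum-cost perfect matching (brute force; fine only for tiny n)."""
    cost = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    rows = np.arange(len(A))
    return min(cost[rows, list(p)].mean() for p in permutations(rows))

rng = np.random.default_rng(0)
target = rng.normal(size=(6, 3)) + 3.0   # target-domain features
near = rng.normal(size=(6, 3)) + 3.0     # candidate pool close to the target
far = rng.normal(size=(6, 3)) - 3.0      # candidate pool far from the target
whiten = fit_whitener(np.vstack([target, near, far]))
d_near = ot_distance(whiten(near), whiten(target))
d_far = ot_distance(whiten(far), whiten(target))
print(d_near < d_far)  # the nearby pool has the smaller transport cost
```

Selection then amounts to preferring candidate subsets with the smaller whitened OT cost to the target; practical implementations would use a scalable OT solver rather than enumeration.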
Paperid:2569
Authors:YuQing Xie · Ameya Daigavane · Mit Kotak · Tess Smidt
Abstract:
$E(3)$-equivariant neural networks have demonstrated success across a wide range of 3D modelling tasks. A fundamental operation in these networks is the tensor product, which interacts two geometric features in an equivariant manner to create new features. Due to the high computational complexity of the tensor product, significant effort has been invested to optimize the runtime of this operation. For example, Luo et al. recently proposed the Gaunt tensor product (GTP), which promises a significant speedup. In this work, we provide a careful, systematic analysis of a number of tensor product operations. In particular, we emphasize that different tensor products are not performing the same operation, and the reported speedups typically come at the cost of expressivity. We introduce measures of expressivity and interactability to characterize these differences. In addition, we show that the original implementation of GTP can be greatly simplified by directly using a spherical grid, at no cost in asymptotic runtime. This spherical grid approach is faster on our benchmarks and speeds up actual training of the MACE interatomic potential by 30%. Finally, we provide the first systematic microbenchmarks of the various tensor product operations. We find that the theoretical runtime guarantees can differ wildly from empirical performance, demonstrating the need for careful application-specific benchmarking.
Paperid:2570
Authors:Montgomery Bohde · Mrunali Manjrekar · Runzhong Wang · Shuiwang Ji · Connor Coley
Abstract:
Mass spectrometry plays a fundamental role in elucidating the structures of unknown molecules and subsequent scientific discoveries. One formulation of the structure elucidation task is the conditional de novo generation of molecular structure given a mass spectrum. Toward a more accurate and efficient scientific discovery pipeline for small molecules, we present DiffMS, a formula-restricted encoder-decoder generative network that achieves state-of-the-art performance on this task. The encoder utilizes a transformer architecture and models mass spectra domain knowledge such as peak formulae and neutral losses, and the decoder is a discrete graph diffusion model restricted by the heavy-atom composition of a known chemical formula. To develop a robust decoder that bridges latent embeddings and molecular structures, we pretrain the diffusion decoder with fingerprint-structure pairs, which are available in virtually infinite quantities, compared to structure-spectrum pairs that number in the tens of thousands. Extensive experiments on established benchmarks show that DiffMS outperforms existing models on de novo molecule generation. We provide several ablations to demonstrate the effectiveness of our diffusion and pretraining approaches and show consistent performance scaling with increasing pretraining dataset size.
Paperid:2571
Authors:Jiayi Xin · Sukwon Yun · Jie Peng · Inyoung Choi · Jenna Ballard · Tianlong Chen · Qi Long
Abstract:
Modality fusion is a cornerstone of multimodal learning, enabling information integration from diverse data sources. However, existing approaches are limited by $\textbf{(a)}$ their focus on modality correspondences, which neglects heterogeneous interactions between modalities, and $\textbf{(b)}$ the fact that they output a single multimodal prediction without offering interpretable insights into the multimodal interactions present in the data. In this work, we propose $\texttt{I$^2$MoE}$ ($\underline{I}$nterpretable Multimodal $\underline{I}$nteraction-aware $\underline{M}$ixture-$\underline{o}$f-$\underline{E}$xperts), an end-to-end MoE framework designed to enhance modality fusion by explicitly modeling diverse multimodal interactions, as well as providing interpretation on a local and global level. First, $\texttt{I$^2$MoE}$ utilizes different interaction experts with weakly supervised interaction losses to learn multimodal interactions in a data-driven way. Second, $\texttt{I$^2$MoE}$ deploys a reweighting model that assigns importance scores for the output of each interaction expert, which offers sample-level and dataset-level interpretation. Extensive evaluation of medical and general multimodal datasets shows that $\texttt{I$^2$MoE}$ is flexible enough to be combined with different fusion techniques, consistently improves task performance, and provides interpretation across various real-world scenarios.
Paperid:2572
Authors:Jiaxin Ye · Boyuan Cao · Hongming Shan
Abstract:
How much can we infer about an emotional voice solely from an expressive face? This intriguing question holds great potential for applications such as virtual character dubbing and aiding individuals with expressive language disorders. Existing face-to-speech methods offer great promise in capturing identity characteristics but struggle to generate diverse vocal styles with emotional expression. In this paper, we explore a new task, termed emotional face-to-speech, aiming to synthesize emotional speech directly from expressive facial cues. To that end, we introduce DEmoFace, a novel generative framework that leverages a discrete diffusion transformer (DiT) with curriculum learning, built upon a multi-level neural audio codec. Specifically, we propose multimodal DiT blocks to dynamically align text and speech while tailoring vocal styles based on facial emotion and identity. To enhance training efficiency and generation quality, we further introduce a coarse-to-fine curriculum learning algorithm for multi-level token processing. In addition, we develop an enhanced predictor-free guidance to handle diverse conditioning scenarios, enabling multi-conditional generation and disentangling complex attributes effectively. Extensive experimental results demonstrate that DEmoFace generates more natural and consistent speech compared to baselines, even surpassing speech-driven methods.
Paperid:2573
Authors:Jan Ole Ernst · Aniket Chatterjee · Tim Franzmeyer · Axel Kuhn
Abstract:
Quantum optimal control is concerned with the realisation of desired dynamics in quantum systems, serving as a linchpin for advancing quantum technologies and fundamental research. Analytic approaches and standard optimisation algorithms do not yield satisfactory solutions for large quantum systems, and especially not for real-world quantum systems, which are open and noisy. We devise a physics-informed Reinforcement Learning (RL) algorithm that restricts the space of possible solutions. We incorporate priors about the desired time scales of the quantum state dynamics -- as well as realistic control signal limitations -- as constraints to the RL algorithm. These physics-informed constraints additionally improve computational scalability by facilitating parallel optimisation. We evaluate our method on three broadly relevant quantum systems and incorporate real-world complications, arising from dissipation and control signal perturbations. We achieve both higher fidelities -- which exceed 0.999 across all systems -- and better robustness to time-dependent perturbations and experimental imperfections than previous methods. Lastly, we demonstrate that incorporating multi-step feedback can yield solutions robust even to strong perturbations.
Paperid:2574
Authors:Zhiyong Wang · Jiahang Sun · Mingze Kong · Jize Xie · Qinghua Hu · John C. S. Lui · Zhongxiang Dai
Abstract:
The contextual multi-armed bandit (MAB) is a widely used framework for problems requiring sequential decision-making under uncertainty, such as recommendation systems. In applications involving a large number of users, the performance of contextual MAB can be significantly improved by facilitating collaboration among multiple users. This has been achieved by the clustering of bandits (CB) methods, which adaptively group the users into different clusters and achieve collaboration by allowing the users in the same cluster to share data. However, classical CB algorithms typically rely on numerical reward feedback, which may not be practical in certain real-world applications. For instance, in recommendation systems, it is more realistic and reliable to solicit preference feedback between pairs of recommended items rather than absolute rewards. To address this limitation, we introduce the first "clustering of dueling bandit algorithms" to enable collaborative decision-making based on preference feedback. We propose two novel algorithms: (1) Clustering of Linear Dueling Bandits (COLDB), which models the user reward functions as linear functions of the context vectors, and (2) Clustering of Neural Dueling Bandits (CONDB), which uses a neural network to model complex, non-linear user reward functions. Both algorithms are supported by rigorous theoretical analyses, demonstrating that user collaboration leads to improved regret bounds. Extensive empirical evaluations on synthetic and real-world datasets further validate the effectiveness of our methods, establishing their potential in real-world applications involving multiple users with preference-based feedback.
Paperid:2575
Authors:Yongxian Wei · Anke Tang · Li Shen · Zixuan Hu · Chun Yuan · Xiaochun Cao
Abstract:
Merging multiple expert models offers a promising approach for performing multi-task learning without accessing their original data. Existing methods attempt to alleviate task conflicts by sparsifying task vectors or promoting orthogonality among them. However, they overlook the fundamental requirement of model merging: ensuring the merged model performs comparably to task-specific models on respective tasks. We find these methods inevitably discard task-specific information that, while causing conflicts, is crucial for performance. Based on our findings, we frame model merging as a constrained optimization problem ($\textit{i.e.}$, minimizing the gap between the merged model and individual models, subject to the constraint of retaining shared knowledge) and solve it via adaptive projective gradient descent. Specifically, we align the merged model with individual models by decomposing and reconstituting the loss function, alleviating conflicts through $\textit{data-free}$ optimization of task vectors. To retain shared knowledge, we optimize this objective by projecting gradients within a $\textit{shared subspace}$ spanning all tasks. Moreover, we view merging coefficients as adaptive learning rates and propose a task-aware, training-free strategy. Experiments show that our plug-and-play approach consistently outperforms previous methods, achieving state-of-the-art results across diverse architectures and tasks in both vision and NLP domains.
Paperid:2576
Authors:Katherine Tieu · Dongqi Fu · Zihao Li · Ross Maciejewski · Jingrui He
Abstract:
Accurate predictions rely on the expressive power of graph deep learning frameworks like graph neural networks and graph transformers, where a positional encoding mechanism has become indispensable in recent state-of-the-art (SOTA) works to record canonical position information. However, current positional encodings are limited in at least three aspects: (1) most positional encodings are pre-defined, fixed functions, which are inadequate for adapting to complex attributed graphs; (2) a few pioneering works propose learnable positional encodings, but these are still limited to structural information, leaving real-world time-evolving topological and feature information untouched; (3) most positional encodings must be paired with a transformer's attention mechanism to fully release their power, where dense or relational attention is often unaffordable on large-scale structured data. Hence, we study the possibility of Learnable Spatial-Temporal Positional Encoding in an effective and efficient manner and propose a simple temporal link prediction model named L-STEP. Briefly, for L-STEP, we (1) prove the proposed positional learning scheme preserves graph properties from a spatial-temporal spectral viewpoint, (2) verify that MLPs can fully exploit the expressiveness of that encoding and reach Transformers' performance, (3) vary the initial positional encoding inputs to show robustness, (4) analyze the theoretical complexity and obtain lower empirical running time than SOTA methods, and (5) demonstrate its superior temporal link prediction performance on 13 classic datasets against 10 algorithms in both transductive and inductive settings, using 3 different sampling strategies. L-STEP also obtains leading performance on the newest large-scale TGB benchmark.
Paperid:2577
Authors:Dongyang Fan · Bettina Messmer · Nikita Doikov · Martin Jaggi
Abstract:
On-device LLMs have gained increasing attention for their ability to enhance privacy and provide a personalized user experience. To facilitate private learning with scarce data, Federated Learning has become a standard approach. However, it faces challenges such as computational resource heterogeneity and data heterogeneity among end users. We propose CoMiGS ($\textbf{Co}$llaborative learning with a $\textbf{Mi}$xture of $\textbf{G}$eneralists and $\textbf{S}$pecialists), the first approach to address both challenges. A key innovation of our method is the bi-level optimization formulation of the Mixture-of-Experts learning objective, where the router is optimized using a separate validation set to ensure alignment with the target distribution. We solve our objective with alternating minimization, for which we provide a theoretical analysis. Our method shares generalist experts across users while localizing a varying number of specialist experts, thereby adapting to users’ computational resources and preserving privacy. Through extensive experiments, we show CoMiGS effectively balances general and personalized knowledge for each token generation. We demonstrate that CoMiGS remains robust against overfitting—due to the generalists' regularizing effect—while adapting to local data through specialist expertise. We open source our codebase for collaborative LLMs.
Paperid:2578
Authors:Arthur Deng · Fang Wu · Karsten Householder · K. Garcia · Brian Trippe
Abstract:
Accurate estimation of mutational effects on protein-protein binding energies is an open problem with applications in structural biology and therapeutic design. Several deep learning methods for this task have been proposed, but they underperform more classical strategies based on empirical force-fields. In response, we introduce new transfer-learning techniques that allow advances in protein sequence modeling and folding stability prediction to enhance the accuracy of binding energy estimation and refinement. In particular, we propose to parameterize binding energy prediction as the difference of the predicted folding energy of the protein complex from the sum of the folding energies of the binding partners. We show that using a pre-trained inverse-folding model as a proxy for folding energy provides strong zero-shot performance, and can be fine-tuned with (1) copious folding energy measurements and (2) more-limited binding energy measurements. The resulting predictor, StaB-ddG, has desirable thermodynamic properties and delivers more accurate predictions than previous methods for protein-protein interaction energetics.
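The parameterization described above is a simple energy difference. The following toy sketch is purely illustrative: `folding_energy` here is a hypothetical stand-in for the score of a pre-trained inverse-folding model, not the authors' code.

```python
# Illustrative sketch only: `folding_energy` is a hypothetical stand-in for
# a pre-trained inverse-folding model used as a folding-energy proxy.

def folding_energy(sequence: str) -> float:
    # Toy proxy; a real system would score the sequence with a learned model.
    return -0.1 * len(sequence)

def binding_energy(complex_seq: str, partner_a: str, partner_b: str) -> float:
    # dG_bind = dG_fold(complex) - (dG_fold(A) + dG_fold(B))
    return folding_energy(complex_seq) - (
        folding_energy(partner_a) + folding_energy(partner_b)
    )

def ddg(mutant, wildtype) -> float:
    # Mutational effect on binding: ddG = dG_bind(mutant) - dG_bind(wild type)
    return binding_energy(*mutant) - binding_energy(*wildtype)
```

Under this parameterization, any improvement to the folding-energy proxy (e.g. fine-tuning on folding measurements) transfers directly to the binding-energy estimate.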
Paperid:2579
Authors:Xin Xu · Qiyun Xu · Tong Xiao · Tianhao Chen · Yuchen Yan · Jiaxin ZHANG · shizhe diao · Can Yang · Yang Wang
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in solving complex reasoning tasks, particularly in mathematics. However, the domain of physics reasoning presents unique challenges that have received significantly less attention. Existing benchmarks often fall short in evaluating LLMs’ abilities on the breadth and depth of undergraduate-level physics, underscoring the need for a comprehensive evaluation. To fill this gap, we introduce UGPhysics, a large-scale and diverse benchmark specifically designed to evaluate UnderGraduate-level Physics (UGPhysics) reasoning with LLMs. UGPhysics includes 5,520 undergraduate-level physics problems in both English and Chinese across 13 subjects with seven different answer types and four distinct physics reasoning skills, all rigorously screened for data leakage. Additionally, we develop a Model-Assistant Rule-based Judgment (MARJ) pipeline specifically tailored for assessing physics problems, ensuring accurate evaluation. Our evaluation of 31 leading LLMs shows that the highest overall accuracy, 49.8% (achieved by OpenAI-o1-mini), emphasizes the need for models with stronger physics reasoning skills, beyond math abilities. We hope UGPhysics, along with MARJ, will drive future advancements in AI for physics reasoning.
Paperid:2580
Authors:Aditya Desai · Shuo Yang · Alejandro Cuadron · Matei Zaharia · Joseph E Gonzalez · Ion Stoica
Abstract:
Leveraging long contexts is crucial for advanced AI systems, but attention computation poses a scalability challenge. While scaled dot-product attention (SDPA) exhibits token sparsity, i.e. only a few pivotal tokens significantly contribute to output, exploiting this sparsity remains challenging. Existing methods either suffer from quality degradation or require substantial additional resources. We show that identifying pivotal tokens is a Maximum Inner Product Search (MIPS) problem. However, existing MIPS solutions are not well-suited for SDPA, as they are not GPU-friendly and often underperform due to the separated query and key distributions. This paper introduces HashAttention, framing pivotal token identification as a recommendation problem. Given a query, HashAttention encodes keys and queries in Hamming space, capturing the required semantic similarity, using learned mapping functions. HashAttention efficiently identifies pivotal tokens for a given query using bitwise operations and computes attention using only these tokens, improving the overall attention efficiency. Trained on generic data, HashAttention reduces tokens used by up to $16\times$ with minimal quality loss, requiring only 32 bits of auxiliary memory per token. Sparsity can be further improved to $32\times$ through task-specific fine-tuning. On A100 GPU, at $32\times$ sparsity, incorporating HashAttention reduces attention latency by up to $4.3\times$ in GPT-FAST and $2.54\times$ in FlashDecode, and achieves up to $3.12\times$ higher throughput for GPT-FAST.
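As a rough illustration of the Hamming-space idea (not the paper's method: random hyperplanes stand in for the learned mapping functions), pivotal-token selection via bitwise operations might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_keys, n_bits = 16, 8, 32
keys = rng.normal(size=(n_keys, d))
query = rng.normal(size=d)
planes = rng.normal(size=(n_bits, d))  # random stand-in for learned mappings

def signature(x: np.ndarray) -> int:
    # Pack the sign pattern of x against the hyperplanes into one integer,
    # i.e. a 32-bit auxiliary code per token.
    bits = planes @ x > 0
    return sum(int(b) << i for i, b in enumerate(bits))

key_sigs = [signature(k) for k in keys]
q_sig = signature(query)

# Hamming distance via XOR + popcount; attend only to the closest (pivotal) keys.
dists = [bin(s ^ q_sig).count("1") for s in key_sigs]
pivotal = np.argsort(dists)[:2]
```

Full attention would then be computed over `pivotal` only; the learned mappings in the paper are what make the bit codes track actual attention scores.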
Paperid:2581
Authors:Hongyao Tang · Johan Obando Ceron · Pablo Samuel Castro · Aaron Courville · Glen Berseth
Abstract:
Plasticity, or the ability of an agent to adapt to new tasks, environments, or distributions, is crucial for continual learning. In this paper, we study the loss of plasticity in deep continual RL through the lens of churn: network output variability induced by the data in each training batch. We demonstrate that (1) the loss of plasticity is accompanied by the exacerbation of churn due to the gradual rank decrease of the Neural Tangent Kernel (NTK) matrix; (2) reducing churn helps prevent rank collapse and adjusts the step size of regular RL gradients adaptively. Moreover, we introduce Continual Churn Approximated Reduction (CCHAIN) and demonstrate that it improves learning performance and outperforms baselines in a range of continual learning environments on the OpenAI Gym Control and ProcGen benchmarks.
Paperid:2582
Authors:Yang Zheng · Wen Li · Zhaoqiang Liu
Abstract:
Inverse problems (IPs) involve reconstructing signals from noisy observations. Traditional approaches often rely on handcrafted priors, which can fail to capture the complexity of realworld data. The advent of pre-trained generative models has introduced new paradigms, offering improved reconstructions by learning rich priors from data. Among these, diffusion models (DMs) have emerged as a powerful framework, achieving remarkable reconstruction performance across numerous IPs. However, existing DM-based methods frequently encounter issues such as heavy computational demands and suboptimal convergence. In this work, building upon the idea of the recent work DMPlug~\cite{wang2024dmplug}, we propose two novel methods, DMILO and DMILO-PGD, to address these challenges. Our first method, DMILO, employs intermediate layer optimization (ILO) to alleviate the memory burden inherent in DMPlug. Additionally, by introducing sparse deviations, we expand the range of DMs, enabling the exploration of underlying signals that may lie outside the range of the diffusion model. We further propose DMILO-PGD, which integrates ILO with projected gradient descent (PGD), thereby reducing the risk of suboptimal convergence. We provide an intuitive theoretical analysis of our approach under appropriate conditions and validate its superiority through extensive experiments on diverse image datasets, encompassing both linear and nonlinear IPs. Our results demonstrate significant performance gains over state-of-the-art methods, highlighting the effectiveness of DMILO and DMILO-PGD in addressing common challenges in DM-based IP solvers.
Paperid:2583
Authors:Lakshmi Nair · Ian Trase · J. Kim
Abstract:
We present a novel reasoning approach called Flow-of-Options (FoO), designed to address intrinsic biases in Large Language Models (LLMs). FoO enables LLMs to systematically explore a diverse range of possibilities in their reasoning, as demonstrated by an FoO-based agentic system for autonomously solving Machine Learning tasks. Our framework outperforms state-of-the-art baselines, achieving improvements of 38.2% -- 69.2% on standard data science tasks, and 37.4% -- 47.9% on therapeutic chemistry tasks. With an overall operation cost under $1 per task, our framework is well-suited for cost-sensitive applications. Beyond classification and regression tasks, we illustrate the broader applicability of our FoO-based agentic system to additional tasks such as reinforcement learning and image generation.
Paperid:2584
Authors:Junbo Yin · Chao Zha · Wenjia He · Chencheng Xu · Xin Gao
Abstract:
Existing PLMs generate protein sequences based on a single-condition constraint from a specific modality, struggling to simultaneously satisfy multiple constraints across different modalities. In this work, we introduce CFP-GEN, a novel diffusion language model for Combinatorial Functional Protein GENeration. CFP-GEN facilitates de novo protein design by integrating multimodal conditions with functional, sequence, and structural constraints. Specifically, an Annotation-Guided Feature Modulation (AGFM) module is introduced to dynamically adjust the protein feature distribution based on composable functional annotations, e.g., GO terms, IPR domains and EC numbers. Meanwhile, the Residue-Controlled Functional Encoding (RCFE) module captures residue-wise interaction to ensure more precise control. Additionally, off-the-shelf 3D structure encoders can be seamlessly integrated to impose geometric constraints. We demonstrate that CFP-GEN enables high-throughput generation of novel proteins with functionality comparable to natural proteins, while achieving a high success rate in designing multifunctional proteins.
Paperid:2585
Authors:Weikang Qiu · Zheng Huang · Haoyu Hu · Aosong Feng · Yujun Yan · ZHITAO YING
Abstract:
Decoding functional magnetic resonance imaging (fMRI) signals into text has been a key challenge in the neuroscience community, with the potential to advance brain-computer interfaces and uncover deeper insights into brain mechanisms. However, existing approaches often struggle with suboptimal predictive performance, limited task variety, and poor generalization across subjects. In response to this, we propose MindLLM, a model designed for subject-agnostic and versatile fMRI-to-text decoding. MindLLM consists of an fMRI encoder and an off-the-shelf LLM. The fMRI encoder employs a neuroscience-informed attention mechanism, which is capable of accommodating subjects with varying input shapes and thus achieves high-performance subject-agnostic decoding. Moreover, we introduce Brain Instruction Tuning (BIT), a novel approach that enhances the model’s ability to capture diverse semantic representations from fMRI signals, facilitating more versatile decoding. We evaluate MindLLM on comprehensive fMRI-to-text benchmarks. Results demonstrate that our model outperforms the baselines in downstream tasks, unseen subject generalization, and novel task adaptation. Furthermore, the attention patterns in MindLLM provide interpretable insights into its decision-making process.
Paperid:2586
Authors:Thibault de Surrel · Fabien Lotte · Sylvain Chevallier · Florian Yger
Abstract:
Circular and non-flat data distributions are prevalent across diverse domains of data science, yet their specific geometric structures often remain underutilized in machine learning frameworks. A principled approach to accounting for the underlying geometry of such data is pivotal, particularly when extending statistical models like the pervasive Gaussian distribution. In this work, we tackle these issues by focusing on the manifold of symmetric positive definite (SPD) matrices, a key object in information geometry. We introduce a non-isotropic wrapped Gaussian by leveraging the exponential map, derive theoretical properties of this distribution, and propose a maximum likelihood framework for parameter estimation. Furthermore, we reinterpret established classifiers on SPD matrices through a probabilistic lens and introduce new classifiers based on the wrapped Gaussian model. Experiments on synthetic and real-world datasets demonstrate the robustness and flexibility of this geometry-aware distribution, underscoring its potential to advance manifold-based data analysis. This work lays the groundwork for extending classical machine learning and statistical methods to more complex and structured data.
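A minimal sketch of the wrapping construction, under our own simplifying assumptions (2x2 SPD matrices, the affine-invariant exponential map Exp_P(V) = P^{1/2} exp(P^{-1/2} V P^{-1/2}) P^{1/2}, and a simple symmetric-Gaussian tangent vector rather than the paper's exact parameterization):

```python
import numpy as np

def sym_apply(A, f):
    # Apply a scalar function to a symmetric matrix via eigendecomposition.
    w, V = np.linalg.eigh(A)
    return V @ np.diag(f(w)) @ V.T

rng = np.random.default_rng(0)
P = np.array([[2.0, 0.5], [0.5, 1.0]])  # base SPD point (the "mean")
P_half = sym_apply(P, np.sqrt)
P_half_inv = np.linalg.inv(P_half)

def sample_wrapped(scale=0.3):
    # Draw a symmetric Gaussian tangent vector and wrap it onto the manifold
    # through the exponential map at P; the result is again SPD by construction.
    A = rng.normal(scale=scale, size=(2, 2))
    V = (A + A.T) / 2
    inner = sym_apply(P_half_inv @ V @ P_half_inv, np.exp)
    return P_half @ inner @ P_half

X = sample_wrapped()
```

Because the sample is the image of a tangent vector under the exponential map, it stays on the SPD manifold no matter how the tangent covariance is chosen, which is what makes a non-isotropic extension natural.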
Paperid:2587
Authors:Qian Qi
Abstract:
We establish a continuous-time framework for analyzing Deep Q-Networks (DQNs) via stochastic control and Forward-Backward Stochastic Differential Equations (FBSDEs). Considering a continuous-time Markov Decision Process (MDP) driven by a square-integrable martingale, we analyze DQN approximation properties. We show that DQNs can approximate the optimal Q-function on compact sets with arbitrary accuracy and high probability, leveraging residual network approximation theorems and large deviation bounds for the state-action process. We then analyze the convergence of a general Q-learning algorithm for training DQNs in this setting, adapting stochastic approximation theorems. Our analysis emphasizes the interplay between DQN layer count, time discretization, and the role of viscosity solutions in addressing potential non-smoothness of the optimal Q-function. This work bridges deep reinforcement learning and stochastic control, offering insights into DQNs in continuous-time settings, relevant for applications with physical systems or high-frequency data.
Paperid:2588
Authors:Gonçalo Paulo · Alex Mallen · Caden Juang · Nora Belrose
Abstract:
While the activations of neurons in deep neural networks usually do not have a simple human-understandable interpretation, sparse autoencoders (SAEs) can be used to transform these activations into a higher-dimensional latent space which can be more easily interpretable. However, SAEs can have millions of distinct latents, making it infeasible for humans to manually interpret each one. In this work, we build an open-source automated pipeline to generate and evaluate natural language interpretations for SAE latents using LLMs. We test our framework on SAEs of varying sizes, activation functions, and losses, trained on two different open-weight LLMs. We introduce five new techniques to score the quality of interpretations that are cheaper to run than the previous state of the art. One of these techniques, intervention scoring, evaluates the interpretability of the effects of intervening on a latent, which we find explains latents that are not recalled by existing methods. We propose guidelines for generating better interpretations that remain valid for a broader set of activating contexts, and discuss pitfalls with existing scoring techniques. Our code is available at https://anonymous.4open.science/r/interpreting_latents/.
Paperid:2589
Authors:Ilias Diakonikolas · Mingchen Ma · Lisheng Ren · Christos Tzamos
Abstract:
We study the task of Multiclass Linear Classification (MLC) in the distribution-free PAC model with Random Classification Noise (RCN). Specifically, the learner is given a set of labeled examples $(x, y)$, where $x$ is drawn from an unknown distribution on $R^d$ and the labels are generated by a multiclass linear classifier corrupted with RCN. That is, the label $y$ is flipped from $i$ to $j$ with probability $H_{ij}$ according to a known noise matrix $H$ with non-negative separation $\sigma := \min_{i \neq j} H_{ii}-H_{ij}$. The goal is to compute a hypothesis with small 0-1 error. For the special case of two labels, prior work has given polynomial-time algorithms achieving the optimal error. Surprisingly, little is known about the complexity of this task even for three labels. As our main contribution, we show that the complexity of MLC with RCN becomes drastically different in the presence of three or more labels. Specifically, we prove super-polynomial Statistical Query (SQ) lower bounds for this problem. In more detail, even for three labels and constant separation, we give a super-polynomial lower bound on the complexity of any SQ algorithm achieving optimal error. For a larger number of labels and smaller separation, we show a super-polynomial SQ lower bound even for the weaker goal of achieving any constant factor approximation to the optimal loss or even beating the trivial hypothesis.
Paperid:2590
Authors:George Orfanides · Tim Hoheisel · Marwa El Halabi
Abstract:
Submodular functions, defined on continuous or discrete domains, arise in numerous applications. We study the minimization of the difference of submodular (DS) functions over both domains, extending prior work restricted to set functions. We show that all functions on discrete domains and all smooth functions on continuous domains are DS. For discrete domains, we observe that DS minimization is equivalent to minimizing the difference of two convex (DC) functions, as in the set function case. We propose a novel variant of the DC Algorithm (DCA) and apply it to the resulting DC Program, obtaining comparable theoretical guarantees as in the set function case. The algorithm can be applied to continuous domains via discretization. Experiments demonstrate that our method outperforms baselines in integer compressive sensing and integer least squares.
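The reduction to a DC program can be illustrated with the textbook DCA iteration (a generic one-dimensional sketch, not the paper's variant): linearize the concave part at the current iterate, then solve the resulting convex subproblem.

```python
import numpy as np

def dca(grad_h, argmin_subproblem, x0, iters=60):
    # DCA for f = g - h with g, h convex: set y_k = grad h(x_k), then
    # x_{k+1} = argmin_x g(x) - y_k * x (a convex subproblem).
    x = x0
    for _ in range(iters):
        x = argmin_subproblem(grad_h(x))
    return x

# Nonconvex example: f(x) = x^4 - 2x^2, split as g(x) = x^4, h(x) = 2x^2.
# The subproblem argmin_x x^4 - y*x solves 4x^3 = y, i.e. x = cbrt(y/4).
x_star = dca(grad_h=lambda x: 4.0 * x,
             argmin_subproblem=lambda y: np.cbrt(y / 4.0),
             x0=0.5)
# the iterates climb toward the stationary point x = 1, a minimizer of f
```

Each DCA step decreases f, which is where the algorithm's monotonic-descent guarantee comes from; the paper's contribution is a variant of this scheme with guarantees for the DS setting.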
Paperid:2591
Authors:Zhe Wang · Jiaxin Shi · Nicolas Heess · Arthur Gretton · Michalis Titsias
Abstract:
Autoregressive models (ARMs) have become the workhorse for sequence generation tasks, since many problems can be modeled as next-token prediction. While there appears to be a natural ordering for text (i.e., left-to-right), for many data types, such as graphs, the canonical ordering is less obvious. To address this problem, we introduce a variant of ARM that generates high-dimensional data using a probabilistic ordering that is sequentially inferred from data. This model incorporates a trainable probability distribution, referred to as an order-policy, that dynamically decides the autoregressive order in a state-dependent manner. To train the model, we introduce a variational lower bound on the exact log-likelihood, which we optimize with stochastic gradient estimation. We demonstrate experimentally that our method can learn meaningful autoregressive orderings in image and graph generation. On the challenging domain of molecular graph generation, we achieve state-of-the-art results on the QM9 and ZINC250k benchmarks, evaluated using the Fréchet ChemNet Distance (FCD).
Paperid:2592
Authors:Taeyoung Yun · Kiyoung Om · Jaewoo Lee · Sujin Yun · Jinkyoo Park
Abstract:
Optimizing high-dimensional and complex black-box functions is crucial in numerous scientific applications. While Bayesian optimization (BO) is a powerful method for sample-efficient optimization, it struggles with the curse of dimensionality and scaling to thousands of evaluations. Recently, leveraging generative models to solve black-box optimization problems has emerged as a promising framework. However, those methods often underperform compared to BO methods due to limited expressivity and the difficulty of uncertainty estimation in high-dimensional spaces. To overcome these issues, we introduce \textbf{DiBO}, a novel framework for solving high-dimensional black-box optimization problems. Our method iterates between two stages. First, we train a diffusion model to capture the data distribution and an ensemble of proxies to predict function values with uncertainty quantification. Second, we cast candidate selection as a posterior inference problem to balance exploration and exploitation in high-dimensional spaces. Concretely, we fine-tune diffusion models to amortize posterior inference. Extensive experiments demonstrate that our method outperforms state-of-the-art baselines across various synthetic and real-world black-box optimization tasks. Our code is publicly available \href{https://anonymous.4open.science/r/DiBO-E486}{here}.
Paperid:2593
Authors:Dongjae Jeon · Dueun Kim · Albert No
Abstract:
In this paper, we introduce a geometric framework to analyze memorization in diffusion models through the sharpness of the log probability density. We mathematically justify a previously proposed score-difference-based memorization metric by demonstrating its effectiveness in quantifying sharpness. Additionally, we propose a novel memorization metric that captures sharpness at the initial stage of image generation in latent diffusion models, offering early insights into potential memorization. Leveraging this metric, we develop a mitigation strategy that optimizes the initial noise of the generation process using a sharpness-aware regularization term.
Paperid:2594
Authors:Xin Su · Man Luo · Kris Pan · Tien Pei Chou · Vasudev Lal · Phillip Howard
Abstract:
Multimodal retrieval-augmented generation (RAG) plays a crucial role in domains such as knowledge-based visual question answering (KB-VQA), where models should effectively integrate additional knowledge to generate a response. However, existing vision and language models (VLMs) are not inherently designed for context-augmented generation, limiting their effectiveness in such tasks. While synthetic data generation has recently gained attention for training large VLMs, its application for context-augmented generation remains underexplored. To address this gap, we introduce SKVQA, a large-scale synthetic multimodal dataset containing over 2 million visual question-answer pairs, each associated with external knowledge sources to determine the final answer. Compared to previous datasets, SKVQA exhibits 11× more unique questions, greater domain diversity, and a broader spectrum of image sources. Through human evaluations, we confirm the high quality of the generated question-answer pairs and their contextual relevance. Extensive experiments show that SKVQA serves both as a challenging benchmark for knowledge-based VQA and as an effective training resource for adapting generative multimodal models to context-augmented generation. Our results further indicate that models trained on SKVQA demonstrate enhanced generalization in both context-aware VQA and multimodal RAG settings.
Paperid:2595
Authors:Chamika Sudusinghe · Gerasimos Gerogiannis · Damitha Lenadora · Charles Block · Josep Torrellas · Charith Mendis
Abstract:
Sparse tensor programs are essential in deep learning and graph analytics, driving the need for optimized processing. To meet this demand, specialized hardware accelerators are being developed. Optimizing these programs for accelerators is challenging for two reasons: program performance is highly sensitive to variations in sparse inputs, and early-stage accelerators rely on expensive simulators. Therefore, ML-based cost models used for optimizing such programs on general-purpose hardware are often ineffective for early-stage accelerators, as they require large datasets for proper training. To this end, we introduce COGNATE, a novel framework that leverages inexpensive data samples from general-purpose hardware (e.g., CPUs) to train cost models, followed by few-shot fine-tuning on emerging hardware. COGNATE exploits the homogeneity of input features across hardware platforms while effectively mitigating heterogeneity, enabling cost model training with just 5% of the data samples needed by accelerator-specific models to achieve comparable performance. We conduct extensive experiments to demonstrate that COGNATE outperforms existing techniques, achieving average speedups of 1.47× (up to 5.46×) for SpMM and 1.39× (up to 4.22×) for SDDMM.
Paperid:2596
Authors:Zirui Liu · Jiatong Li · Yan Zhuang · Qi Liu · Shuanghong Shen · Jie Ouyang · Mingyue Cheng · Shijin Wang
Abstract:
Arena-based evaluation is a fundamental yet significant evaluation paradigm for modern AI models, especially large language models (LLMs). Existing frameworks based on the Elo rating system suffer from an inevitable instability problem due to ranking inconsistency and a lack of attention to the varying abilities of annotators. In this paper, we introduce a novel stable arena framework that addresses these issues by enhancing the Elo rating system. Specifically, we replace the iterative update method with a Maximum Likelihood Estimation (MLE) approach, m-ELO, and provide theoretical proof of the consistency and stability of the MLE approach for model ranking. Additionally, we propose am-ELO, which modifies the Elo rating's probability function to incorporate annotator abilities, enabling the simultaneous estimation of model scores and annotator reliability. Experiments demonstrate that this method ensures stability, proving that our framework offers a more robust, accurate, and stable evaluation method for LLMs.
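The MLE idea behind m-ELO can be sketched as a joint Bradley-Terry fit over all pairwise outcomes. The gradient-ascent implementation below is our own illustration (without the annotator-ability extension of am-ELO), not the paper's code.

```python
import numpy as np

def fit_mle_elo(wins, lr=0.5, steps=2000):
    # wins[i, j] = number of times model i beat model j.
    # Maximize sum_{i,j} wins[i, j] * log sigmoid(s_i - s_j) over scores s.
    n = wins.shape[0]
    s = np.zeros(n)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(s[None, :] - s[:, None]))  # p[i, j] = P(i beats j)
        grad = (wins * (1 - p) - wins.T * p).sum(axis=1)
        s += lr * grad / max(wins.sum(), 1)
        s -= s.mean()  # scores are identifiable only up to a shift
    return s

wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]])
scores = fit_mle_elo(wins)
# model 0 dominates, so scores[0] > scores[1] > scores[2]
```

Because all comparisons are fit jointly, the estimate does not depend on the order in which battles arrive, which is one source of the stability the abstract refers to.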
Paperid:2597
Authors:Tianyi Zhang · Yu Cao · Dianbo Liu
Abstract:
Federated learning (FL), aimed at leveraging vast distributed datasets, confronts a crucial challenge: the heterogeneity of data across different silos. While previous studies have explored discrete representations to enhance model generalization across minor distributional shifts, these approaches often struggle to adapt to new data silos with significantly divergent distributions. In response, we have identified that models derived from FL exhibit markedly increased uncertainty when applied to data silos with unfamiliar distributions. Consequently, we propose an innovative yet straightforward iterative framework, termed \emph{Uncertainty-Based Extensible-Codebook Federated Learning (UEFL)}. This framework dynamically maps latent features to trainable discrete vectors, assesses the uncertainty, and specifically extends the discretization dictionary or codebook for silos exhibiting high uncertainty. Our approach aims to simultaneously enhance accuracy and reduce uncertainty by explicitly addressing the diversity of data distributions, all while maintaining minimal computational overhead in environments characterized by heterogeneous data silos. Through experiments conducted on various datasets, our method has demonstrated its superiority, achieving significant improvements in accuracy (by 3\%--22.1\%) and uncertainty reduction (by 38.83\%--96.24\%), thereby outperforming contemporary state-of-the-art methods.
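A toy sketch of the extensible-codebook mechanism, under our own assumptions about its shape (not the paper's implementation): distance to the nearest code acts as the uncertainty signal, and high-uncertainty latents extend the codebook.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(4, 8))  # four initial discrete vectors

def quantize(z, threshold=3.0):
    # Map a latent to its nearest code; if even the nearest code is far
    # (high uncertainty), append z itself as a new code.
    global codebook
    d = np.linalg.norm(codebook - z, axis=1)
    if d.min() > threshold:
        codebook = np.vstack([codebook, z])
        return len(codebook) - 1
    return int(d.argmin())

i_near = quantize(codebook[1] + 0.1)   # snaps to an existing code
i_far = quantize(codebook[0] + 10.0)   # triggers codebook extension
```

In the actual framework the codes are trainable and the uncertainty estimate is more principled, but the control flow (quantize, assess, extend) follows this pattern.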
Paperid:2598
Authors:Junkang Wu · xue wang · Zhengyi Yang · Jiancan Wu · Jinyang Gao · Bolin Ding · Xiang Wang · Xiangnan He
Abstract:
Aligning large language models (LLMs) with human preferences requires balancing policy optimization with computational stability. While recent offline methods like DPO and SimPO bypass reinforcement learning’s complexity, they face critical limitations: DPO relies on static reference models that degrade with policy updates, and SimPO assumes a uniform target reward margin that ignores instance-wise preference strength. We propose AlphaDPO, an adaptive preference optimization framework that dynamically reparameterizes the reference distribution to address these issues. Our key innovation lies in an implicit reference model $\hat{\pi}_{\text{ref}} \propto U(y|x)(\pi_\theta/\pi_{\text{ref}})^\alpha$, which interpolates between policy-driven specialization and uniform exploration while enabling instance-adaptive reward margins. Theoretically, we prove AlphaDPO implicitly controls the sequential KL divergence between iterative policy updates, ensuring stability even with poorly calibrated reference models. Empirically, AlphaDPO achieves state-of-the-art performance on AlpacaEval 2 (58.7\% LC win rate) and Arena-Hard (35.7\% win rate) across Mistral2-7B, Llama3-8B, and Gemma2-9B, demonstrating robust alignment without multi-stage training. Our work establishes adaptive reference reparameterization as a principled mechanism for preference optimization.
Paperid:2599
Authors:Meilin Chen · Jian Tian · Liang Ma · Di Xie · Weijie Chen · Jiang Zhu
Abstract:
Benchmark contamination has become a significant concern in the LLM evaluation community. Previous Agents-as-an-Evaluator methods address this issue by involving agents in the generation of questions. Despite their success, the biases in Agents-as-an-Evaluator methods remain largely unexplored. In this paper, we present a theoretical formulation of evaluation bias, providing valuable insights into designing unbiased evaluation protocols. Furthermore, we identify two types of bias in Agents-as-an-Evaluator through carefully designed probing tasks on a minimal Agents-as-an-Evaluator setup. To address these issues, we propose the Unbiased Evaluator, an evaluation protocol that delivers a more comprehensive, unbiased, and interpretable assessment of LLMs. Extensive experiments reveal significant room for improvement in current LLMs. Additionally, we demonstrate that the Unbiased Evaluator not only offers strong evidence of benchmark contamination but also provides interpretable evaluation results. Code will be released soon.
Paperid:2600
Authors:Xiangxin Zhou · xiao yi · Mingyu Li · Jiahan Li · Dongyu Xue · Zaixiang Zheng · Jianzhu Ma · Quanquan Gu
Abstract:
Cyclic peptides offer inherent advantages in pharmaceuticals. For example, cyclic peptides are more resistant to enzymatic hydrolysis compared to linear peptides and usually exhibit excellent stability and affinity. Although deep generative models have achieved great success in linear peptide design, several challenges prevent the development of computational methods for designing diverse types of cyclic peptides. These challenges include the scarcity of 3D structural data on target proteins and associated cyclic peptide ligands, the geometric constraints that cyclization imposes, and the involvement of nonstandard amino acids in cyclization. To address the above challenges, we introduce CpSDE, which consists of two key components: AtomSDE, a generative structure prediction model based on harmonic SDE, and ResRouter, a residue type predictor. Utilizing a routed sampling algorithm that alternates between these two models to iteratively update sequences and structures, CpSDE facilitates the generation of cyclic peptides. By employing explicit all-atom and bond modeling, CpSDE overcomes existing data limitations and is proficient in designing a wide variety of cyclic peptides. Our experimental results demonstrate that the cyclic peptides designed by our method exhibit reliable stability and affinity.
Paperid:2601
Authors:Yujin Han · Andi Han · Wei Huang · Chaochao Lu · Difan Zou
Abstract:
Despite the remarkable success of diffusion models (DMs) in data generation, they exhibit specific failure cases with unsatisfactory outputs. We focus on one such limitation: the ability of DMs to learn hidden rules between image features. Specifically, for image data with dependent features $\mathbf{x}$ and $\mathbf{y}$ (e.g., the height of the sun ($\mathbf{x}$) and the length of the shadow ($\mathbf{y}$)), we investigate whether DMs can accurately capture the inter-feature rule $p(\mathbf{y}|\mathbf{x})$. Empirical evaluations on mainstream DMs (e.g., Stable Diffusion 3.5) reveal consistent failures, such as inconsistent lighting-shadow relationships and mismatched object-mirror reflections. Inspired by these findings, we design four synthetic tasks with strongly correlated features to assess DMs' rule-learning abilities. Extensive experiments show that while DMs can identify coarse-grained rules, they struggle with fine-grained ones. Our theoretical analysis demonstrates that DMs trained via denoising score matching (DSM) exhibit constant errors in learning hidden rules, as the DSM objective is not compatible with rule conformity. To mitigate this, we introduce a common technique, incorporating additional classifier guidance during sampling, which achieves (limited) improvements. Our analysis reveals that the subtle signals of fine-grained rules are challenging for the classifier to capture, providing insights for future exploration.
Paperid:2602
Authors:Sen Xing · Muyan Zhong · Zeqiang Lai · Liangchen Li · Jiawen Liu · Yaohui Wang · Jifeng Dai · Wenhai Wang
Abstract:
In this work, we explore a cost-effective framework for multilingual image generation. We find that, unlike models tuned on high-quality images with multilingual annotations, leveraging text encoders pre-trained on widely available, noisy Internet image-text pairs significantly enhances data efficiency in text-to-image (T2I) generation across multiple languages. Based on this insight, we introduce MuLan (Multi-Language adapter), a lightweight language adapter with fewer than 20M parameters, trained alongside a frozen text encoder and image diffusion model. Compared to previous multilingual T2I models, this framework offers: (1) Cost efficiency. Using readily accessible English data and off-the-shelf multilingual text encoders minimizes the training cost; (2) High performance. Achieving comparable generation capabilities in over 110 languages with CLIP similarity scores nearly matching those in English (39.57 for English vs. 39.61 for other languages); and (3) Broad applicability. Seamlessly integrating with compatible community tools like LoRA, LCM, ControlNet, and IP-Adapter, expanding its potential use cases.
Paperid:2603
Authors:Liang CHEN · Xueting Han · Li Shen · Jing Bai · Kam-Fai Wong
Abstract:
Harmful fine-tuning (HFT), performed directly on open-source LLMs or through Fine-tuning-as-a-Service, breaks safety alignment and poses significant threats. Existing methods aim to mitigate HFT risks by learning robust representations on alignment data or making harmful data unlearnable, but they treat each data sample equally, leaving data vulnerability patterns understudied. In this work, we reveal that certain subsets of alignment data are consistently more prone to forgetting during HFT across different fine-tuning tasks and exhibit lower robustness compared to other subsets. Inspired by these findings, we propose Vulnerability-Aware Alignment (VAA), which calculates data vulnerability, partitions data into "vulnerable" and "invulnerable" groups, and encourages balanced learning using a group distributionally robust optimization (Group DRO) framework. Specifically, VAA learns an adversarial sampler that samples examples from the hard group and then applies group-dependent adversarial perturbations to the data during training, ultimately achieving convergence to a balanced learning equilibrium. Experiments across four fine-tuning tasks demonstrate that VAA significantly reduces harmful scores while preserving downstream task performance, outperforming state-of-the-art baselines.
Paperid:2604
Authors:Enes Altinisik · Safa Messaoud · Husrev Taha Sencar · Hassan Sajjad · Sanjay Chawla
Abstract:
Adversarial Training (AT) impacts different architectures in distinct ways: vision models gain robustness but face reduced generalization, encoder-based models exhibit limited robustness improvements with minimal generalization loss, and recent work in latent-space adversarial training demonstrates that decoder-based models achieve improved robustness by applying AT across multiple layers. We provide the first explanation for these trends by leveraging the manifold conjecture: off-manifold adversarial examples (AEs) enhance robustness, while on-manifold AEs improve generalization. We show that vision and decoder-based models exhibit low intrinsic dimensionality in earlier layers (favoring off-manifold AEs), whereas encoder-based models do so in later layers (favoring on-manifold AEs). Exploiting this property, we introduce SMAAT, which improves the scalability of AT for encoder-based models by perturbing the layer with the lowest intrinsic dimensionality. This reduces the projected gradient descent (PGD) chain length required for AE generation, cutting GPU time by 25–33% while significantly boosting robustness. We validate SMAAT across multiple tasks, including text generation, sentiment classification, safety filtering, and retrieval in RAG setups, demonstrating superior robustness with comparable generalization to standard training.
Paperid:2605
Authors:Bingshan Hu · Zhiming Huang · Tianyue Zhang · Mathias Lécuyer · Nidhi Hegde
Abstract:
We address differentially private stochastic bandit problems by exploring the deep connections among Thompson Sampling with Gaussian priors, Gaussian mechanisms, and Gaussian differential privacy (GDP). We propose DP-TS-UCB, a novel parametrized private bandit algorithm that enables trading off privacy and regret. DP-TS-UCB satisfies $\tilde{O}\left(T^{0.25(1-\alpha)}\right)$-GDP and enjoys an $O\left(K\ln^{\alpha+1}(T)/\Delta\right)$ regret bound, where $\alpha \in [0,1]$ controls the trade-off between privacy and regret. Theoretically, our DP-TS-UCB relies on anti-concentration bounds of Gaussian distributions and links exploration mechanisms in Thompson Sampling-based algorithms and Upper Confidence Bound-based algorithms, which may be of independent interest.
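For intuition, the Gaussian-prior Thompson Sampling core that DP-TS-UCB builds on can be sketched as follows. This shows only the vanilla sampling step, not the private algorithm: the privacy calibration (how the noise scale depends on $\alpha$ and $T$) is omitted.

```python
import numpy as np

def gaussian_ts_step(means, counts, rng, scale=1.0):
    # Thompson Sampling with Gaussian posteriors: draw one sample per arm
    # from N(mean_k, scale^2 / n_k) and play the argmax
    std = scale / np.sqrt(np.maximum(counts, 1.0))
    return int(np.argmax(means + rng.normal(size=len(means)) * std))
```

The connection exploited by the paper is that this posterior-sampling noise is itself Gaussian, so it can double as (part of) a Gaussian privacy mechanism when its variance is inflated appropriately.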
Paperid:2606
Authors:Xuantong Liu · Shaozhe Hao · Xianbiao Qi · Tianyang Hu · JUN WANG · Rong Xiao · Yuan Yao
Abstract:
The success of large language models (LLMs) in text generation has inspired their application to image generation. However, existing methods either rely on specialized designs with inductive biases or adopt LLMs without fully exploring their potential in vision tasks. In this work, we systematically investigate the design space of LLMs for image generation and demonstrate that LLMs can achieve near state-of-the-art performance without domain-specific designs, simply by making proper choices in tokenization methods, modeling approaches, scan patterns, vocabulary design, and sampling strategies. We further analyze autoregressive models' learning and scaling behavior, revealing how larger models effectively capture more useful information than smaller ones. Additionally, we explore the inherent differences between text and image modalities, highlighting the potential of LLMs across domains. The exploration provides valuable insights to inspire more effective designs when applying LLMs to other domains. With extensive experiments, our proposed model, **ELM**, achieves an FID of 1.54 on 256$\times$256 ImageNet and an FID of 3.29 on 512$\times$512 ImageNet, demonstrating the powerful generative potential of LLMs in vision tasks.
Paperid:2607
Authors:Yijiang Li · Qingying Gao · Tianwei Zhao · Haiyun Lyu · Bingyang Wang · Haoran Sun · Robert Hawkins · Nuno Vasconcelos · Tal Golan · Dezhi Luo · Hokin Deng
Abstract:
While Multimodal Large Language Models (MLLMs) demonstrate impressive abilities over high-level perception and reasoning, their robustness in the wild still lags behind humans and exhibits diminished efficacy on simple tasks that are intuitive for humans. We examine the hypothesis that these deficiencies stem from the absence of core knowledge, the rudimentary cognitive abilities innate to humans from early childhood. To probe core knowledge representation in MLLMs, we draw from developmental cognitive sciences and develop a large-scale benchmark, the \textbf{CoreCognition dataset}, encompassing 12 core cognitive concepts. We evaluate 219 models with 10 different prompts, leading to a total of 2409 data points for analysis. Our findings reveal core knowledge deficits in early-developed core abilities, while models demonstrate human-comparable performance in high-level cognition. Moreover, we find that low-level abilities show little to no scaling, in stark contrast to high-level abilities. Finally, we introduce an evaluation technique, ``Concept Hacking'', through which we demonstrate that MLLMs do not genuinely advance toward core knowledge but instead rely on superficial patterns and shortcut learning as they scale.
Paperid:2608
Authors:Nikhil Pratap Ghanathe · Steve Wilton
Abstract:
Uncertainty quantification (UQ) provides a resource-efficient solution for on-device monitoring of tinyML models deployed remotely without access to true labels. However, existing UQ methods impose significant memory and compute demands, making them impractical for ultra-low-power, KB-sized tinyML devices. Prior work has attempted to reduce overhead by using early-exit ensembles to quantify uncertainty in a single forward pass, but these approaches still carry prohibitive costs. To address this, we propose QUTE, a novel resource-efficient early-exit-assisted ensemble architecture optimized for tinyML models. QUTE introduces additional output blocks at the final exit of the base network, distilling early-exit knowledge into these blocks to form a diverse yet lightweight ensemble. We show that QUTE delivers superior uncertainty quality on tiny models, achieving comparable performance on larger models with 59% smaller model sizes than the closest prior work. When deployed on a microcontroller, QUTE demonstrates a 31% reduction in latency on average. In addition, we show that QUTE excels at detecting accuracy-drop events, outperforming all prior works.
Paperid:2609
Authors:Yang Li · Jie Ma · Miguel Ballesteros · Yassine Benajiba · Graham Horwood
Abstract:
As large language models (LLMs) become increasingly versatile, numerous large-scale benchmarks have been developed to thoroughly assess their capabilities. These benchmarks typically consist of diverse datasets and prompts to evaluate different aspects of LLM performance. However, comprehensive evaluations on hundreds or thousands of prompts incur tremendous costs in terms of computation, money, and time. In this work, we investigate strategies to improve evaluation efficiency by selecting a subset of examples from each benchmark using a learned policy. Our approach models the dependencies across test examples, allowing accurate prediction of the evaluation outcomes for the remaining examples based on the outcomes of the selected ones. Consequently, we only need to acquire the actual evaluation outcomes for the selected subset. We rigorously explore various subset selection policies and introduce a novel RL-based policy that leverages the captured dependencies. Empirical results demonstrate that our approach significantly reduces the number of evaluation prompts required while maintaining accurate performance estimates compared to previous methods.
Paperid:2610
Authors:Kyowoon Lee · Jaesik Choi
Abstract:
Recent advances in diffusion-based generative modeling have demonstrated significant promise in tackling long-horizon, sparse-reward tasks by leveraging offline datasets. While these approaches have achieved promising results, their reliability remains inconsistent due to the inherent stochastic risk of producing infeasible trajectories, limiting their applicability in safety-critical applications. We identify that the primary cause of these failures is inaccurate guidance during the sampling procedure, and demonstrate the existence of manifold deviation by deriving a lower bound on the guidance gap. To address this challenge, we propose Local Manifold Approximation and Projection (LoMAP), a training-free method that projects the guided sample onto a low-rank subspace approximated from offline datasets, preventing infeasible trajectory generation. We validate our approach on standard offline reinforcement learning benchmarks that involve challenging long-horizon planning. Furthermore, we show that, as a standalone module, LoMAP can be incorporated into the hierarchical diffusion planner, providing further performance enhancements.
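A toy version of the projection step, assuming the "manifold" is approximated globally by the top principal components of the offline data (the paper's local, per-sample approximation is not reproduced here):

```python
import numpy as np

def lomap_project(x, offline_data, rank):
    # fit a rank-r affine subspace to the offline dataset via SVD and
    # project the guided sample onto it, pulling it back toward feasibility
    mu = offline_data.mean(axis=0)
    Vt = np.linalg.svd(offline_data - mu, full_matrices=False)[2][:rank]
    return mu + (x - mu) @ Vt.T @ Vt
```

In a guided sampler, a step like this would be applied to the intermediate sample after each guidance update, so the trajectory never drifts far off the data subspace.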
Paperid:2611
Authors:Jon Saad-Falcon · Adrian Lafuente · Shlok Natarajan · Nahum Maru · Hristo Todorov · Etash Guha · Estefany Kelly Buchanan · Mayee Chen · Neel Guha · Christopher Re · Azalia Mirhoseini
Abstract:
Inference-time techniques, such as repeated sampling or iterative revisions, are emerging as powerful ways to enhance large language models (LLMs) at test time. However, best practices for developing systems that combine these techniques remain underdeveloped due to our limited understanding of the utility of each technique across models and tasks, the interactions between them, and the massive search space for combining them. To address these challenges, we introduce Archon, a modular and automated framework for optimizing the process of selecting and combining inference-time techniques and LLMs. Given a compute budget and a set of available LLMs, Archon explores a large design space to discover optimized configurations tailored to target benchmarks. It can design custom or general-purpose architectures that advance the Pareto frontier of accuracy vs. maximum token budget compared to top-performing baselines. Across instruction-following, reasoning, and coding tasks, we show that Archon can leverage additional inference compute budget to design systems that outperform frontier models such as OpenAI’s o1, GPT-4o, and Claude 3.5 Sonnet by an average of 15.1%.
Paperid:2612
Authors:HyunGi Kim · Jisoo Mok · Dong Jun Lee · Jaihyun Lew · Sungjae Sungjae · Sungroh Yoon
Abstract:
Utilizing the complex inter-variable causal relationships within multivariate time-series provides a promising avenue toward more robust and reliable multivariate time-series anomaly detection (MTSAD) but remains an underexplored area of research. This paper proposes Causality-Aware contrastive learning for RObust multivariate Time-Series (CAROTS), a novel MTSAD pipeline that incorporates the notion of causality into contrastive learning. CAROTS employs two data augmentors to obtain causality-preserving and -disturbing samples that serve as a wide range of normal variations and synthetic anomalies, respectively. With causality-preserving and -disturbing samples as positives and negatives, CAROTS performs contrastive learning to train an encoder whose latent space separates normal and abnormal samples based on causality. Moreover, CAROTS introduces a similarity-filtered one-class contrastive loss that encourages the contrastive learning process to gradually incorporate more semantically diverse samples with common causal relationships. Extensive experiments on five real-world and two synthetic datasets validate that the integration of causal relationships endows CAROTS with improved MTSAD capabilities.
Paperid:2613
Authors:Shengju Yu · Siwei Wang · Xinwang Liu · Yiu-ming Cheung · En Zhu
Abstract:
Despite remarkable advances, existing incomplete multi-view clustering (IMC) methods typically leverage either perspective-shared or perspective-specific determinants to encode cluster representations, but not both. To address this limitation, we introduce a BACDL algorithm designed to explicitly capture both concurrently, thereby exploiting heterogeneous data more effectively. It bifurcates feature clusters and further separates them to enlarge the discrimination. With distribution learning, it successfully couples view guidance into feature clusters to alleviate dimension inconsistency. Then, building on the principle that samples in one common cluster share similar marginal and conditional distributions, it unifies the association between feature clusters and sample clusters to bridge all views. Thereafter, all incomplete sample clusters are reordered and mapped to a common one to formulate the clustering embedding. Last, the overall linear overhead endows it with a resource-efficient characteristic.
Paperid:2614
Authors:Jonathan Scott · Christoph Lampert · David Saulpic
Abstract:
Clustering is a cornerstone of data analysis that is particularly suited to identifying coherent subgroups or substructures in unlabeled data, which is generated continuously in large amounts these days. However, in many cases traditional clustering methods are not applicable, because data are increasingly being produced and stored in a distributed way, e.g. on edge devices, and privacy concerns prevent them from being transferred to a central server. To address this challenge, we present a new algorithm for $k$-means clustering that is fully federated as well as differentially private. Our approach leverages (potentially small and out-of-distribution) server-side data to overcome the primary challenge of differentially private clustering methods: the need for a good initialization. Combining our initialization with a simple federated DP-Lloyds algorithm, we obtain an algorithm that achieves excellent results on synthetic and real-world benchmark tasks. We also provide a theoretical analysis of our method that provides bounds on the convergence speed and cluster identification success.
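A single noisy Lloyd update, the central-model building block behind a DP-Lloyds scheme, might look like the sketch below. The Laplace noise scales are illustrative placeholders assuming bounded data, not a calibrated privacy accounting, and the federated aggregation across clients is not shown.

```python
import numpy as np

def dp_lloyd_step(X, centers, eps, rng):
    # one differentially private Lloyd update: perturb per-cluster point
    # counts and coordinate sums with Laplace noise before recomputing means
    assign = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
    new = centers.copy()
    for j in range(len(centers)):
        pts = X[assign == j]
        count = max(1.0, len(pts) + rng.laplace(scale=2.0 / eps))
        total = pts.sum(axis=0) + rng.laplace(scale=2.0 / eps, size=X.shape[1])
        new[j] = total / count
    return new
```

Because each iteration spends privacy budget, a good initialization (the paper's focus) directly reduces how many noisy updates are needed.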
Paperid:2615
Authors:Talal Widatalla · Richard Shuai · Brian Hie · Po-Ssu Huang
Abstract:
Leading deep learning-based methods for fixed-backbone protein sequence design do not model protein sidechain conformation during sequence generation, despite the large role the three-dimensional arrangement of sidechain atoms plays in protein conformation, stability, and overall protein function. Instead, these models implicitly reason about crucial sidechain interactions based solely on backbone geometry and amino-acid sequence. To address this, we present FAMPNN (Full-Atom MPNN), a sequence design method that explicitly models both sequence identity and sidechain conformation for each residue, where the per-token distribution of a residue’s discrete amino acid identity and its continuous sidechain conformation are learned with a combined categorical cross-entropy and diffusion loss objective. We demonstrate that learning these distributions jointly is a highly synergistic task that improves sequence recovery while achieving state-of-the-art sidechain packing. Furthermore, benefits from explicit full-atom modeling generalize from sequence recovery to practical protein design applications, such as zero-shot prediction of experimental binding and stability measurements.
Paperid:2616
Authors:Paulius Rauba · Qiyao Wei · M van der Schaar
Abstract:
We address the challenge of measuring how input perturbations impact large language model (LLM) outputs, a fundamental task for model reliability and post-hoc interpretability. A key obstacle in this domain is disentangling meaningful changes in model responses from the intrinsic stochasticity of LLM outputs. To overcome this, we introduce Distribution-Based Perturbation Analysis (DBPA), a framework that reformulates LLM perturbation analysis as a frequentist hypothesis testing problem. DBPA constructs empirical null and alternative output distributions within a low-dimensional semantic similarity space via Monte Carlo sampling. Comparisons of Monte Carlo estimates in the new space enable tractable frequentist inference without relying on restrictive distributional assumptions. The framework is (i) model-agnostic, (ii) supports the evaluation of arbitrary input perturbations on any black-box LLM, (iii) yields interpretable p-values, (iv) supports multiple perturbation testing via controlled error rates, and (v) provides scalar effect sizes for any chosen similarity or distance metric. We demonstrate the effectiveness of DBPA in evaluating perturbation impacts across multiple case studies.
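The frequentist comparison can be sketched as a permutation test on similarity scores. This generic two-sample test stands in for DBPA's actual procedure; the inputs `null_sims` and `alt_sims` (similarities of baseline vs. perturbed-input outputs to reference outputs) are hypothetical.

```python
import numpy as np

def dbpa_pvalue(null_sims, alt_sims, n_perm=2000, seed=0):
    # permutation test on the shift in mean semantic similarity between
    # outputs under the original input and under the perturbed input
    rng = np.random.default_rng(seed)
    obs = abs(null_sims.mean() - alt_sims.mean())
    pooled = np.concatenate([null_sims, alt_sims])
    n, hits = len(null_sims), 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        hits += abs(pooled[:n].mean() - pooled[n:].mean()) >= obs
    return (hits + 1) / (n_perm + 1)
```

Because the test operates on sampled similarity scores rather than raw text, it stays model-agnostic and makes no parametric assumption about the output distribution.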
Paperid:2617
Authors:Peter Halmos · Julian Gold · Xinhao Liu · Benjamin Raphael
Abstract:
Optimal transport (OT) has enjoyed great success in machine learning as a principled way to align datasets via a least-cost correspondence, with its success driven in large part by the efficiency of the Sinkhorn algorithm. However, the Sinkhorn algorithm and subsequent approaches have quadratic space complexity, limiting their scalability to larger datasets. Low-rank OT achieves linear space complexity, but by definition cannot compute bijective correspondences. We prove that when a Monge map $T$ exists, the factors of an optimal low-rank coupling co-cluster each point $\mathbf{x} \in \mathsf{X}$ with its image under the Monge map. We derive an algorithm, _hierarchical refinement_ (`HR-OT`), that leverages this invariant, dynamically constructing a multiscale partition of each dataset using low-rank OT subproblems, culminating in a bijective transport plan. `HR-OT` uses _linear_ space and has runtime ranging from log-linear to quadratic. Thus, `HR-OT` retains the space advantage of low-rank OT while overcoming its resolution limitations. We demonstrate `HR-OT`'s advantages on several datasets, including datasets containing over a million points, extending full-rank OT to datasets previously out of Sinkhorn's reach.
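A toy illustration of the refinement recursion, substituting principal-axis bisection for the low-rank OT subproblems (so this is only a sketch of the multiscale idea, not `HR-OT` itself). It assumes both sets have the same power-of-two size.

```python
import numpy as np

def hr_match(X, Y, ix=None, iy=None):
    # recursively bisect both point sets along their principal axis and pair
    # the halves by centroid distance, yielding a bijection without ever
    # materializing an n x n coupling matrix
    if ix is None:
        ix, iy = np.arange(len(X)), np.arange(len(Y))
    if len(ix) == 1:
        return [(int(ix[0]), int(iy[0]))]
    def split(P, idx):
        centered = P[idx] - P[idx].mean(0)
        axis = np.linalg.svd(centered, full_matrices=False)[2][0]
        order = idx[np.argsort(P[idx] @ axis)]
        half = len(order) // 2
        return order[:half], order[half:]
    x1, x2 = split(X, ix)
    y1, y2 = split(Y, iy)
    def d(a, b):
        return np.linalg.norm(X[a].mean(0) - Y[b].mean(0))
    if d(x1, y1) + d(x2, y2) > d(x1, y2) + d(x2, y1):
        y1, y2 = y2, y1  # crossed pairing is cheaper, swap the halves
    return hr_match(X, Y, x1, y1) + hr_match(X, Y, x2, y2)
```

Each level stores only the partition, which is what keeps the memory footprint linear; in the real algorithm the split-and-pair step is a low-rank OT subproblem rather than a geometric heuristic.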
Paperid:2618
Authors:Vincent Cohen-Addad · Silvio Lattanzi · Simon Meierhans
Abstract:
We study the offline active learning problem on graphs. In this problem, one seeks to select k vertices whose labels are best suited for predicting the labels of all the other vertices in the graph. Guillory and Bilmes (2009) introduced a natural theoretical model motivated by a label smoothness assumption. Prior to our work, algorithms with theoretical guarantees were only known for restricted graph types such as trees (Cesa-Bianchi et al., 2010), despite the model's simplicity. We present the first O(log n)-resource-augmented algorithm for general weighted graphs. To complement our algorithm, we show constant hardness of approximation.
Paperid:2619
Authors:Yue Wang · Qizhou Wang · Feng Liu · Wei Huang · Yali Du · Xiaojiang Du · Bo Han
Abstract:
Large language model (LLM) unlearning has demonstrated its essential role in removing privacy- and copyright-related responses, which is crucial for legal and safe applications. However, the pursuit of complete unlearning often comes with substantial costs due to compromises in general functionality, leading to a notorious trade-off between unlearning and retention. This motivates us to explore enhanced unlearning schemes that can mitigate this trade-off. Specifically, we propose Gradient Rectified Unlearning (GRU), an improved framework that regulates the directions of gradient updates during the unlearning procedure so that their side effects on other, unrelated responses are minimized. GRU is easy and general to implement, demonstrating practical effectiveness across a variety of well-established unlearning benchmarks.
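One plausible reading of gradient rectification is a PCGrad-style conflict projection; this is an assumption about the mechanism for illustration, not the paper's exact update rule.

```python
import numpy as np

def rectify_gradient(g_unlearn, g_retain):
    # if the unlearning gradient conflicts with the retention gradient
    # (negative inner product), remove the conflicting component so the
    # unlearning step no longer pushes against retained behavior
    dot = float(g_unlearn @ g_retain)
    if dot < 0:
        g_unlearn = g_unlearn - (dot / float(g_retain @ g_retain)) * g_retain
    return g_unlearn
```

After rectification the update direction is guaranteed to have a non-negative inner product with the retention gradient, which is one way to bound side effects on unrelated responses.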
Paperid:2620
Authors:Spencer Hutchinson · Mahnoosh Alizadeh
Abstract:
In this work, we study online convex optimization with a fixed constraint function $g : \mathbb{R}^d \rightarrow \mathbb{R}$. Prior work on this problem has shown $O(\sqrt{T})$ regret and cumulative constraint satisfaction $\sum_{t=1}^{T} g(x_t) \leq 0$, while only accessing the constraint value and subgradient at the played actions, $g(x_t), \partial g(x_t)$. Using the same constraint information, we show a stronger guarantee of anytime constraint satisfaction $g(x_t) \leq 0 \ \forall t \in [T]$, and matching $O(\sqrt{T})$ regret guarantees. These contributions are thanks to our approach of using Polyak feasibility steps to ensure constraint satisfaction without sacrificing regret. Specifically, after each step of online gradient descent, our algorithm applies a subgradient descent step on the constraint function, where the step size is chosen according to the celebrated Polyak step size. We further validate this approach with numerical experiments.
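A minimal sketch of the loop described above. For an affine constraint a single Polyak step lands exactly on the boundary; for general convex $g$ this toy version is illustrative only and does not reproduce the paper's guarantees.

```python
import numpy as np

def ogd_with_polyak_feasibility(grad_f, g, subgrad_g, x0, eta, T):
    # online gradient descent; after each step, if the constraint is
    # violated, take one Polyak-step subgradient move toward feasibility
    x, played = np.asarray(x0, float).copy(), []
    for _ in range(T):
        x = x - eta * grad_f(x)
        gx = g(x)
        if gx > 0:
            s = subgrad_g(x)
            x = x - (gx / float(s @ s)) * s   # Polyak step size: g(x) / ||s||^2
        played.append(x.copy())
    return np.array(played)
```

The key point is that feasibility is restored using only the same first-order constraint information ($g(x_t)$ and a subgradient) that prior work already assumed.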
Paperid:2621
Authors:Bo Zhao · Nima Dehmamy · Robin Walters · Rose Yu
Abstract:
Neural network minima are often connected by curves along which train and test loss remain nearly constant, a phenomenon known as mode connectivity. While this property has enabled applications such as model merging and fine-tuning, its theoretical explanation remains unclear. We propose a new approach to exploring the connectedness of minima using parameter space symmetry. By linking the topology of symmetry groups to that of the minima, we derive the number of connected components of the minima of linear networks and show that skip connections reduce this number. We then examine when mode connectivity and linear mode connectivity hold or fail, using parameter symmetries which account for a significant part of the minimum. Finally, we provide explicit expressions for connecting curves in the minima induced by symmetry. Using the curvature of these curves, we derive conditions under which linear mode connectivity approximately holds. Our findings highlight the role of continuous symmetries in understanding the neural network loss landscape.
Paperid:2622
Authors:Damjan Kalajdzievski
Abstract:
The field of mechanistic interpretability in pre-trained transformer models has demonstrated substantial evidence supporting the ``linear representation hypothesis'', which is the idea that high-level concepts are encoded as vectors in the space of activations of a model. Studies also show that model generation behavior can be steered toward a given concept by adding the concept's vector to the corresponding activations. We show how to leverage these properties to build a form of logical implication into models, enabling transparent and interpretable adjustments that induce a chosen generation behavior in response to the presence of any given concept. Our method, Logical Implication Model Steering (LIMS), unlocks new hand-engineered reasoning capabilities by integrating neuro-symbolic logic into pre-trained transformer models.
Paperid:2623
Authors:Zhendong Wang · Max Li · Ajay Mandlekar · Zhenjia Xu · Jiaojiao Fan · Yashraj Narang · Jim Fan · Yuke Zhu · Yogesh Balaji · Mingyuan Zhou · Ming-Yu Liu · Yu Zeng
Abstract:
Diffusion models, praised for their success in generative tasks, are increasingly being applied to robotics, demonstrating exceptional performance in behavior cloning. However, their slow generation process, stemming from iterative denoising steps, poses a challenge for real-time applications in resource-constrained robotics setups and dynamically changing environments. In this paper, we introduce the One-Step Diffusion Policy, a novel approach that distills knowledge from pre-trained diffusion policies into a single-step action generator, significantly accelerating response times for robotic control tasks. We ensure the distilled generator closely aligns with the original policy distribution by minimizing the Kullback-Leibler (KL) divergence along the diffusion chain, requiring only $2\%$-$10\%$ additional pre-training cost for convergence. We evaluate our method on 6 challenging simulation tasks as well as 4 self-designed real-world tasks using the Franka robot. The results demonstrate that our approach not only achieves state-of-the-art success rates but also delivers an order-of-magnitude improvement in inference speed, boosting action prediction frequency from 1.5 Hz to 62 Hz, establishing its potential for dynamic and computationally constrained robotic applications. A video demo is provided \href{https://drive.google.com/file/d/1eIa11gw6DwYKG9CKERy41bjE1ruklRtT/view?usp=sharing}{\textit{here}}, and the code will be publicly available soon.
Paperid:2624
Authors:Niclas Boehmer · Sara Fish · Ariel Procaccia
Abstract:
A key task in certain democratic processes is to produce a concise slate of statements that proportionally represents the full spectrum of user opinions. This task is similar to committee elections, but unlike traditional settings, the candidate set comprises all possible statements of varying lengths, and so it can only be accessed through specific queries. Combining social choice and large language models, prior work has approached this challenge through a framework of generative social choice. We extend the framework in two fundamental ways, providing theoretical guarantees even in the face of approximately optimal queries and a budget limit on the overall length of the slate. Using GPT-4o to implement queries, we showcase our approach on datasets related to city improvement measures and drug reviews, demonstrating its effectiveness in generating representative slates from unstructured user opinions.
Paperid:2625
Authors:wenzhi Feng · Xutao Li · Zhe Wu · Yunming Ye · Yaowei Wang · Kenghong Lin · Demin Yu
Abstract:
Most current precipitation nowcasting methods aim to capture the underlying spatiotemporal dynamics of precipitation systems by minimizing the mean square error (MSE). However, these methods often neglect effective constraints on the data distribution, leading to unsatisfactory prediction accuracy and image quality, especially for long forecast sequences. To address this limitation, we propose a precipitation nowcasting model incorporating perceptual constraints. This model reformulates precipitation nowcasting as a posterior MSE problem under such constraints. Specifically, we first obtain the posterior mean sequences of precipitation forecasts using a precipitation estimator. Subsequently, we construct the transmission between distributions using rectified flow. To enhance the focus on distant frames, we design a frame sampling strategy that gradually increases the corresponding weights. We theoretically demonstrate the reliability of our solution, and experimental results on two publicly available radar datasets demonstrate that our model is effective and outperforms current state-of-the-art models.
Paperid:2626
Authors:Georg Bökman · David Nordström · Fredrik Kahl
Abstract:
Incorporating geometric invariance into neural networks enhances parameter efficiency but typically increases computational costs. This paper introduces new equivariant neural networks that preserve symmetry while maintaining a comparable number of floating-point operations (FLOPs) per parameter to standard non-equivariant networks. We focus on horizontal mirroring (flopping) invariance, common in many computer vision tasks. The main idea is to parametrize the feature spaces in terms of mirror-symmetric and mirror-antisymmetric features, i.e., irreps of the flopping group. This decomposes the linear layers into block-diagonal form, requiring half the number of FLOPs. Our approach reduces both FLOPs and wall-clock time, providing a practical solution for efficient, scalable symmetry-aware architectures.
Paperid:2627
Authors:Akash Kumar
Abstract:
The success of deep networks is crucially attributed to their ability to capture latent features within a representation space. In this work, we investigate whether the underlying learned features of a model can be efficiently retrieved through feedback from an agent, such as a large language model (LLM), in the form of relative triplet comparisons. These features may represent various constructs, including dictionaries in LLMs or components of a covariance matrix of Mahalanobis distances. We analyze the feedback complexity associated with learning a feature matrix in sparse settings. Our results establish tight bounds when the agent is permitted to construct activations and demonstrate strong upper bounds in sparse scenarios when the agent's feedback is limited to distributional information. We validate our theoretical findings through experiments on two distinct applications: feature recovery from Recursive Feature Machine-trained models and dictionary extraction from sparse autoencoders trained on Large Language Models.
Paperid:2628
Authors:Eray Erturk · Fahad Kamran · Salar Abbaspourazad · Sean Jewell · Harsh Sharma · Yujie Li · Sinead A Williamson · Nicholas Foti · Joseph Futoma
Abstract:
Wearable devices record physiological and behavioral signals that can improve health predictions. While foundation models are increasingly used for such predictions, they have been primarily applied to low-level sensor data, despite behavioral data often being more informative due to their alignment with physiologically relevant timescales and quantities. We develop foundation models of such behavioral signals using over 2.5B hours of wearable data from 162K individuals, systematically optimizing architectures and tokenization strategies for this unique dataset. Evaluated on 57 health-related tasks, our model shows strong performance across diverse real-world applications including individual-level classification and time-varying health state prediction. The model excels in behavior-driven tasks like sleep prediction, and improves further when combined with representations of raw sensor data. These results underscore the importance of tailoring foundation model design to wearables and demonstrate the potential to enable new health applications.
Paperid:2629
Authors:Xiwen Chen · Wenhui Zhu · Peijie Qiu · Hao Wang · Huayu Li · ZIHAN LI · Yalin Wang · Aristeidis Sotiras · Abolfazl Razi
Abstract:
Analyzing time series data is crucial to a wide spectrum of applications, including economics, online marketplaces, and human healthcare. In particular, time series classification plays an indispensable role in segmenting different phases in stock markets, predicting customer behavior, and classifying worker actions and engagement levels. These aspects contribute significantly to the advancement of automated decision-making and system optimization in real-world applications. However, there is broad consensus that time series data often suffers from domain shifts between training and test sets, which dramatically degrade classification performance. Despite the success of (reversible) instance normalization in handling domain shifts for time series regression tasks, its performance in classification is unsatisfactory. In this paper, we propose $\textit{FIC-TSC}$, a training framework for time series classification that leverages Fisher information as a constraint. We theoretically and empirically show that this is an efficient and effective solution to guide the model to converge toward flatter minima, which enhances its generalizability to distribution shifts. We rigorously evaluate our method on 30 UEA multivariate and 85 UCR univariate datasets. Our empirical results demonstrate the superiority of the proposed method over 14 recent state-of-the-art methods.
Paperid:2630
Authors:Min-Yeong Park · Won-Jeong Lee · Seong Tae Kim · Gyeong-Moon Park
Abstract:
Recently, forecasting future abnormal events has emerged as an important scenario to address real-world needs. However, the solution of predicting specific future time points when anomalies will occur, known as Anomaly Prediction (AP), remains underexplored. Existing methods dealing with time series data fail in AP, focusing only on immediate anomalies or failing to provide precise predictions for future anomalies. To address AP, we propose a novel framework called Anomaly to Prompt (A2P), comprised of Anomaly-Aware Forecasting (AAF) and Synthetic Anomaly Prompting (SAP). To enable the forecasting model to forecast abnormal time points, we adopt a strategy to learn the relationships of anomalies. For the robust detection of anomalies, our proposed SAP introduces a learnable Anomaly Prompt Pool (APP) that simulates diverse anomaly patterns using signal-adaptive prompts. Comprehensive experiments on multiple real-world datasets demonstrate the superiority of A2P over state-of-the-art methods, showcasing its ability to predict future anomalies.
Paperid:2631
Authors:Guangyi Liu · Suzan Iloglu · Michael Caldara · Joseph Durham · Michael Zavlanos
Abstract:
In robotic warehouses, the destination-to-chute mapping problem is crucial for efficient package sorting. Often, however, this problem is complicated by uncertain and dynamic package induction rates, which can lead to increased package recirculation. To tackle this challenge, we introduce a Distributionally Robust Multi-Agent Reinforcement Learning (DRMARL) framework that learns a destination-to-chute mapping policy that is resilient to adversarial variations in induction rates. Specifically, DRMARL relies on group distributionally robust optimization (DRO) to learn a policy that performs well not only on average but also on each individual subpopulation of induction rates within the group, capturing, for example, different seasonality or operation modes of the system. This approach is then combined with a novel contextual bandit-based predictor of the worst-case induction distribution for each state-action pair, significantly reducing the cost of exploration and thereby increasing the learning efficiency and scalability of our framework. Extensive simulations demonstrate that DRMARL achieves robust chute mapping in the presence of varying induction distributions, reducing package recirculation by an average of 80% in the simulation scenario.
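The group DRO reweighting at the heart of DRMARL can be illustrated with a minimal sketch. This is not the authors' implementation: the exponentiated-gradient update and the toy per-group losses below are illustrative assumptions about how the worst-performing subpopulation gets upweighted.

```python
import numpy as np

def group_dro_weights(group_losses, eta=0.5):
    """One exponentiated-gradient step of group DRO: upweight the groups
    (here, induction-rate subpopulations) with the highest current loss,
    so training focuses on the worst case rather than the average."""
    w = np.exp(eta * np.asarray(group_losses, dtype=float))
    return w / w.sum()

# Hypothetical per-subpopulation policy losses (e.g. different seasons)
losses = [0.2, 1.5, 0.7]
w = group_dro_weights(losses)
print(w)  # heaviest weight lands on the worst group
```

In a full training loop these weights would scale each group's gradient contribution before the next policy update.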
Paperid:2632
Authors:Yuchen Zhang · Jian Zhou
Abstract:
Inverse generation problems, such as denoising without ground-truth observations, are a critical challenge in many scientific inquiries and real-world applications. While recent advances in generative models like diffusion models, conditional flow matching, and consistency models have achieved impressive results by casting generation as a denoising problem, they cannot be directly used for inverse generation without access to clean data. Here we introduce Inverse Flow (IF), a novel framework that enables using these generative models for inverse generation problems, including denoising without ground truth. Inverse Flow can be flexibly applied to nearly any continuous noise distribution and allows complex dependencies. We propose two algorithms for learning Inverse Flows, Inverse Flow Matching (IFM) and Inverse Consistency Model (ICM). Notably, to derive the computationally efficient, simulation-free inverse consistency model objective, we generalize consistency training to any forward diffusion process or conditional flow, which has applications beyond denoising. We demonstrate the effectiveness of IF on synthetic and real datasets, outperforming prior approaches while enabling noise distributions that previous methods cannot support. Finally, we showcase applications of our techniques to fluorescence microscopy and single-cell genomics data, highlighting IF's utility in scientific problems. Overall, this work expands the applications of powerful generative models to inverse generation problems.
Paperid:2633
Authors:Mohammad Jalali · Bahar Dibaei Nia · Farzan Farnia
Abstract:
Feature embedding models have been widely applied to many downstream tasks across various domains. While several standard feature embeddings have been developed in the literature, comparisons of these embeddings have largely focused on their numerical performance in classificationrelated downstream applications. However, an interpretable comparison of different embeddings requires identifying and analyzing mismatches between similar sample groups clustered within the embedding spaces. In this work, we propose the Spectral Pairwise Embedding Comparison (SPEC) framework to compare embeddings and identify their differences in clustering a benchmark dataset. Our approach examines the kernel matrices derived from two embeddings and leverages the eigendecomposition of the differential kernel matrix to detect sample clusters that are captured differently by the two embeddings. We present a scalable implementation of this kernel-based approach, with computational complexity that grows linearly with the sample size. Furthermore, we introduce an optimization problem using this framework to align two embeddings, ensuring that clusters detected by one embedding are also captured by the other. Finally, we provide numerical results showcasing the application of the SPEC framework to compare and align standard embeddings on large-scale datasets such as ImageNet and MS-COCO.
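A minimal sketch of the kernel-based comparison described above, using RBF kernels and synthetic 2D "embeddings" for illustration. This is not the authors' SPEC implementation (which additionally provides a scalable variant); the kernel choice and data are assumptions.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Pairwise squared distances -> RBF kernel matrix
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def spec_mismatch(emb_a, emb_b, gamma=1.0):
    """Eigendecompose the differential kernel K_A - K_B; large-magnitude
    entries of the leading eigenvectors flag samples whose cluster
    structure differs between the two embeddings."""
    K_diff = rbf_kernel(emb_a, gamma) - rbf_kernel(emb_b, gamma)
    eigvals, eigvecs = np.linalg.eigh(K_diff)  # symmetric -> eigh
    order = np.argsort(-np.abs(eigvals))       # sort by magnitude
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
# Embedding A separates two groups; embedding B collapses them
emb_a = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])
emb_b = rng.normal(0, 0.1, (40, 2))
vals, vecs = spec_mismatch(emb_a, emb_b)
print(vals[0])  # dominant differential eigenvalue
```

The dominant eigenvector concentrates on the samples that embedding A clusters apart but embedding B merges.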
Paperid:2634
Authors:Riccardo De Santi · Marin Vlastelica · Ya-Ping Hsieh · Zebang Shen · Niao He · Andreas Krause
Abstract:
Exploration is critical for solving real-world decision-making problems such as scientific discovery, where the objective is to generate truly novel designs rather than mimic existing data distributions. In this work, we address the challenge of leveraging the representational power of generative models for exploration without relying on explicit uncertainty quantification. We introduce a novel framework that casts exploration as entropy maximization over the approximate data manifold implicitly defined by a pre-trained diffusion model. Then, we present a novel principle for exploration based on density estimation, a problem well-known to be challenging in practice. To overcome this issue and render this method truly scalable, we leverage a fundamental connection between the entropy of the density induced by a diffusion model and its score function. Building on this, we develop an algorithm based on mirror descent that solves the exploration problem as sequential fine-tuning of a pre-trained diffusion model. We prove its convergence to the optimal exploratory diffusion model under realistic assumptions by leveraging recent understanding of mirror flows. Finally, we empirically evaluate our approach on both synthetic and high-dimensional text-to-image diffusion, demonstrating promising results.
Paperid:2635
Authors:Alessandro Palma · Sergei Rybakov · Leon Hetzel · Stephan Günnemann · Fabian Theis
Abstract:
Latent space interpolations are a powerful tool for navigating deep generative models, especially in applied contexts where they capture system trajectories on the manifolds of high-dimensional data. An example is single-cell RNA sequencing (scRNA-seq), where existing methods model cellular state transitions as latent space interpolations with variational autoencoders, often assuming linear shifts and Euclidean geometry. However, unless explicitly enforced, linear interpolations in the latent space may not correspond to geodesic paths on the data manifold, limiting methods that assume Euclidean geometry in the representation space. We introduce FlatVI, a novel training framework that enforces Euclidean geometry in the latent manifold of discrete-likelihood variational autoencoders, specifically tailored for modelling single-cell count data. By regularising straight lines in the latent space to approximate geodesic interpolations on the decoded single-cell manifold, FlatVI enhances compatibility with downstream analyses that assume Euclidean latent geometry. Results on synthetic data validate our approach theoretically. At the same time, applications to temporally resolved scRNA-seq datasets demonstrate improved reconstruction of cellular trajectories and more accurate inference of biologically meaningful velocity fields.
Paperid:2636
Authors:Zhi Zheng · Zhuoliang Xie · Zhenkun Wang · Bryan Hooi
Abstract:
Handcrafting heuristics for solving complex optimization tasks (e.g., route planning and task allocation) is a common practice but requires extensive domain knowledge. Recently, Large Language Model (LLM)-based automatic heuristic design (AHD) methods have shown promise in generating high-quality heuristics without manual interventions. Existing LLM-based AHD methods employ a population to maintain a fixed number of top-performing LLM-generated heuristics and introduce evolutionary computation (EC) to iteratively enhance the population. However, these population-based procedures cannot fully develop the potential of each heuristic and are prone to converge into local optima. To more comprehensively explore the space of heuristics, this paper proposes to use Monte Carlo Tree Search (MCTS) for LLM-based heuristic evolution. The proposed MCTS-AHD method organizes all LLM-generated heuristics in a tree structure and can better develop the potential of temporarily underperforming heuristics. In experiments, MCTS-AHD delivers significantly higher-quality heuristics on various complex tasks. Our code is available.
Paperid:2637
Authors:Linda Lu · Ayush Sekhari · Karthik Sridharan
Abstract:
Machine unlearning addresses the problem of updating a machine learning model/system trained on a dataset $S$ so that the influence of a set of deletion requests $U \subseteq S$ on the unlearned model is minimized. The gold standard definition of unlearning demands that the updated model after deletion be nearly identical to the model obtained by retraining. This definition is designed for a worst-case attacker (one who can recover not only the unlearned model but also $S \setminus U$). Such a stringent definition has made developing efficient unlearning algorithms challenging. Such strong attackers are also unrealistic. In this work, we propose a new definition, *system-aware unlearning*, which aims to provide unlearning guarantees against an attacker that can at best only gain access to the data stored in the system for learning/unlearning requests and not $S \setminus U$. With this new definition, we use the simple intuition that if a system can store less to make its learning/unlearning updates, it can be more secure and update more efficiently against a system-aware attacker. We present an exact system-aware unlearning algorithm for linear classification using a selective sampling based approach. We theoretically analyze the tradeoffs between deletion capacity, accuracy, memory, and computation time and also show that our method can be practically efficient in handling deletion requests.
Paperid:2638
Authors:Yinbin Han · Meisam Razaviyayn · Renyuan Xu
Abstract:
Diffusion models have emerged as powerful tools for generative modeling, demonstrating exceptional capability in capturing target data distributions from large datasets. However, fine-tuning these massive models for specific downstream tasks, constraints, and human preferences remains a critical challenge. While recent advances have leveraged reinforcement learning algorithms to tackle this problem, much of the progress has been empirical, with limited theoretical understanding. To bridge this gap, we propose a stochastic control framework for fine-tuning diffusion models. Building on denoising diffusion probabilistic models as the pre-trained reference dynamics, our approach integrates linear dynamics control with Kullback–Leibler regularization. We establish the well-posedness and regularity of the stochastic control problem and develop a policy iteration algorithm (PI-FT) for numerical solution. We show that PI-FT achieves global convergence at a linear rate. Unlike existing work that assumes regularities throughout training, we prove that the control and value sequences generated by the algorithm maintain the desirable regularity. Finally, we study extensions of our framework to parametric settings for efficient implementation.
Paperid:2639
Authors:Shikang Liu · Chuyang Wei · Xiren Zhou · Huanhuan Chen
Abstract:
Analyzing inherent temporal dynamics is a critical pathway for time series classification, where Reservoir Computing (RC) exhibits effectiveness and high efficiency. However, typical RC considers recursive updates from adjacent states, struggling with long-term dependencies. In response, this paper proposes a Spectral-Aware Reservoir Computing framework (SARC), incorporating spectral insights to enhance long-term dependency modeling. Prominent frequencies are initially extracted to reveal explicit or implicit cyclical patterns. For each prominent frequency, SARC further integrates a Frequency-informed Reservoir Network (FreqRes) to adequately capture both sequential and cyclical dynamics, thereby deriving effective dynamic features. Synthesizing these features across various frequencies, SARC offers a multi-scale analysis of temporal dynamics and improves the modeling of long-term dependencies. Experiments on public datasets demonstrate that SARC achieves state-of-the-art results, while maintaining high efficiency compared to existing methods.
Paperid:2640
Authors:Shirley Wu · Michel Galley · Baolin Peng · Hao Cheng · Gavin Li · Yao Dou · Weixin Cai · James Zou · Jure Leskovec · Jianfeng Gao
Abstract:
Large Language Models are typically trained with next-turn rewards, limiting their ability to optimize for long-term interaction. As a result, they often respond passively to ambiguous or open-ended user requests, failing to help users reach their ultimate intents and leading to inefficient conversations. To address these limitations, we introduce CollabLLM, a novel and general training framework that enhances multiturn human-LLM collaboration. Its key innovation is a collaborative simulation that estimates the long-term contribution of responses using Multiturn-aware Rewards. By reinforcement fine-tuning on these rewards, CollabLLM goes beyond responding to user requests, and actively uncovers user intent and offers insightful suggestions, a key step towards more human-centered AI. We also devise a multiturn interaction benchmark with three challenging tasks, such as document creation. CollabLLM significantly outperforms our baselines, with on average 18.5% higher task performance and 46.3% improved interactivity as rated by LLM judges. Finally, we conduct a large user study with 201 judges, where CollabLLM increases user satisfaction by 17.6% and reduces the time users spend by 10.4%.
Paperid:2641
Authors:Zhaorun Chen · Mintong Kang · Bo Li
Abstract:
LLM agents have demonstrated wide adoption in various applications, leveraging their ability to access sensitive data and make autonomous decisions. However, they remain highly vulnerable to malicious instructions and attacks, which can result in severe consequences such as privacy breaches and financial losses. More critically, existing guardrails for LLMs are not applicable due to the complex and dynamic nature of LLM agents. To tackle these challenges, we propose ShieldAgent, the first LLM-based guardrail agent designed to enforce explicit safety policy compliance of the action sequences of other (multi-modal) LLM agents via automated probabilistic reasoning. Specifically, ShieldAgent constructs a structured safety model by extracting verifiable rules from policy documents and organizing them into action-conditioned probabilistic circuits. During inference, it first localizes relevant rule circuits w.r.t. the proposed action, then generates a shielding plan with operations supported by a rich tool library, and produces executable code for formal verification. Then it performs probabilistic inference to assign safety labels and report rule violations. Recognizing the lack of guardrail benchmarks for LLM agents, we introduce ShieldAgent-Web, a dataset with 2K safety-related instructions across 7 risk categories and 6 web environments, each paired with risky trajectories generated under two SOTA attacks. Experiments show that ShieldAgent achieves SOTA performance on ShieldAgent-Web and existing benchmarks, outperforming prior methods by 11.3% on average with a high rule recall of 90.1%. Additionally, ShieldAgent reduces API queries by 64.7% and inference time by 58.2%, demonstrating its high precision and efficiency in safeguarding LLM agents.
Paperid:2642
Authors:Jaeyeon Kim · Kulin Shah · Vasilis Kontonis · Sham Kakade · Sitan Chen
Abstract:
In recent years, masked diffusion models (MDMs) have emerged as a promising alternative approach for generative modeling over discrete domains. Compared to autoregressive models (ARMs), MDMs trade off complexity at training time with flexibility at inference time. At training time, they must learn to solve an exponentially large number of infilling problems, but at inference time, they can decode tokens in essentially arbitrary order. In this work we closely examine these two competing effects. On the training front, we theoretically and empirically demonstrate that MDMs indeed train on computationally intractable subproblems compared to their autoregressive counterparts. On the inference front, we show that a suitable strategy for adaptively choosing the token decoding order significantly enhances the capabilities of MDMs, allowing them to sidestep hard subproblems. On logic puzzles like Sudoku, we show that adaptive inference can boost solving accuracy in pretrained MDMs from $<7$\% to $\approx 90$\%, even outperforming ARMs that were explicitly trained via teacher forcing to learn the right order of decoding.
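The adaptive decoding-order idea can be sketched as follows, with a toy stand-in for the masked diffusion model. The confidence-greedy rule and the toy model are illustrative assumptions, not the paper's exact strategy.

```python
import numpy as np

MASK = -1

def adaptive_decode(probs_fn, seq):
    """Greedy adaptive order: repeatedly fill the masked position where
    the model is most confident, instead of a fixed left-to-right order.
    Re-querying after each fill lets easy constraints (e.g. forced
    Sudoku cells) resolve before hard ones."""
    seq = seq.copy()
    while (seq == MASK).any():
        probs = probs_fn(seq)              # (L, V) per-position probabilities
        masked = np.where(seq == MASK)[0]
        conf = probs[masked].max(axis=1)   # confidence at each masked slot
        pos = masked[conf.argmax()]        # most confident slot first
        seq[pos] = probs[pos].argmax()
    return seq

# Toy "model": position i prefers token i, with confidence growing in i
def toy_model(seq):
    L = V = len(seq)
    probs = np.full((L, V), 1.0 / V)
    for i in range(L):
        probs[i, i] += (i + 1) / (V + 1)
        probs[i] /= probs[i].sum()
    return probs

out = adaptive_decode(toy_model, np.full(5, MASK))
print(out)  # fills the most confident (rightmost) positions first
```

Here the toy model is static, so the order only changes which slot is committed first; in a real MDM each commitment reshapes the remaining conditional distribution.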
Paperid:2643
Authors:Haotian Fu · Yixiang Sun · Michael L. Littman · George Konidaris
Abstract:
We propose DRAGO, a novel approach for continual model-based reinforcement learning aimed at improving the incremental development of world models across a sequence of tasks that differ in their reward functions but not the state space or dynamics. DRAGO comprises two key components: Synthetic Experience Rehearsal, which leverages generative models to create synthetic experiences from past tasks, allowing the agent to reinforce previously learned dynamics without storing data, and Regaining Memories Through Exploration, which introduces an intrinsic reward mechanism to guide the agent toward revisiting relevant states from prior tasks. Together, these components enable the agent to maintain a comprehensive and continually developing world model, facilitating more effective learning and adaptation across diverse environments. Empirical evaluations demonstrate that DRAGO is able to preserve knowledge across tasks, achieving superior performance in various continual learning scenarios.
Paperid:2644
Authors:Cheng Jin · Chutao Liu · Zhenyu Xiao · Yuantao Gu
Abstract:
Classifier-free guidance (CFG) has emerged as a pivotal advancement in text-to-image latent diffusion models, establishing itself as a cornerstone technique for achieving high-quality image synthesis. However, under high guidance weights, where text-image alignment is significantly enhanced, CFG also leads to pronounced color distortions in the generated images. We identify that these distortions stem from the amplification of sample norms in the latent space. We present a theoretical framework that elucidates the mechanisms of norm amplification and anomalous diffusion phenomena induced by classifier-free guidance. Leveraging our theoretical insights and the latent space structure, we propose an Angle Domain Guidance (ADG) algorithm. ADG constrains magnitude variations while optimizing angular alignment, thereby mitigating color distortions while preserving the enhanced text-image alignment achieved at higher guidance weights. Experimental results demonstrate that ADG significantly outperforms existing methods, generating images that not only maintain superior text alignment but also exhibit improved color fidelity and better alignment with human perceptual preferences.
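A simplified sketch of separating magnitude from angle in guidance, as an illustration of the constraint described above rather than the paper's actual ADG algorithm: standard CFG extrapolates the noise prediction (which can inflate norms at high weights), while the angle-domain variant below keeps the guided direction but pins the norm to the conditional prediction's.

```python
import numpy as np

def cfg(eps_uncond, eps_cond, w):
    # Standard classifier-free guidance: extrapolate toward the condition
    return eps_uncond + w * (eps_cond - eps_uncond)

def angle_domain_sketch(eps_uncond, eps_cond, w):
    """Illustrative: apply guidance only to the direction (angle),
    constraining the magnitude to avoid norm amplification."""
    guided = cfg(eps_uncond, eps_cond, w)
    target_norm = np.linalg.norm(eps_cond)
    return guided * (target_norm / (np.linalg.norm(guided) + 1e-12))

rng = np.random.default_rng(0)
eu, ec = rng.normal(size=16), rng.normal(size=16)
g_cfg = cfg(eu, ec, w=7.5)
g_adg = angle_domain_sketch(eu, ec, w=7.5)
print(np.linalg.norm(g_cfg), np.linalg.norm(g_adg))
```

The two outputs share a direction; only the norm handling differs.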
Paperid:2645
Authors:Roman Bachmann · Jesse Allardice · David Mizrahi · Enrico Fini · Oguzhan Fatih Kar · Elmira Amirloo · Alaaeldin Ali · Amir Zamir · Afshin Dehghan
Abstract:
Image tokenization has enabled major advances in autoregressive image generation by providing compressed, discrete representations that are more efficient to process than raw pixels. While traditional approaches use 2D grid tokenization, recent methods like TiTok have shown that 1D tokenization can achieve high generation quality by eliminating grid redundancies. However, these methods typically use a fixed number of tokens and thus cannot adapt to an image's inherent complexity. We introduce FlexTok, a tokenizer that projects 2D images into variable-length, ordered 1D token sequences. For example, a 256x256 image can be resampled into anywhere from 1 to 256 discrete tokens, hierarchically and semantically compressing its information. By training a rectified flow model as the decoder and using nested dropout, FlexTok produces plausible reconstructions regardless of the chosen token sequence length. We evaluate our approach in an autoregressive generation setting using a simple GPT-style Transformer. On ImageNet, this approach achieves an FID<2 across 8 to 128 tokens, outperforming TiTok and matching state-of-the-art methods with far fewer tokens. We further extend the model to support text-conditioned image generation and examine how FlexTok relates to traditional 2D tokenization. A key finding is that FlexTok enables next-token prediction to describe images in a coarse-to-fine "visual vocabulary", and that the number of tokens to generate depends on the complexity of the generation task.
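The nested-dropout trick mentioned above can be sketched as follows (illustrative only; the function name and token representation are assumptions). Truncating to a random-length prefix during training forces earlier tokens to carry the coarse, global structure while later ones refine detail, which is what lets the decoder reconstruct from any chosen length.

```python
import numpy as np

def nested_dropout_prefix(tokens, rng):
    """Keep only a random-length prefix of the ordered 1D token sequence,
    so the tokenizer learns a coarse-to-fine token ordering."""
    k = int(rng.integers(1, len(tokens) + 1))  # sample a truncation length
    return tokens[:k]

rng = np.random.default_rng(0)
tokens = np.arange(256)                 # an ordered 1D token sequence
prefix = nested_dropout_prefix(tokens, rng)
print(len(prefix))                      # variable length, between 1 and 256
```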
Paperid:2646
Authors:Rui Yang · Hanyang (Jeremy) Chen · Junyu Zhang · Mark Zhao · Cheng Qian · Kangrui Wang · Qineng Wang · Teja Koripella · Marziyeh Movahedi · Manling Li · Heng Ji · Huan Zhang · Tong Zhang
Abstract:
Leveraging Multimodal Large Language Models (MLLMs) to create embodied agents offers a promising avenue for tackling real-world tasks. While language-centric embodied agents have garnered substantial attention, MLLM-based embodied agents remain underexplored due to the lack of comprehensive evaluation frameworks. To bridge this gap, we introduce EmbodiedBench, an extensive benchmark designed to evaluate vision-driven embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing tasks across four environments, ranging from high-level semantic tasks (e.g., household) to low-level tasks involving atomic actions (e.g., navigation and manipulation); and (2) six meticulously curated subsets evaluating essential agent capabilities like common sense reasoning, complex instruction following, spatial awareness, visual perception, and long-term planning. Through extensive experiments, we evaluate 13 leading proprietary and open-source MLLMs within EmbodiedBench. Our findings reveal that MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only $28.9\%$ on average. EmbodiedBench provides a standardized, multifaceted evaluation platform that not only highlights existing challenges but also offers valuable insights for advancing MLLM-based embodied agents.
Paperid:2647
Authors:Ankit Bhardwaj · Amar Phanishayee · Deepak Narayanan · Ryan Stutsman
Abstract:
In this paper, we investigate how to push the performance limits of serving Deep Neural Network (DNN) models on CPU-based servers. Specifically, we observe that while intra-operator parallelism across multiple threads is an effective way to reduce inference latency, it provides diminishing returns. Our primary insight is that instead of running a single instance of a model with all available threads on a server, running multiple instances each with smaller batch sizes and fewer threads for intra-op parallelism can provide lower inference latency. However, the right configuration is hard to determine manually since it is workload- (DNN model and batch size used by the serving system) and deployment-dependent (number of CPU cores on server). We present Packrat, a new serving system for online inference that, given a model and batch size ($B$), algorithmically picks the optimal number of instances ($i$), the number of threads each should be allocated ($t$), and the batch sizes each should operate on ($b$) that minimizes latency. Packrat is built as an extension to TorchServe and supports online reconfigurations to avoid serving downtime. Averaged across a range of batch sizes, Packrat improves inference latency by 1.43× to 1.83× on a range of commonly used DNNs.
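The configuration choice can be sketched as a brute-force enumeration over instance counts. This is illustrative only: Packrat's actual search algorithm and latency model are not given in the abstract, and the toy latency function below (diminishing returns from threads, parallel instances) is an assumption.

```python
def best_config(total_threads, batch_size, latency_fn):
    """Enumerate (instances i, threads-per-instance t, per-instance
    batch b) partitions of the server and pick the configuration that
    minimizes predicted latency."""
    best = None
    for i in range(1, total_threads + 1):
        t = total_threads // i          # threads left per instance
        if t == 0:
            break
        b = -(-batch_size // i)         # ceil division: split the batch
        lat = latency_fn(i, t, b)
        if best is None or lat < best[0]:
            best = (lat, i, t, b)
    return best

# Toy latency model: per-instance work shrinks sublinearly with threads
toy = lambda i, t, b: b * (1.0 / (t ** 0.5) + 0.05 * t)
print(best_config(16, 32, toy))
```

A real system would replace `toy` with profiled latencies for the actual DNN on the target server.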
Paperid:2648
Authors:Tianyu Gao · Alexander Wettig · Luxi He · Yihe Dong · Sadhika Malladi · Danqi Chen
Abstract:
The vast diversity of styles, domains, and quality levels present in language model pretraining corpora is essential in developing general model capabilities, but efficiently learning and deploying the correct behaviors exemplified in each of these heterogeneous data sources is challenging. To address this, we propose a new method, termed Metadata Conditioning then Cooldown (MeCo), to incorporate additional learning cues during pre-training. MeCo first provides metadata (e.g., URLs like en.wikipedia.org) alongside the text during training and later uses a cooldown phase with only the standard text, thereby enabling the model to function normally even without metadata. MeCo significantly accelerates pre-training across different model scales (600M to 8B parameters) and training sources (C4, RefinedWeb, and DCLM). For instance, a 1.6B language model trained with MeCo matches the downstream task performance of standard pre-training while using 33% less data. Additionally, MeCo enables us to steer language models by conditioning the inference prompt on either real or fabricated metadata that encodes the desired properties of the output: for example, prepending wikipedia.org to reduce harmful generations or factquizmaster.com (fabricated) to improve common knowledge task performance. We also demonstrate that MeCo is compatible with different types of metadata, such as model-generated topics. MeCo is remarkably simple, adds no computational overhead, and demonstrates promise in producing more capable and steerable language models.
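The metadata-conditioning recipe is simple enough to sketch as data preparation; the exact delimiter between metadata and text is an assumption not specified in the abstract.

```python
def meco_format(example, phase):
    """Prepend source metadata during the main pre-training phase and
    drop it during the cooldown phase, so the model also functions
    normally without metadata."""
    if phase == "pretrain":
        return f"{example['url']}\n\n{example['text']}"
    return example["text"]              # cooldown: plain text only

def steer_prompt(prompt, metadata):
    # Inference-time steering: condition on real or fabricated metadata
    return f"{metadata}\n\n{prompt}"

ex = {"url": "en.wikipedia.org", "text": "The Eiffel Tower is in Paris."}
print(meco_format(ex, "pretrain"))
print(steer_prompt("Who built the Eiffel Tower?", "wikipedia.org"))
```

Steering with a fabricated domain (e.g. factquizmaster.com) works the same way: the metadata string simply encodes the desired output properties.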
Paperid:2649
Authors:Junbin Liu · Wing-Kin Ma · Farzan Farnia
Abstract:
Multilayer matrix factorization (MMF) has recently emerged as a more general model of, and potentially a more powerful approach than, the classic matrix factorization. This paper considers MMF under a probabilistic formulation, and our focus is on inference methods under variational inference. The challenge in this context lies in determining a variational process that leads to a computationally efficient and accurate approximation of the maximum likelihood inference. One example is the variational autoencoder (VAE), which uses neural networks for the variational process. In this work, we take insight from variational diffusion models in the context of generative models to develop variational inference for MMF. We propose a dimension-reducing diffusion process that provides a new way to interact with the layered structures of the MMF model. Experimental results demonstrate that the proposed diffusion variational inference method yields better performance than some existing methods such as the VAE.
Paperid:2650
Authors:Dake Bu · Wei Huang · Andi Han · Atsushi Nitanda · Taiji Suzuki · Qingfu Zhang · Hau-San Wong
Abstract:
In-context learning (ICL) has garnered significant attention for its ability to grasp functions and tasks from demonstrations. Recent empirical studies suggest the presence of a Task/Function Vector in the latent geometry of large language models (LLMs) during ICL. Merullo et al. (2024) showed that LLMs leverage this vector alongside the residual stream for Word2Vec-like vector arithmetic, solving factual-recall ICL tasks. Additionally, recent work empirically highlighted the key role of Question-Answer data in enhancing factual-recall capabilities. Despite these insights, a theoretical explanation remains elusive. To move one step forward, this work provides a theoretical framework building on empirically grounded hierarchical concept modeling. We develop an optimization theory, showing how nonlinear residual transformers trained via gradient descent on cross-entropy loss perform factual-recall ICL tasks via task vector arithmetic. We prove 0-1 loss convergence and highlight their superior generalization capabilities, adeptly handling concept recombinations and data shifts. These findings underscore the advantages of transformers over static word embedding methods. Empirical simulations corroborate our theoretical insights.
Paperid:2651
Authors:MINGJIA YIN · Junwei Pan · Hao Wang · Ximei Wang · Shangyu Zhang · Jie Jiang · Defu Lian · Enhong Chen
Abstract:
Click-Through Rate (CTR) prediction models estimate the probability of users clicking on items based on feature interactions, inherently following a discriminative paradigm. However, this paradigm is prone to embedding dimensional collapse and information redundancy due to limitations of vanilla feature embeddings. This motivates us to reformulate it into a generative paradigm that generates new feature embeddings. Unlike sequential recommendation, which naturally fits a generative "next-item prediction" paradigm, it is hard to formulate CTR models in this paradigm without an explicit feature order. Therefore, we propose a novel Supervised Feature Generation framework for CTR models, shifting from the discriminative "feature interaction" paradigm to the generative "feature generation" paradigm. Specifically, we predict each feature embedding based on the concatenation of all feature embeddings. Besides, this paradigm naturally accommodates a supervised binary cross-entropy loss to indicate whether the sample is positive or negative. The framework can reformulate nearly every existing CTR model and brings significant performance lifts. Moreover, it produces less-collapsed and redundancy-reduced feature embeddings, thereby mitigating the inherent limitations of the discriminative paradigm.
Paperid:2652
Authors:Yushuai Li · Hengyu Liu · Torben Pedersen · Yuqiang He · Kim Larsen · Lu Chen · Christian Jensen · Jiachen Xu · TIANYI LI
Abstract:
The MuZero reinforcement learning method has achieved superhuman performance at games, and advances that enable MuZero to contend with complex actions now enable use of MuZero-class methods in real-world decision-making applications. However, some real-world applications are susceptible to state perturbations caused by malicious attacks and noisy sensors. To enhance the robustness of MuZero-class methods to state perturbations, we propose RobustZero, the first MuZero-class method that is $\underline{robust}$ to worst-case and random-case state perturbations, with $\underline{zero}$ prior knowledge of the environment’s dynamics. We present a training framework for RobustZero that features a self-supervised representation network, targeting the generation of a consistent initial hidden state, which is key to obtaining consistent policies before and after state perturbations, and a unique loss function that facilitates robustness. We present an adaptive adjustment mechanism to enable model updates, enhancing robustness to both worst-case and random-case state perturbations. Experiments on two classical control environments, three energy system environments, and three transportation environments demonstrate that RobustZero can outperform state-of-the-art methods at defending against state perturbations.
Paperid:2653
Authors:Jinbo Wang · Mingze Wang · Zhanpeng Zhou · Junchi Yan · Weinan E · Lei Wu
Abstract:
Transformers have become the cornerstone of modern AI. Unlike traditional architectures, transformers exhibit a distinctive characteristic: diverse types of building blocks, such as embedding layers, normalization layers, self-attention mechanisms, and point-wise feed-forward networks, work collaboratively. Understanding the disparities and interactions among these blocks is therefore important. In this paper, we uncover a clear **sharpness disparity** across these blocks, which intriguingly emerges early in training and persists throughout the training process. Building on this insight, we propose a novel **Blockwise Learning Rate (LR)** strategy to accelerate large language model (LLM) pre-training. Specifically, by integrating Blockwise LR into AdamW, we consistently achieve lower terminal loss and nearly $2\times$ speedup compared to vanilla AdamW. This improvement is demonstrated across GPT-2 and LLaMA models, with model sizes ranging from 0.12B to 1.1B and datasets including OpenWebText and MiniPile. Finally, we incorporate Blockwise LR into Adam-mini (Zhang et al., 2024), a recently proposed memory-efficient variant of Adam, achieving a combined $2\times$ speedup and $2\times$ memory savings. These results underscore the potential of leveraging the sharpness disparity principle to improve LLM training.
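Mechanically, a blockwise LR amounts to per-block optimizer parameter groups. A minimal sketch: the multipliers below are placeholders (the paper derives the appropriate ratios from the measured sharpness disparity, not from these values), and the naming convention in `block_of` is assumed.

```python
# Placeholder per-block LR multipliers -- not the paper's tuned values.
BLOCK_LR_SCALE = {
    "embedding": 1.0,   # assumed: flatter blocks tolerate the base LR
    "attention": 0.5,
    "mlp": 0.5,
    "norm": 0.25,       # assumed: sharper blocks get a smaller LR
}

def block_of(param_name):
    """Map a parameter name to a block type (naming convention assumed)."""
    for key in BLOCK_LR_SCALE:
        if key in param_name:
            return key
    return "mlp"  # fallback for unrecognized names

def build_param_groups(named_params, base_lr):
    """Return optimizer-style param groups, one per block type, each with
    its own scaled learning rate."""
    groups = {}
    for name, param in named_params:
        blk = block_of(name)
        g = groups.setdefault(blk, {"params": [], "lr": base_lr * BLOCK_LR_SCALE[blk]})
        g["params"].append(param)
    return list(groups.values())
```

The resulting list could be handed directly to an optimizer that accepts per-group learning rates, such as AdamW.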
Paperid:2654
Authors:Jinhang Chai · Elynn Chen · Lin Yang
Abstract:
To bridge the gap between empirical success and theoretical understanding in transfer reinforcement learning (RL), we study a principled approach with provable performance guarantees. We introduce a novel composite MDP framework where high-dimensional transition dynamics are modeled as the sum of a low-rank component representing shared structure and a sparse component capturing task-specific variations. This relaxes the common assumption of purely low-rank transition models, allowing for more realistic scenarios where tasks share core dynamics but maintain individual variations. We introduce UCB-TQL (Upper Confidence Bound Transfer Q-Learning), designed for transfer RL scenarios where multiple tasks share core linear MDP dynamics but diverge along sparse dimensions. When applying UCB-TQL to a target task after training on a source task with sufficient trajectories, we achieve a regret bound of $\tilde{\mathcal{O}}(\sqrt{eH^5N})$ that scales independently of the ambient dimension. Here, $N$ represents the number of trajectories in the target task, while $e$ quantifies the sparse differences between tasks. This result demonstrates substantial improvement over single task RL by effectively leveraging their structural similarities. Our theoretical analysis provides rigorous guarantees for how UCB-TQL simultaneously exploits shared dynamics while adapting to task-specific variations.
Paperid:2655
Authors:Toby Boyne · Jose Pablo Folch · Robert Lee · Behrang Shafei · Ruth Misener
Abstract:
We perform Bayesian optimization using a Gaussian process perspective on Bayesian Additive Regression Trees (BART). Our BART Kernel (BARK) uses tree agreement to define a posterior over piecewise-constant functions, and we explore the space of tree kernels using a Markov chain Monte Carlo approach. Where BART only samples functions, the resulting BARK model obtains samples of Gaussian processes defining distributions over functions, which allow us to build acquisition functions for Bayesian optimization. Our tree-based approach enables global optimization over the surrogate, even for mixed-feature spaces. Moreover, where many previous tree-based kernels provide uncertainty quantification over function values, our sampling scheme captures uncertainty over the tree structure itself. Our experiments show the strong performance of BARK on both synthetic and applied benchmarks, due to the combination of our fully Bayesian surrogate and the optimization procedure.
Paperid:2656
Authors:Francesco Bacchiocchi · Jiarui Gan · Matteo Castiglioni · Alberto Marchesi · Nicola Gatti
Abstract:
Principal-agent problems model scenarios where a principal aims at incentivizing an agent to take costly, unobservable actions through the provision of payments. Such interactions are ubiquitous in several real-world applications, ranging from blockchain to the delegation of machine learning tasks. In this paper, we initiate the study of hidden-action principal-agent problems under approximate best responses, in which the agent may select any action that is not too much suboptimal given the principal's payment scheme (a.k.a. contract). Our main result is a polynomial-time algorithm to compute an optimal contract under approximate best responses. This is perhaps surprising, as computing an optimal commitment under approximate best responses is known to be computationally intractable in Stackelberg games. We also investigate the learnability of contracts under approximate best responses, by providing a no-regret learning algorithm for a natural application scenario where the principal does not know anything about the environment.
Paperid:2657
Authors:Gaurush Hiranandani · Haolun Wu · Subhojyoti Mukherjee · Sanmi Koyejo
Abstract:
Many commercial Large Language Models (LLMs) are often closed-source, limiting developers to prompt tuning for aligning content generation with specific applications. While these models currently do not provide access to token logits, we argue that if such access were available, it would enable more powerful adaptation techniques beyond prompt engineering. In this paper, we propose a token-level probability reweighting framework that, given access to logits and a small amount of task-specific data, can effectively steer black-box LLMs toward application-specific content generation. Our approach views next-token prediction through the lens of supervised classification. We show that aligning black-box LLMs with task-specific data can be formulated as a label noise correction problem, leading to the \emph{Plugin} model -- an autoregressive probability reweighting model that operates solely on logits. We provide theoretical justification for why reweighting logits alone is sufficient for task adaptation. Extensive experiments with multiple datasets, LLMs, and reweighting models demonstrate the effectiveness of our method, advocating for broader access to token logits in closed-source models.
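The logit-reweighting idea can be illustrated in a few lines. A minimal sketch, assuming per-token log-weights have already been fit on task-specific data (the placeholders and function names below are illustrative, not from the paper):

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reweight_next_token(logits, log_weights):
    """Sketch of autoregressive probability reweighting: the black-box
    model's next-token logits are shifted by learned per-token log-weights
    before the softmax, i.e. p'(y|x) is proportional to p(y|x) * w(y)."""
    return softmax([l + w for l, w in zip(logits, log_weights)])
```

For example, a log-weight of log 2 on one token doubles its unnormalized probability mass relative to the base model, before renormalization.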
Paperid:2658
Authors:Shuo He · Zhifang Zhang · Feng Liu · Roy Lee · Bo An · Lei Feng
Abstract:
We present a comprehensive empirical study on how backdoor attacks affect CLIP by analyzing the representations of backdoor images. Specifically, based on the methodology of representation decomposing, image representations can be decomposed into a sum of representations across individual image patches, attention heads (AHs), and multi-layer perceptrons (MLPs) in different model layers. By examining the effect of backdoor attacks on model components, we have the following empirical findings. (1) Different backdoor attacks would infect different model components, i.e., local patch-based backdoor attacks mainly affect AHs, while global perturbation-based backdoor attacks mainly affect MLPs. (2) Infected AHs are centered on the last layer, while infected MLPs are decentralized on several late layers. (3) Not all AHs in the last layer are infected and even some AHs could still maintain the original property-specific roles (e.g., "color" and "location"). These observations motivate us to defend against backdoor attacks by detecting infected AHs, repairing their representations, or filtering backdoor samples with too many infected AHs, in the inference stage. Experimental results validate our empirical findings and demonstrate the effectiveness of the defense methods.
Paperid:2659
Authors:Ruiyu Wang · Yu Yuan · Shizhao Sun · Jiang Bian
Abstract:
Creating Computer-Aided Design (CAD) models requires significant expertise and effort. Text-to-CAD, which converts textual descriptions into CAD parametric sequences, is crucial in streamlining this process. Recent studies have utilized ground-truth parametric sequences, known as sequential signals, as supervision to achieve this goal. However, CAD models are inherently multimodal, comprising parametric sequences and corresponding rendered visual objects. Besides, the rendering process from parametric sequences to visual objects is many-to-one. Therefore, both sequential and visual signals are critical for effective training. In this work, we introduce CADFusion, a framework that uses Large Language Models (LLMs) as the backbone and alternates between two training stages: the sequential learning (SL) stage and the visual feedback (VF) stage. In the SL stage, we train LLMs using ground-truth parametric sequences, enabling the generation of logically coherent parametric sequences. In the VF stage, we reward parametric sequences that render into visually preferred objects and penalize those that do not, allowing LLMs to learn how rendered visual objects are perceived and evaluated. These two stages alternate throughout the training, ensuring balanced learning and preserving the benefits of both signals. Experiments demonstrate that CADFusion significantly improves performance, both qualitatively and quantitatively.
Paperid:2660
Authors:Xiaojun Xu · jinghan jia · Yuanshun Yao · Yang Liu · Hang Li
Abstract:
We propose an imperceptible multi-bit text watermark embedded by paraphrasing with LLMs. We fine-tune a pair of LLM paraphrasers that are designed to behave differently so that their paraphrasing difference, reflected in the text semantics, can be identified by a trained decoder. To embed our multi-bit watermark, we use the two paraphrasers alternately to encode a pre-defined binary code at the sentence level. Then we use a text classifier as the decoder to decode each bit of the watermark. Through extensive experiments, we show that our watermarks can achieve over 99.99% detection AUC with small (1.1B) text paraphrasers while keeping the semantic information of the original sentence. More importantly, our pipeline is robust under word substitution and sentence paraphrasing perturbations and generalizes well to out-of-distribution data. We also show the stealthiness of our watermark with LLM-based evaluation.
Paperid:2661
Authors:Chen Jin · Ryutaro Tanno · Amrutha Saseendran · Tom Diethe · Philip Teare
Abstract:
We introduce Lavender, a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging the image understanding of Stable Diffusion models. Lavender directly aligns the core transformer attention in VLMs with the attention maps used by Stable Diffusion during SFT, instead of adapting separate encoders. This precise alignment enriches the model’s visual understanding and substantially improves text generation quality. Lavender requires just 0.13 million training examples—2.5\% of typical large-scale SFT datasets—and fine-tunes on standard hardware (8 GPUs) in a single day. It consistently improves SoTA VLMs (e.g., Llama-3.2-11B, MiniCPM-Llama3-v2.5), achieving up to 30\% gains and a 68\% boost on challenging out-of-distribution tasks like WorldMedQA. By efficiently transferring the visual expertise of image generators with minimal supervision, Lavender provides a scalable solution for building more accurate and efficient vision-language systems. All code, training data, and pretrained models will be released to the community.
Paperid:2662
Authors:Tuan Dam · Pascal Stenger · Lukas Schneider · Joni Pajarinen · Carlo D'Eramo · Odalric-Ambrym Maillard
Abstract:
This paper introduces a novel backup strategy for Monte Carlo Tree Search (MCTS) tailored for highly stochastic and partially observable Markov decision processes. We adopt a probabilistic approach, modeling both value and action-value nodes as Gaussian distributions, to introduce a novel backup operator that computes value nodes as the Wasserstein barycenter of their action-value children nodes; thus, propagating the uncertainty of the estimate across the tree to the root node. We study our novel backup operator when using a novel combination of $L^1$-Wasserstein barycenter with $\alpha$-divergence, by drawing a crucial connection to the generalized mean backup operator. We complement our probabilistic backup operator with two sampling strategies, based on optimistic selection and Thompson sampling, obtaining our Wasserstein MCTS algorithm. We provide theoretical guarantees of asymptotic convergence of $\mathcal{O}(n^{-1/2})$, with $n$ as the number of visited trajectories, to the optimal policy and an empirical evaluation on several stochastic and partially observable environments, where our approach outperforms well-known related baselines.
Paperid:2663
Authors:Daiqing Wu · Dongbao Yang · Sicheng Zhao · Yu ZHOU · Can Ma
Abstract:
The advancements in Multimodal Large Language Models (MLLMs) have enabled various multimodal tasks to be addressed under a zero-shot paradigm. This paradigm sidesteps the cost of model fine-tuning, emerging as a dominant trend in practical application. Nevertheless, Multimodal Sentiment Analysis (MSA), a pivotal challenge in the quest for general artificial intelligence, fails to accommodate this convenience. The zero-shot paradigm exhibits undesirable performance on MSA, casting doubt on whether MLLMs can perceive sentiments as competently as supervised models. By extending the zero-shot paradigm to In-Context Learning (ICL) and conducting an in-depth study on configuring demonstrations, we validate that MLLMs indeed possess such capability. Specifically, three key factors that cover demonstrations' retrieval, presentation, and distribution are comprehensively investigated and optimized. A sentimental predictive bias inherent in MLLMs is also discovered and later effectively counteracted. By complementing each other, the devised strategies for three factors result in average accuracy improvements of 15.9% on six MSA datasets against the zero-shot paradigm and 11.2% against the random ICL baseline.
Paperid:2664
Authors:Ye Mao · Weixun Luo · Junpeng Jing · Anlan Qiu · Krystian Mikolajczyk
Abstract:
The rise of vision-language foundation models marks an advancement in bridging the gap between human and machine capabilities in 3D scene reasoning. Existing 3D reasoning benchmarks assume real-time scene accessibility, which is impractical due to the high cost of frequent scene updates. To this end, we introduce Hypothetical 3D Reasoning, namely Hypo3D, a benchmark designed to evaluate models' ability to reason without access to real-time scene data. Models need to imagine the scene state based on a provided change description before reasoning. Hypo3D is formulated as a 3D Visual Question Answering (VQA) benchmark, comprising 7,727 context changes across 700 indoor scenes, resulting in 14,885 question-answer pairs. An anchor-based world frame is established for all scenes, ensuring consistent reference to a global frame for directional terms in context changes and QAs. Extensive experiments show that state-of-the-art foundation models struggle to reason effectively in hypothetically changed scenes. This reveals a substantial performance gap compared to humans, particularly in scenarios involving movement changes and directional reasoning. Even when the context change is irrelevant to the question, models often incorrectly adjust their answers. The benchmark will be publicly released.
Paperid:2665
Authors:Haoyu Wei · Runzhe Wan · Lei Shi · Rui Song
Abstract:
Many real-world bandit applications are characterized by sparse rewards, which can significantly hinder learning efficiency. Leveraging problem-specific structures for careful distribution modeling is recognized as essential for improving estimation efficiency in statistics. However, this approach remains under-explored in the context of bandits. To address this gap, we initiate the study of zero-inflated bandits, where the reward is modeled using a classic semi-parametric distribution known as the zero-inflated distribution. We develop algorithms based on the Upper Confidence Bound and Thompson Sampling frameworks for this specific structure. The superior empirical performance of these methods is demonstrated through extensive numerical studies.
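The zero-inflated structure can be made concrete with a small UCB-style sketch: each arm's reward is modeled as Z · Y, with Z ~ Bernoulli(p) indicating whether the reward is nonzero and Y its nonzero magnitude, and the two parts are estimated separately. The bonus shape and constants below are generic UCB choices for illustration, not the paper's algorithm.

```python
import math
import random

class ZeroInflatedUCB:
    """Illustrative UCB rule for zero-inflated rewards: estimate the
    nonzero-probability p and the nonzero magnitude mu separately, and
    be optimistic about their product."""
    def __init__(self, n_arms):
        self.n = [0] * n_arms          # pulls per arm
        self.nz = [0] * n_arms         # nonzero rewards seen per arm
        self.sum_y = [0.0] * n_arms    # sum of nonzero magnitudes
        self.t = 0

    def select(self):
        self.t += 1
        for a in range(len(self.n)):
            if self.n[a] == 0:
                return a               # pull each arm once first
        def ucb(a):
            bonus = math.sqrt(2 * math.log(self.t) / self.n[a])
            p_hat = self.nz[a] / self.n[a]
            mu_hat = self.sum_y[a] / self.nz[a] if self.nz[a] else 0.0
            return p_hat * mu_hat + bonus
        return max(range(len(self.n)), key=ucb)

    def update(self, arm, reward):
        self.n[arm] += 1
        if reward != 0:
            self.nz[arm] += 1
            self.sum_y[arm] += reward
```

On a toy two-arm instance where one arm's nonzero-probability is much higher, the rule concentrates its pulls on the better arm.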
Paperid:2666
Authors:Yuchang Zhu · Huizhe Zhang · Bingzhe Wu · Jintang Li · Zibin Zheng · Peilin Zhao · Liang Chen · Yatao Bian
Abstract:
Large language models (LLMs) are widely adopted to generate synthetic datasets for various natural language processing (NLP) tasks, such as text classification and summarization. However, accurately measuring the diversity of these synthetic datasets—an aspect crucial for robust model performance—remains a significant challenge. In this paper, we introduce DCScore, a novel method for measuring synthetic dataset diversity from a classification perspective. Specifically, DCScore formulates diversity evaluation as a sample classification task, leveraging mutual relationships among samples. We further provide theoretical verification of the diversity-related axioms satisfied by DCScore, highlighting its role as a principled diversity evaluation method. Experimental results on synthetic datasets reveal that DCScore shows a stronger correlation with multiple diversity pseudo-truths of evaluated datasets, underscoring its effectiveness. Moreover, both empirical and theoretical evidence demonstrate that DCScore substantially reduces computational costs compared to existing approaches. Our code is available via: https://anonymous.4open.science/r/DCScore-ICML.
Paperid:2667
Authors:Bingyi Kang · Yang Yue · Rui Lu · Zhijie Lin · Yang Zhao · Kaixin Wang · Gao Huang · Jiashi Feng
Abstract:
Scaling video generation models is believed to be promising in building world models that adhere to fundamental physical laws. However, whether these models can discover physical laws purely from vision can be questioned. A world model learning the true law should give predictions robust to nuances and correctly extrapolate on unseen scenarios. In this work, we evaluate across three key scenarios: in-distribution, out-of-distribution, and combinatorial generalization. We developed a 2D simulation testbed for object movement and collisions to generate videos deterministically governed by one or more classical mechanics laws. We focus on the scaling behavior of training diffusion-based video generation models to predict object movements based on initial frames. Our scaling experiments show perfect generalization within the distribution, measurable scaling behavior for combinatorial generalization, but failure in out-of-distribution scenarios. Further experiments reveal two key insights about the generalization mechanisms of these models: (1) the models fail to abstract general physical rules and instead exhibit "case-based" generalization behavior, i.e., mimicking the closest training example; (2) when generalizing to new cases, models are observed to prioritize different factors when referencing training data: color $>$ size $>$ velocity $>$ shape. Our study suggests that scaling alone is insufficient for video generation models to uncover fundamental physical laws.
Paperid:2668
Authors:Dongxiu Liu · Haoyi Niu · Zhihao Wang · Jinliang Zheng · Yinan Zheng · Zhonghong Ou · Jianming HU · Jianxiong Li · Xianyuan Zhan
Abstract:
Current robotic planning methods often rely on predicting multi-frame images with full pixel details. While this fine-grained approach can serve as a generic world model, it introduces two significant challenges for downstream policy learning: substantial computational costs that hinder real-time deployment, and accumulated inaccuracies that can mislead action extraction. Planning with coarse-grained subgoals partially alleviates efficiency issues. However, their forward planning schemes can still result in off-task predictions due to accumulation errors, leading to misalignment with long-term goals. This raises a critical question: can robotic planning be both efficient and accurate enough for real-time control in long-horizon, multi-stage tasks? To address this, we propose a Backward Planning scheme in Latent space (LBP), which begins by grounding the task into final latent goals, followed by recursively predicting intermediate subgoals closer to the current state. The grounded final goal enables backward subgoal planning to always remain aware of task completion, facilitating on-task prediction along the entire planning horizon. The subgoal-conditioned policy incorporates a learnable token to summarize the subgoal sequences and determines how each subgoal guides action extraction. Through extensive simulation and real-robot long-horizon experiments, we show that LBP outperforms existing fine-grained and forward planning methods, achieving SOTA performance. Project page: https://lbp-authors.github.io.
Paperid:2669
Authors:Qi Yang · Chenghao Zhang · Lubin Fan · Kun Ding · Jieping Ye · Shiming Xiang
Abstract:
Recent advancements in Large Vision Language Models (LVLMs) have significantly improved performance in Visual Question Answering (VQA) tasks through multimodal Retrieval-Augmented Generation (RAG). However, existing methods still face challenges, such as the scarcity of knowledge with reasoning examples and erratic responses from retrieved knowledge. To address these issues, in this study, we propose a multimodal RAG framework, termed RCTS, which enhances LVLMs by constructing a Reasoning Context-enriched knowledge base and a Tree Search re-ranking method. Specifically, we introduce a self-consistent evaluation mechanism to enrich the knowledge base with intrinsic reasoning patterns. We further propose a Monte Carlo Tree Search with Heuristic Rewards (MCTS-HR) to prioritize the most relevant examples. This ensures that LVLMs can leverage high-quality contextual reasoning for better and more consistent responses. Extensive experiments demonstrate that our framework achieves state-of-the-art performance on multiple VQA datasets, significantly outperforming In-Context Learning (ICL) and Vanilla-RAG methods. It highlights the effectiveness of our knowledge base and re-ranking method in improving LVLMs.
Paperid:2670
Authors:Albert Gong · Kamilė Stankevičiūtė · Chao Wan · Anmol Kabra · Raphael Thesmar · Johann Lee · Julius Klenke · Carla Gomes · Kilian Weinberger
Abstract:
High-quality benchmarks are essential for evaluating reasoning and retrieval capabilities of large language models (LLMs). However, curating datasets for this purpose is not a permanent solution as they are prone to data leakage and inflated performance results. To address these challenges, we propose PhantomWiki: a pipeline to generate unique, factually consistent document corpora with diverse question-answer pairs. Unlike prior work, PhantomWiki is neither a fixed dataset, nor is it based on any existing data. Instead, a new PhantomWiki instance is generated on demand for each evaluation. We vary the question difficulty and corpus size to disentangle reasoning and retrieval capabilities, respectively, and find that PhantomWiki datasets are surprisingly challenging for frontier LLMs. Thus, we contribute a scalable and data leakage-resistant framework for disentangled evaluation of reasoning, retrieval, and tool-use abilities.
Paperid:2671
Authors:Kapilan Balagopalan · Tuan Nguyen · Yao Zhao · Kwang-Sung Jun
Abstract:
The best arm identification problem requires identifying the best alternative (i.e., arm) in active experimentation using the smallest number of experiments (i.e., arm pulls), which is crucial for cost-efficient and timely decision-making processes. In the fixed confidence setting, an algorithm must stop data-dependently and return the estimated best arm with a correctness guarantee. Since this stopping time is random, we desire its distribution to have light tails. Unfortunately, many existing studies focus on high probability or in expectation bounds on the stopping time, which allow heavy tails and, for high probability bounds, even never stopping at all. We first prove that this never-stopping event can indeed happen for some popular algorithms. Motivated by this, we propose algorithms that provably enjoy an exponential-tailed stopping time, which improves upon the polynomial tail bound reported by Kalyanakrishnan et al. (2012). The first algorithm is based on a fixed budget algorithm called Sequential Halving along with a doubling trick. The second algorithm is a meta algorithm that takes in any fixed confidence algorithm with a high probability stopping guarantee and turns it into one that enjoys an exponential-tailed stopping time. Our results imply that there is much more to be desired for contemporary fixed confidence algorithms.
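The first construction mentioned above, Sequential Halving wrapped in a doubling trick, can be sketched as follows. The agreement-based stopping rule here is a deliberate simplification for illustration; the paper's algorithm uses a proper correctness certificate, and all names are assumptions.

```python
import math

def sequential_halving(pull, n_arms, budget):
    """Fixed-budget Sequential Halving: pull the surviving arms equally,
    drop the worse half each round, until one arm remains."""
    arms = list(range(n_arms))
    rounds = math.ceil(math.log2(n_arms))
    for _ in range(rounds):
        per_arm = max(1, budget // (len(arms) * rounds))
        means = {a: sum(pull(a) for _ in range(per_arm)) / per_arm for a in arms}
        arms = sorted(arms, key=means.get, reverse=True)[: max(1, len(arms) // 2)]
    return arms[0]

def doubling_best_arm(pull, n_arms, n_repeats=5, budget0=64):
    """Doubling-trick sketch: rerun the fixed-budget routine with
    geometrically growing budgets, stopping once consecutive runs agree
    (an illustrative stand-in for the paper's stopping certificate)."""
    budget, last = budget0, None
    for _ in range(n_repeats):
        best = sequential_halving(pull, n_arms, budget)
        if best == last:
            return best
        last, budget = best, budget * 2
    return last
```

Because each restart doubles the budget, the total cost is dominated by the last run, which is the standard argument behind doubling-trick constructions.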
Paperid:2672
Authors:Ryan McKenna · Yangsibo Huang · Amer Sinha · Borja de Balle Pigem · Zachary Charles · Christopher A. Choquette Choo · Badih Ghazi · Georgios Kaissis · Ravi Kumar · Ruibo Liu · Da Yu · Chiyuan Zhang
Abstract:
Scaling laws have emerged as important components of large language model (LLM) training as they can predict performance gains through scale, and provide guidance on important hyperparameter choices that would otherwise be expensive. LLMs also rely on large, high-quality training datasets, like those sourced from (sometimes sensitive) user data. Training models on this sensitive user data requires careful privacy protections like differential privacy (DP). However, the dynamics of DP training are significantly different, and consequently their scaling laws are not yet fully understood. In this work, we establish scaling laws that accurately model the intricacies of DP LLM training, providing a complete picture of the compute-privacy-utility tradeoffs and the optimal training configurations in many settings.
Paperid:2673
Authors:Guofeng Ding · Yiding Lu · Peng Hu · Mouxing Yang · Yijie Lin · Xi Peng
Abstract:
Text-to-visual retrieval often struggles with semantic redundancy and granularity mismatches between textual queries and visual content. Unlike existing methods that address these challenges during training, we propose Visual Abstraction (VISA), a test-time approach that enhances retrieval by transforming visual content into textual descriptions using off-the-shelf large models. The generated text descriptions, with their dense semantics, naturally filter out low-level redundant visual information. To further address granularity issues, VISA incorporates a question-answering process, enhancing the text description with the specific granularity information requested by the user. Extensive experiments demonstrate that VISA achieves state-of-the-art performance in text-to-image and text-to-video retrieval across both short- and long-context queries, offering a plug-and-play enhancement to existing retrieval systems. The code will be publicly released on GitHub upon acceptance.
Paperid:2674
Authors:Xinyan Liang · Ruijie Sang · Yuhua Qian · Qian Guo · Feijiang Li · Liang Du
Abstract:
Automatic Modulation Classification (AMC) serves as a foundational pillar for cognitive radio systems, enabling critical functionalities including dynamic spectrum allocation, non-cooperative signal surveillance, and adaptive waveform optimization. However, practical deployment of AMC faces a fundamental challenge: prediction ambiguity arising from intrinsic similarity among modulation schemes and exacerbated under low signal-to-noise ratio (SNR) conditions. This phenomenon manifests as near-identical probability distributions across confusable modulation types, significantly degrading classification reliability. To address this, we propose Fuzzy Regularization-enhanced AMC (FR-AMC), a novel framework that integrates uncertainty quantification into the classification pipeline. The proposed FR has three features: (1) it explicitly models prediction ambiguity during backpropagation, (2) it dynamically reweights samples through adaptive loss scaling, and (3) it encourages margin maximization between confusable modulation clusters. Experimental results on benchmark datasets demonstrate that FR achieves superior classification accuracy and robustness compared to competing methods, making it a promising solution for real-world spectrum management and communication applications.
Paperid:2675
Authors:Emanuel Ben Baruch · Adam Botach · Igor Kviatkovsky · Manoj Aggarwal · Gerard Medioni
Abstract:
With the increasing size of datasets used for training neural networks, data pruning has gained traction in recent years. However, most current data pruning algorithms are limited in their ability to preserve accuracy compared to models trained on the full data, especially in high pruning regimes. In this paper, we explore the application of data pruning while incorporating knowledge distillation (KD) when training on a pruned subset. That is, rather than relying solely on ground-truth labels, we also use the soft predictions from a teacher network pre-trained on the complete data. By integrating KD into training, we demonstrate significant improvement across datasets, pruning methods, and all pruning fractions. We first establish a theoretical motivation for employing self-distillation to improve training on pruned data. Then, we empirically make a compelling and highly practical observation: using KD, simple random pruning is comparable or superior to sophisticated pruning methods across all pruning regimes. On ImageNet, for example, we achieve superior accuracy despite training on a random subset of only 50% of the data. Additionally, we demonstrate a crucial connection between the pruning factor and the optimal knowledge distillation weight. This helps mitigate the impact of samples with noisy labels and low-quality images retained by typical pruning algorithms. Finally, we make an intriguing observation: when using lower pruning fractions, larger teachers lead to accuracy degradation, while surprisingly, employing teachers with a smaller capacity than the student's may improve results. Our code will be made available.
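The training objective described above, hard-label cross-entropy combined with distillation against a teacher's soft predictions, can be sketched in a few lines of NumPy. This assumes the standard Hinton-style KD loss with temperature `T` and mixing weight `alpha`; the paper's exact weighting scheme may differ:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    """(1 - alpha) * cross-entropy on hard labels + alpha * T^2 * KL(teacher || student)."""
    n = len(labels)
    p_student = softmax(student_logits)  # temperature 1 for the hard-label CE term
    ce = -np.log(p_student[np.arange(n), labels] + 1e-12).mean()
    p_t = softmax(teacher_logits, T)     # softened teacher targets
    p_s = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1).mean()
    return (1 - alpha) * ce + alpha * (T ** 2) * kl
```

With `alpha = 0` this reduces to ordinary supervised training on the pruned subset; the paper's observation is that a nonzero teacher term makes even random pruning competitive.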
Paperid:2676
Authors:Weiyu CHEN · James Kwok
Abstract:
Model merging, which combines multiple models into a single model, has gained popularity in recent years. By efficiently integrating the capabilities of various models, merging significantly reduces the parameter count and memory usage. However, current methods can only produce a single merged model. This necessitates a performance trade-off due to conflicts among the various models, and the resultant one-size-fits-all model may not align with the preferences of different users who may prioritize certain models over others. To address this issue, we propose preference-aware model merging, and formulate it as a multi-objective optimization problem in which the performance of the merged model on each base model's task is treated as an objective. In a single merging process, the proposed parameter-efficient structure generates a Pareto set of merged models, each representing a Pareto-optimal solution for a preference. Users can then select merged models tailored to their preferences from this learned Pareto set. Experimental results demonstrate that the proposed Pareto Merging produces diverse trade-off models and achieves higher test accuracy than state-of-the-art merging baselines.
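As a toy illustration of preference-conditioned merging, the simplest baseline is a convex combination of base-model parameters under a user preference vector. This sketch omits the paper's parameter-efficient Pareto-set structure entirely and only shows how a preference on the simplex selects one merged model:

```python
import numpy as np

def merge_models(param_list, preference):
    """Merge flattened parameter vectors by a user preference.

    param_list: list of 1-D arrays of identical shape, one per base model.
    preference: nonnegative weights, normalized here to sum to 1.
    """
    w = np.asarray(preference, dtype=float)
    w = w / w.sum()
    return sum(wi * p for wi, p in zip(w, param_list))
```

Sweeping `preference` over the simplex traces out a family of merged models; the paper's contribution is learning a structure that maps any such preference to a Pareto-optimal merge in a single training run, rather than naively interpolating weights as above.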
Paperid:2677
Authors:Yuan Xin · Dingfan Chen · Michael Backes · Xiao Zhang
Abstract:
As machine learning models are deployed in critical applications, robustness against adversarial perturbations is crucial. While numerous defensive algorithms have been proposed to counter such attacks, they typically assume that all adversarial transformations are equally important, an assumption that rarely aligns with real-world applications. To address this, we study the problem of robust learning against adversarial perturbations under cost-sensitive scenarios, where the potential harm of different types of misclassifications is encoded in a cost matrix. Our solution introduces a provably robust learning algorithm to certify and optimize for cost-sensitive robustness, building on the scalable certification framework of randomized smoothing. Specifically, we formalize the definition of the cost-sensitive certified radius, and propose a novel adaptation of the standard certification algorithm to generate tight robustness certificates tailored to any cost matrix. In addition, we design a robust training method that improves certified cost-sensitive robustness without compromising model accuracy. Extensive experiments on benchmark datasets, including challenging ones unsolvable by existing methods, demonstrate the effectiveness of our certification algorithm and training method across various cost-sensitive scenarios.
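The randomized-smoothing certificate this work builds on can be sketched as follows. This is a minimal version of the standard (cost-insensitive) certified radius of Cohen et al., where `p_a` and `p_b` are confidence bounds on the smoothed classifier's top-class and runner-up probabilities; the paper's cost-sensitive radius adapts this idea so that only misclassifications deemed costly by the cost matrix count against the certificate:

```python
from statistics import NormalDist

def certified_radius(p_a, p_b, sigma):
    """L2 certified radius of a Gaussian-smoothed classifier (standard form):
    sigma/2 * (Phi^-1(p_a) - Phi^-1(p_b)), where p_a lower-bounds the
    top-class probability and p_b upper-bounds the runner-up probability."""
    if p_a <= p_b:
        return 0.0  # no certificate can be issued
    inv = NormalDist().inv_cdf
    return sigma / 2.0 * (inv(p_a) - inv(p_b))
```

The radius grows with both the smoothing noise `sigma` and the gap between `p_a` and `p_b`; a cost-sensitive variant would take the maximum over `p_b` restricted to the costly target classes.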
Paperid:2678
Authors:Gaoyue Zhou · Hengkai Pan · Yann LeCun · Lerrel Pinto
Abstract:
The ability to predict future outcomes given control actions is fundamental for physical reasoning. However, such predictive models, often called world models, remain challenging to learn and are typically developed for task-specific solutions with online policy learning. To unlock world models' true potential, we argue that they should 1) be trainable on offline, pre-collected trajectories, 2) support test-time behavior optimization, and 3) facilitate task-agnostic reasoning. To this end, we present DINO World Model (DINO-WM), a new method to model visual dynamics without reconstructing the visual world. DINO-WM leverages spatial patch features pre-trained with DINOv2, enabling it to learn from offline behavioral trajectories by predicting future patch features. This allows DINO-WM to achieve observational goals through action sequence optimization, facilitating task-agnostic planning by treating goal features as prediction targets. We demonstrate that DINO-WM achieves zero-shot behavioral solutions at test time on six environments without expert demonstrations, reward modeling, or pre-learned inverse models, outperforming prior state-of-the-art work across diverse task families such as arbitrarily configured mazes, push manipulation with varied object shapes, and multi-particle scenarios.
Paperid:2679
Authors:Minghao Fu · Guo-Hua Wang · Liangfu Cao · Qing-Guo Chen · Zhao Xu · Weihua Luo · Kaifu Zhang
Abstract:
Diffusion models have emerged as a dominant approach for text-to-image generation. Key components such as human preference alignment and classifier-free guidance play a crucial role in ensuring generation quality. However, their independent application in current text-to-image models continues to face significant challenges in achieving strong text-image alignment, high generation quality, and consistency with human aesthetic standards. In this work, we, for the first time, explore facilitating the collaboration of human preference alignment and test-time sampling to unlock the potential of text-to-image models. Consequently, we introduce CHATS (Combining Human-Aligned optimization and Test-time Sampling), a novel generative framework that separately models the preferred and dispreferred distributions and employs a proxy-prompt-based sampling strategy to utilize the useful information contained in both distributions. We observe that CHATS exhibits exceptional data efficiency, achieving strong performance with only a small, high-quality fine-tuning dataset. Extensive experiments demonstrate that CHATS surpasses traditional preference alignment methods, setting a new state of the art across various standard benchmarks.
Paperid:2680
Authors:Rickmer Schulte · David Rügamer · Thomas Nagler
Abstract:
There is growing interest in extending average treatment effect (ATE) estimation to incorporate non-tabular data, such as images and text, which may act as sources of confounding. Neglecting these effects risks biased results and flawed scientific conclusions. However, non-tabular data cannot be handled using standard statistical methods, necessitating sophisticated feature extractors. In this work, we investigate how latent features from pre-trained neural networks can be leveraged to adjust for sources of confounding. We formalize conditions under which these latent features enable valid adjustment and statistical inference in ATE estimation, especially in the double machine learning context. We discuss critical challenges inherent to latent feature learning and downstream parameter estimation arising from the high dimensionality and non-identifiability of representations. Common structural assumptions for obtaining fast convergence rates with additive or sparse linear models are shown to be unrealistic for latent features. We argue, however, that neural networks are largely insensitive to these issues. In particular, we show that neural networks can achieve fast convergence rates by adapting to intrinsic notions of sparsity and dimension of the learning problem.
Paperid:2681
Authors:Qi Wang · Yuan Mi · Wang Haoyun · Yi Zhang · Ruizhi Chengze · Hongsheng Liu · Ji-Rong Wen · Hao Sun
Abstract:
Solving partial differential equations (PDEs) with numerical methods is computationally expensive when accurate solutions are needed, since fine grids and small time steps are required. Machine learning can accelerate this process, but struggles with weak generalizability, interpretability, and data dependency, and suffers in long-term prediction. To this end, we propose a PDE-embedded network with multiscale time stepping (MultiPDENet), which fuses numerical schemes with machine learning for accelerated simulation of flows. In particular, we design a convolutional filter based on the structure of finite-difference stencils, with a small number of parameters to optimize, which estimates the equivalent form of the spatial derivative on a coarse grid to minimize the equation's residual. A Physics Block with a 4th-order Runge-Kutta integrator at the fine time scale is established that embeds the structure of the PDEs to guide the prediction. To alleviate the curse of temporal error accumulation in long-term prediction, we introduce a multiscale time integration approach, where a neural network corrects the prediction error at a coarse time scale. Experiments across various PDE systems, including the Navier-Stokes equations, demonstrate that MultiPDENet can accurately predict long-term spatiotemporal dynamics, even given small and incomplete training data, e.g., spatiotemporally down-sampled datasets. MultiPDENet achieves state-of-the-art performance compared with other neural baseline models, with a clear speedup over classical numerical methods.
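The two numerical ingredients named above, a finite-difference stencil for spatial derivatives and a classical 4th-order Runge-Kutta time integrator, can be sketched on a 1D heat equation. This is a plain numerical illustration of the Physics Block's building blocks, not the learned MultiPDENet filter:

```python
import numpy as np

def laplacian_1d(u, dx):
    # second-order central finite-difference stencil, periodic boundary
    return (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2

def rk4_step(u, dt, rhs):
    # classical 4th-order Runge-Kutta time integrator
    k1 = rhs(u)
    k2 = rhs(u + 0.5 * dt * k1)
    k3 = rhs(u + 0.5 * dt * k2)
    k4 = rhs(u + dt * k3)
    return u + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# heat equation u_t = nu * u_xx on a periodic grid
nu, dx, dt = 0.1, 0.1, 0.01
x = np.arange(0.0, 1.0, dx)
u = np.sin(2 * np.pi * x)
for _ in range(100):
    u = rk4_step(u, dt, lambda v: nu * laplacian_1d(v, dx))
```

MultiPDENet replaces the fixed stencil coefficients with a small number of learnable parameters and wraps a corrective network around the coarse time scale; the hand-coded version above is the non-learned baseline it starts from.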
Paperid:2682
Authors:David Vigouroux · Joseba Dalmau · Louis Béthune · Victor Boutin
Abstract:
Although Artificial Neural Networks (ANNs) have achieved remarkable success across various tasks, they still suffer from limited generalization. We hypothesize that this limitation arises from the traditional sample-based (0-dimensional) regularization used in ANNs. To overcome this, we introduce Deep Sturm-Liouville (DSL), a novel function approximator that enables continuous 1D regularization along field lines in the input space by integrating the Sturm-Liouville Theorem (SLT) into the deep learning framework. DSL defines field lines traversing the input space, along which a Sturm-Liouville problem is solved to generate orthogonal basis functions, enforcing implicit regularization thanks to the desirable properties of SLT. These basis functions are linearly combined to construct the DSL approximator. Both the vector field and basis functions are parameterized by neural networks and learned jointly. We demonstrate that the DSL formulation naturally arises when solving a Rank-1 Parabolic Eigenvalue Problem. DSL is trained efficiently using stochastic gradient descent via implicit differentiation and achieves competitive performance on diverse multivariate datasets, including high-dimensional image datasets such as MNIST and CIFAR-10.
Paperid:2683
Authors:Sathvik Chereddy · John Femiani
Abstract:
We present SketchDNN, a generative model for synthesizing CAD sketches by jointly modeling continuous parameters and discrete class labels through a continuous-discrete diffusion process. A key innovation is our Gaussian-Softmax diffusion for discrete variables, which perturbs logits with Gaussian noise and projects them onto the probability simplex via a softmax transformation, enabling permutation-invariant denoising and superposition of class labels. This formulation addresses the challenges of heterogeneous primitive parameterizations and unordered primitives in CAD sketches. Our approach significantly improves generation quality, reducing Fréchet Inception Distance (FID) from 16.04 to 7.80 and negative log-likelihood (NLL) from 84.8 to 81.33, establishing a new state of the art in CAD sketch generation on the SketchGraphs dataset.
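The forward perturbation of the Gaussian-Softmax diffusion can be sketched directly from its description: add Gaussian noise to the class logits, then project onto the probability simplex via softmax. A minimal NumPy version (the noise schedule and the denoiser are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stability shift
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gaussian_softmax_noise(class_labels, num_classes, sigma):
    """Forward perturbation: one-hot logits + Gaussian noise, projected
    onto the probability simplex via softmax."""
    logits = np.eye(num_classes)[class_labels]  # one-hot logits, shape (n, num_classes)
    noisy = logits + sigma * rng.standard_normal(logits.shape)
    return softmax(noisy)
```

Because softmax maps any real vector to the simplex, the noised labels remain valid categorical distributions at every noise level, which is what permits superposition of class labels during denoising.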
Paperid:2684
Authors:Weichen Li · Albert Jan · Baishakhi Ray · Junfeng Yang · Chengzhi Mao · Kexin Pei
Abstract:
Code editing is a foundational task in software development, where its effectiveness depends on whether it introduces desired code property changes without changing the original code's intended functionality. Existing approaches often formulate code editing as an implicit end-to-end task, omitting the fact that code-editing procedures inherently consist of discrete and explicit steps, and thus suffer from suboptimal performance and a lack of robustness and generalization. We introduce EditLord, a code editing framework that makes the code transformation steps explicit. Our key insight is to employ a language model (LM) as an inductive learner to extract code editing rules from the training code pairs as concise meta-rule sets. Such rule sets are manifested for each training sample to augment it for fine-tuning, or to assist in prompting-based and iterative code editing. EditLord outperforms the state of the art by an average of 22.7% in editing performance and 58.1% in robustness, while achieving 20.2% higher functional correctness, across critical software engineering and security applications, LM models, and editing modes.
Paperid:2685
Authors:zhemin Li · Hongxia Wang · Liyuan Ma · Yaoyun Zeng · 晓龙 韩
Abstract:
Implicit Neural Representations (INRs) rely heavily on architectural choices for good generalization. Developing theoretically grounded approaches for architecture design remains an active area of research. Via theoretical analysis of the infinite-width limit, we establish a methodology that characterizes INR generalization by means of kernel alignment. We first formulate the optimal kernel that minimizes pointwise expected squared error, then demonstrate that the Neural Tangent Kernel of the composed function (INR with input encoding) can approximate any positive semidefinite dot-product kernel through input feature-mapping adjustments. Building upon these insights, we propose a Kernel Alignment Regularizer (KAR) that naturally integrates with existing INR systems to enhance kernel alignment. We further develop Plug-in Encoding for Aligned Kernels (PEAK) to refine INR models with KAR using learnable input encoding. This work contributes to the ongoing research efforts in bridging theory and practice for principled INR architecture design.
Paperid:2686
Authors:Filippo Lazzati · Alberto Maria Metelli
Abstract:
Although it is well-known that humans commonly engage in *risk-sensitive* behaviors in the presence of stochasticity, most Inverse Reinforcement Learning (IRL) models assume a *risk-neutral* agent. Such models thus $(i)$ introduce model misspecification and $(ii)$ do not permit direct inference of the risk attitude of the observed agent, which can be useful in many applications. In this paper, we propose a novel model of behavior to cope with these issues. By allowing for risk sensitivity, our model alleviates $(i)$, and by explicitly representing risk attitudes through (learnable) *utility* functions, it solves $(ii)$. Then, we characterize the partial identifiability of an agent's utility under the new model and note that demonstrations from multiple environments mitigate the problem. We devise two provably-efficient algorithms for learning utilities in a finite-data regime, and we conclude with some proof-of-concept experiments to validate *both* our model and our algorithms.
Paperid:2687
Authors:Xiangqi Li · Libo Huang · Zhulin An · Weilun Feng · Chuanguang Yang · Boyu Diao · Fei Wang · Yongjun Xu
Abstract:
3D few-shot class-incremental learning (FSCIL) aims to learn new point cloud categories from limited samples while preventing the forgetting of previously learned categories. This research area significantly enhances the capabilities of self-driving vehicles and computer vision systems. Existing 3D FSCIL approaches primarily utilize multimodal pre-trained models to extract semantic features, depending heavily on meticulously designed high-quality prompts and fine-tuning strategies. To reduce this dependence, this paper proposes a novel method for 3D FSCIL with Embedded Geometric features (3D-FLEG). Specifically, 3D-FLEG develops a point cloud geometric feature extraction module to capture category-related geometric characteristics. To address the modality heterogeneity issues that arise from integrating geometric and text features, 3D-FLEG introduces a geometric feature embedding module. By augmenting text prompts with spatial geometric features through these modules, 3D-FLEG can learn robust representations of new categories even with limited samples, while mitigating forgetting of the previously learned categories. Experiments conducted on several publicly available 3D point cloud datasets, including ModelNet, ShapeNet, ScanObjectNN, and CO3D, demonstrate 3D-FLEG's superiority over existing state-of-the-art 3D FSCIL methods.
Paperid:2688
Authors:Jiacheng Cheng · Xiwen Yao · Xiang Yuan · Junwei Han
Abstract:
The substantial computational demands of detection transformers (DETRs) hinder their deployment in resource-constrained scenarios, with the encoder consistently emerging as a critical bottleneck. A promising solution lies in reducing token redundancy within the encoder. However, existing methods perform static sparsification while ignoring the varying importance of tokens across different levels and encoder blocks for object detection, leading to suboptimal sparsification and performance degradation. In this paper, we propose Dynamic DETR (Dynamic token aggregation for DEtection TRansformers), a novel strategy that leverages inherent importance distribution to control token density and performs multi-level token sparsification. Within each stage, we apply a proximal aggregation paradigm for low-level tokens to maintain spatial integrity, and a holistic strategy for high-level tokens to capture broader contextual information. Furthermore, we propose center-distance regularization to align the distribution of tokens throughout the sparsification process, thereby facilitating representation consistency and effectively preserving critical object-specific patterns. Extensive experiments on canonical DETR models demonstrate that Dynamic DETR is broadly applicable across various models and consistently outperforms existing token sparsification methods.
Paperid:2689
Authors:Shuai Zhang · Chuan Zhou · Yang Liu · PENG ZHANG · Xixun Lin · Shirui Pan
Abstract:
Anomaly detection in continuous-time event sequences is an essential and challenging task in reliable and safety-critical applications. Existing approaches concentrate on developing a superior test statistic, but fail to provide guarantees on the false positive rate (FPR), thus lacking reliability in practical deployment. In this paper, we propose CADES (Conformal Anomaly Detection in Event Sequences), a novel test procedure based on conformal inference for the studied task with finite-sample FPR control. Specifically, by using the time-rescaling theorem, we design two powerful non-conformity scores tailored to event sequences, which exhibit complementary sensitivities to different abnormal patterns. CADES combines these scores with Bonferroni correction to leverage their respective strengths and addresses non-identifiability issues of existing approaches. Theoretically, we prove the validity of CADES and further provide strong guarantees on calibration-conditional FPR control. Experimental results on synthetic and real-world datasets, across various kinds of anomalies, demonstrate that CADES outperforms state-of-the-art methods while maintaining FPR control.
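The conformal machinery underlying CADES can be sketched generically: a split-conformal p-value for each non-conformity score, combined with a Bonferroni correction over the two scores. The time-rescaling-based scores themselves are omitted; any scalar non-conformity score (larger = more anomalous) can be plugged in:

```python
def conformal_p_value(cal_scores, test_score):
    """Split-conformal p-value: rank of the test non-conformity score
    among the calibration scores (larger score = more anomalous)."""
    n = len(cal_scores)
    ge = sum(1 for s in cal_scores if s >= test_score)
    return (1 + ge) / (n + 1)

def bonferroni_detect(cal_scores_1, cal_scores_2, s1, s2, alpha=0.05):
    """Flag an anomaly if either conformal p-value clears the
    Bonferroni-corrected level alpha/2, controlling the FPR at alpha."""
    p1 = conformal_p_value(cal_scores_1, s1)
    p2 = conformal_p_value(cal_scores_2, s2)
    return min(p1, p2) <= alpha / 2
```

The finite-sample FPR guarantee comes from exchangeability of the calibration and test scores under the null; the Bonferroni split lets the two scores cover complementary anomaly types without inflating the overall FPR.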
Paperid:2690
Authors:Jiayi Pan · Xingyao Wang · Graham Neubig · Navdeep Jaitly · Heng Ji · Alane Suhr · Yizhe Zhang
Abstract:
We present SWE-Gym, the first environment for training real-world software engineering (SWE) agents. SWE-Gym contains 2,438 real-world Python task instances, each comprising a codebase with an executable runtime environment, unit tests, and a task specified in natural language. We use SWE-Gym to train language-model-based SWE agents, achieving up to 19% absolute gains in resolve rate on the popular SWE-Bench Verified and Lite test sets. We also experiment with inference-time scaling through verifiers trained on agent trajectories sampled from SWE-Gym. When combined with our fine-tuned SWE agents, we achieve 32.0% and 26.0% on SWE-Bench Verified and Lite, respectively, reflecting a new state of the art for open-weight SWE agents. To facilitate further research, we publicly release SWE-Gym, models, and agent trajectories.
Paperid:2691
Authors:Xiaoyi Dong · Jian Cheng · Xi Zhang
Abstract:
The Soft Actor-Critic (SAC) algorithm with a Gaussian policy has become a mainstream implementation for realizing the Maximum Entropy Reinforcement Learning (MaxEnt RL) objective, which incorporates entropy maximization to encourage exploration and enhance policy robustness. While the Gaussian policy performs well on simpler tasks, its exploration capacity and potential performance in complex multi-goal RL environments are limited by its inherent unimodality. In this paper, we employ the diffusion model, a powerful generative model capable of capturing complex multimodal distributions, as the policy representation to fulfill the MaxEnt RL objective, developing a method named MaxEnt RL with Diffusion Policy (MaxEntDP). Our method enables efficient exploration and brings the policy closer to the optimal MaxEnt policy. Experimental results on Mujoco benchmarks show that MaxEntDP outperforms the Gaussian policy and other generative models within the MaxEnt RL framework, and performs comparably to other state-of-the-art diffusion-based online RL algorithms. Our code is available at https://github.com/diffusionyes/MaxEntDP.
Paperid:2692
Authors:Hansheng Chen · Kai Zhang · Hao Tan · Zexiang Xu · Fujun Luan · Leonidas Guibas · Gordon Wetzstein · Sai Bi
Abstract:
Diffusion models approximate the denoising distribution as a Gaussian and predict its mean, whereas flow matching models reparameterize the Gaussian mean as flow velocity. However, they underperform in few-step sampling due to discretization error and tend to produce over-saturated colors under classifier-free guidance (CFG). To address these limitations, we propose a novel Gaussian mixture flow matching (GMFlow) model: instead of predicting the mean, GMFlow predicts dynamic Gaussian mixture (GM) parameters to capture a multi-modal flow velocity distribution, which can be learned with a KL divergence loss. We demonstrate that GMFlow generalizes previous diffusion and flow matching models, where a single Gaussian is learned with an $L_2$ denoising loss. For inference, we derive GM-SDE/ODE solvers that leverage analytic denoising distributions and velocity fields for precise few-step sampling. Furthermore, we introduce a novel probabilistic guidance scheme that mitigates the over-saturation issues of CFG and improves image generation quality. Extensive experiments demonstrate that GMFlow consistently outperforms flow matching baselines in generation quality, achieving a Precision of 0.942 with only 6 sampling steps on ImageNet 256 $\times$ 256. We will release the code upon publication.
Paperid:2693
Authors:Guancheng Wan · Zewen Liu · Xiaojun Shan · Max Lau · B. Aditya Prakash · Wei Jin
Abstract:
Effective epidemic forecasting is critical for public health strategies and efficient medical resource allocation, especially in the face of rapidly spreading infectious diseases. However, existing deep learning methods often overlook the dynamic nature of epidemics and fail to account for the specific mechanisms of disease transmission. In response to these challenges, we introduce an innovative end-to-end framework called Epidemiology-Aware Neural ODE with Continuous Disease Transmission Graph (EARTH) in this paper. To learn continuous and regional disease transmission patterns, we first propose EANO, which seamlessly integrates the neural ODE approach with the epidemic mechanism, considering the complex spatial spread process during epidemic evolution. Additionally, we introduce GLTG to model global infection trends and leverage these signals to guide local transmission dynamically. To accommodate both the global coherence of epidemic trends and the local nuances of epidemic transmission patterns, we build a cross-attention approach to fuse the most meaningful information for forecasting. Through the smooth synergy of both components, EARTH offers a more robust and flexible approach to understanding and predicting the spread of infectious diseases. Extensive experiments show EARTH's superior performance in forecasting real-world epidemics compared to state-of-the-art methods. The code is available at https://anonymous.4open.science/r/EARTH.
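The epidemic mechanism that EANO embeds into a neural ODE can be illustrated with the classic SIR compartmental model, integrated here with a plain explicit Euler step; EARTH itself learns the transmission dynamics rather than fixing `beta` and `gamma` as below:

```python
import numpy as np

def sir_rhs(state, beta, gamma):
    # classic SIR mechanism: S' = -beta*S*I, I' = beta*S*I - gamma*I, R' = gamma*I
    S, I, R = state
    new_inf = beta * S * I
    return np.array([-new_inf, new_inf - gamma * I, gamma * I])

def simulate_sir(S0, I0, R0, beta, gamma, dt=0.1, steps=1000):
    state = np.array([S0, I0, R0], dtype=float)
    traj = [state.copy()]
    for _ in range(steps):
        state = state + dt * sir_rhs(state, beta, gamma)  # explicit Euler step
        traj.append(state.copy())
    return np.array(traj)

traj = simulate_sir(S0=0.99, I0=0.01, R0=0.0, beta=0.5, gamma=0.1)
```

A neural-ODE variant replaces `sir_rhs` (or parts of it) with a learned network while keeping the compartmental structure, which is the kind of mechanism-aware modeling the abstract describes.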
Paperid:2694
Authors:Simone Bombari · Marco Mondelli
Abstract:
Learning models have been shown to rely on spurious correlations between non-predictive features and the associated labels in the training data, with negative implications on robustness, bias, and fairness. In this work, we provide a statistical characterization of this phenomenon for high-dimensional regression, when the data contains a predictive *core* feature $x$ and a *spurious* feature $y$. Specifically, we quantify the amount of spurious correlations $\mathcal C$ learned via linear regression, in terms of the data covariance and the strength $\lambda$ of the ridge regularization. As a consequence, we first capture the simplicity of $y$ through the spectrum of its covariance, and its correlation with $x$ through the Schur complement of the full data covariance. Next, we prove a trade-off between $\mathcal C$ and the in-distribution test loss $\mathcal L$, by showing that the value of $\lambda$ that minimizes $\mathcal L$ lies in an interval where $\mathcal C$ is increasing. Finally, we investigate the effects of over-parameterization via the random features model, by showing its equivalence to regularized linear regression. Our theoretical results are supported by numerical experiments on Gaussian, Color-MNIST, and CIFAR-10 datasets.
Paperid:2695
Authors:Ruinan Jin · Xiao Li · Yaoliang Yu · Baoxiang Wang
Abstract:
Adaptive moment estimation (Adam) is a cornerstone optimization algorithm in deep learning, widely recognized for its flexibility with adaptive learning rates and efficiency in handling large-scale data. However, despite its practical success, the theoretical understanding of Adam's convergence has been constrained by stringent assumptions, such as almost surely bounded stochastic gradients or uniformly bounded gradients, which are more restrictive than those typically required for analyzing stochastic gradient descent (SGD). In this paper, we introduce a novel and comprehensive framework for analyzing the convergence properties of Adam. This framework offers a versatile approach to establishing Adam's convergence. Specifically, we prove that Adam achieves asymptotic (last-iterate) convergence in both the almost sure sense and the $L_1$ sense under the relaxed assumptions typically used for SGD, namely $L$-smoothness and the ABC inequality. Meanwhile, under the same assumptions, we show that Adam attains non-asymptotic sample complexity bounds similar to those of SGD.
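For reference, the Adam iteration analyzed here is the standard one: exponential moving averages of the gradient and its square, with bias correction. A minimal NumPy sketch on a toy quadratic:

```python
import numpy as np

def adam_minimize(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Plain Adam: first/second moment estimates with bias correction."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)  # first-moment (mean) estimate
    v = np.zeros_like(x)  # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# minimize f(x) = ||x - 3||^2, whose gradient is 2 * (x - 3)
x_star = adam_minimize(lambda x: 2 * (x - 3.0), np.zeros(2))
```

The convergence results in the paper concern exactly this update under stochastic gradients, with $L$-smoothness and the ABC inequality replacing the bounded-gradient assumptions of earlier analyses.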
Paperid:2696
Authors:Seokhun Park · Insung Kong · yongchan Choi · Chanmoo Park · Yongdai Kim
Abstract:
Interpretability for machine learning models is becoming more and more important as machine learning models become more complex. The functional ANOVA model, which decomposes a high-dimensional function into a sum of lower-dimensional functions (commonly referred to as components), is one of the most popular tools for interpretable AI, and recently, various neural networks have been developed for estimating each component in the functional ANOVA model. However, such neural networks are highly unstable when estimating each component since the components themselves are not uniquely defined. That is, there are multiple functional ANOVA decompositions for a given function. In this paper, we propose a novel neural network which guarantees a unique functional ANOVA decomposition and thus is able to estimate each component stably and accurately. We call our proposed neural network ANOVA Tensor Product Neural Network (ANOVA-TPNN), since it is motivated by the tensor product basis expansion. Theoretically, we prove that ANOVA-TPNN can approximate any smooth function well. Empirically, we show that ANOVA-TPNN provides much more stable estimation of each component, and thus much more stable interpretation, than existing neural networks when training data and initial values of the model parameters vary.
Paperid:2697
Authors:Chungpa Lee · Sehee Lim · Kibok Lee · Jy-yong Sohn
Abstract:
Contrastive Learning (CL) has emerged as a powerful self-supervised learning framework that learns representations by bringing similar views of the same instance (positive pairs) closer while pushing views of different instances (negative pairs) apart in the embedding space. This paper analyzes existing CL objectives through the lens of embedding pair similarities. In full-batch CL settings, we theoretically prove that when negative-pair similarities fall below a certain threshold, the perfect alignment of positive pairs cannot be achieved. We mathematically show that this misalignment issue, present in a class of CL losses, can be mitigated by incorporating within-view negative pairs that are typically overlooked. Furthermore, our analysis in mini-batch training scenarios reveals that negative pairs within the same mini-batch exhibit stronger separation compared to pairs across different batches, increasing the variance of negative-pair similarities as batch sizes decrease. To address this issue, we introduce the Variance-Reduction for Negative-pair similarity (VRN) loss term. Our empirical results show that integrating VRN into existing CL loss functions improves the performance of CL methods, especially in small-batch training scenarios.
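A toy version of the idea can be sketched as an InfoNCE-style loss plus a penalty on the variance of negative-pair similarities. The exact form of the paper's VRN term is not given in the abstract, so the `lam * neg.var()` penalty below is an assumption for illustration only:

```python
import numpy as np

def normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def info_nce_with_vrn(z1, z2, tau=0.5, lam=0.1):
    """InfoNCE over a batch of positive pairs (z1[i], z2[i]) plus a
    variance penalty on negative-pair similarities (toy VRN-style term)."""
    z1, z2 = normalize(z1), normalize(z2)
    sim = z1 @ z2.T / tau                    # (n, n) similarity logits
    n = sim.shape[0]
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    nce = (logsumexp - np.diag(sim)).mean()  # -log softmax of the positives
    neg = sim[~np.eye(n, dtype=bool)]        # off-diagonal entries = negative pairs
    return nce + lam * neg.var()
```

Shrinking the spread of negative-pair similarities is the mechanism the abstract attributes to VRN for small batches, where that variance is largest.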
Paperid:2698
Authors:Yilun Zhou · Austin Xu · PeiFeng Wang · Caiming Xiong · Shafiq Joty
Abstract:
Scaling test-time computation, or affording a generator large language model (LLM) extra compute during inference, typically employs the help of external non-generative evaluators (i.e., reward models). Concurrently, LLM-judges, models trained to generate evaluations and critiques (explanations) in natural language, are becoming increasingly popular in automatic evaluation. Despite judges' empirical successes, their effectiveness as evaluators in test-time scaling settings is largely unknown. In this paper, we introduce the Judge Evaluation for Test-Time Scaling (JETTS) benchmark, which evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings: response reranking, step-level beam search, and critique-based response refinement. We evaluate 10 different judge models (7B-70B parameters) for 8 different base generator models (6.7B-72B parameters). Our benchmark shows that while judges are competitive with outcome reward models in reranking, they are consistently worse than process reward models in beam search procedures. Furthermore, although natural language critiques are unique to LLM-judges, they are currently ineffective in guiding the generator towards better responses. We will open-source JETTS and maintain a leaderboard website after the anonymous review period.
Paperid:2699
Authors:Xuening Feng · Zhaohui Jiang · Timo Kaufmann · Eyke Hüllermeier · Paul Weng · Yifei Zhu
Abstract:
Learning human objectives from preference feedback has significantly advanced reinforcement learning (RL) in domains with hard-to-formalize objectives. However, traditional methods based on pairwise trajectory comparisons face notable challenges, including the difficulty of comparing trajectories with subtle differences and the limitation of conveying only ordinal information, which precludes direct inference of preference strength. In this paper, we introduce a novel distinguishability query, allowing humans to express preference strength by comparing two pairs of trajectories. Labelers first indicate which pair is easier to compare, then provide preference feedback only on the easier pair. Our proposed query type directly captures preference strength and is expected to reduce the cognitive load on the labeler. We further connect this query to cardinal utility and difference relations, and develop an efficient query selection scheme to achieve a better trade-off between query informativeness and ease of labeling. Experimental results demonstrate the potential of our method for faster, data-efficient learning and improved user-friendliness in RLHF benchmarks, particularly in classical control settings where preference strength is critical for expected utility maximization.
Paperid:2700
Authors:Niclas Dern · John Cunningham · Geoff Pleiss
Abstract:
Classic ensembles generalize better than any single component model. In contrast, recent empirical studies find that modern ensembles of (overparameterized) neural networks may not provide any inherent generalization advantage over single, larger neural networks. This paper clarifies how modern overparameterized ensembles differ from their classic underparameterized counterparts, using ensembles of random feature (RF) regressors as a basis for developing theory. In contrast to the underparameterized regime, where ensembling typically induces regularization and increases generalization, we prove with minimal assumptions that infinite ensembles of overparameterized RF regressors become pointwise equivalent to (single) infinite-width RF regressors, and finite-width ensembles rapidly converge to single models with the same parameter budget. These results, which are exact for ridgeless models and approximate for small ridge penalties, imply that overparameterized ensembles and single large models exhibit nearly identical generalization. We further characterize the predictive variance amongst ensemble members, demonstrating that it quantifies the expected effects of increasing capacity rather than capturing any conventional notion of uncertainty. Our results challenge common assumptions about the advantages of ensembles in overparameterized settings, prompting a reconsideration of how well intuitions from underparameterized ensembles transfer to deep ensembles and the overparameterized regime.
Paperid:2701
Authors:Gaole Dai · Chun-Kai Fan · Yiming Tang · Zhi Zhang · Yuan Zhang · Yulu Gan · Qizhe Zhang · Cheng-Ching Tseng · Shanghang Zhang · Tiejun Huang
Abstract:
Advances in Parameter-efficient Fine-tuning (PEFT) bridged the performance gap with Full Fine-Tuning (FFT) through sophisticated analysis of pre-trained parameter spaces. Drawing insights from Neural Engrams (NE) in Biological Neural Networks (BNNs), we establish a connection between the low-rank property observed during PEFT's parameter space shifting and neurobiological mechanisms. This observation leads to our proposed method, Synapse and Neuron (SAN), which decomposes and propagates the scaling component from anterior feature adjustment vectors towards posterior weight matrices. Our approach is theoretically grounded in Long-Term Potentiation/Depression (LTP/D) phenomena, which govern synapse development through neurotransmitter release modulation. Extensive experiments demonstrate its effectiveness: on vision tasks across VTAB, FGVC, and GIC (25 datasets) using ViT, Swin-T and ConvNeXt architectures, SAN outperforms FFT by up to 8.7% and LoRA by 3.2%; on language tasks using Commonsense Reasoning (8 datasets) with LLaMA models (all generations), it surpasses ChatGPT by up to 8.5% and LoRA by 4.7%; on vision-language tasks using Visual Instruction Tuning (7 datasets) with LLaVA models, it exceeds FFT by up to 2.4% and LoRA by 1.9%. Our code and W&B logs will be released.
Paperid:2702
Authors:Zhichen Dong · Zhanhui Zhou · Zhixuan Liu · Chao Yang · Chaochao Lu
Abstract:
In this work, we argue that large language models (LLMs), though trained to predict only the next token, exhibit emergent planning behaviors: $\textbf{their hidden representations encode future outputs beyond the next token}$. Through simple probing, we demonstrate that LLM prompt representations encode global attributes of their entire responses, including $\textit{structural attributes}$ (response length, reasoning steps), $\textit{content attributes}$ (character choices in story writing, multiple-choice answers at the end of a response), and $\textit{behavioral attributes}$ (answer confidence, factual consistency). In addition to identifying response planning, we explore how it scales with model size across tasks and how it evolves during generation. The finding that LLMs plan ahead in their hidden representations suggests potential applications for improving transparency and generation control.
Paperid:2703
Authors:Hayden McTavish · Zachery Boner · Jon Donnelly · Margo Seltzer · Cynthia Rudin
Abstract:
Decision trees are widely used for interpretable machine learning due to their clearly structured reasoning process. However, this structure belies a challenge we refer to as predictive equivalence: a given tree's decision boundary can be represented by many different decision trees. The presence of models with identical decision boundaries but different evaluation processes makes model selection challenging. The models will have different variable importance and behave differently in the presence of missing values, but most optimization procedures will arbitrarily choose one such model to return. We present a Boolean logical representation of decision trees that does not exhibit predictive equivalence and is faithful to the underlying decision boundary. We apply our representation to several downstream machine learning tasks. Using our representation, we show that decision trees are surprisingly robust to test-time missingness of feature values; we address predictive equivalence's impact on quantifying variable importance; and we present an algorithm to optimize the cost of reaching predictions.
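Predictive equivalence is easy to exhibit with a hypothetical toy example (not from the paper): the two trees below evaluate features in different orders yet realize the same Boolean decision boundary, x1 OR x2.

```python
def tree_a(x1, x2):
    # splits on x1 first
    if x1:
        return 1
    return 1 if x2 else 0

def tree_b(x1, x2):
    # splits on x2 first; different structure, same decision boundary
    if x2:
        return 1
    return 1 if x1 else 0

def boolean_form(x1, x2):
    # a single Boolean representation, x1 OR x2, shared by both trees
    return 1 if (x1 or x2) else 0
```

A split-based variable importance would credit x1 more in `tree_a` and x2 more in `tree_b`, and if x1 is missing, `tree_a` stalls at its root while `tree_b` may still decide when x2 = 1: precisely the ambiguities a canonical Boolean representation removes.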
Paperid:2704
Authors:Jiaxing Cui · Wei-Lin Chiang · Ion Stoica · Cho-Jui Hsieh
Abstract:
Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation, the enhanced safety often comes with the side effect of over-refusal, where LLMs may reject innocuous prompts and become less helpful. Although the issue of over-refusal has been empirically observed, systematic measurement is challenging due to the difficulty of crafting prompts that can elicit the over-refusal behaviors of LLMs. This study proposes a novel method for automatically generating large-scale over-refusal datasets. Leveraging this technique, we introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 over-refusal prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses. We then conduct a comprehensive study measuring the over-refusal of 32 popular LLMs across 8 model families. We hope this benchmark can help the community develop better safety-aligned models.
Paperid:2705
Authors:David Clancy · Hanbaek Lyu · Sebastien Roch
Abstract:
We consider the branch-length estimation problem on a bifurcating tree: a character evolves along the edges of a binary tree according to a two-state symmetric Markov process, and we seek to recover the edge transition probabilities from repeated observations of the leaves. This problem arises in phylogenetics and is related to latent tree graphical model inference. In general, the log-likelihood function is non-concave and admits exponentially many critical points in the size of the tree. However, simple coordinate maximization has been known to perform very well in practice, defying the complexity of the likelihood landscape. In this work, we provide the first theoretical guarantee as to why this can be the case. We show that, provided polynomially many samples $m$, there exists a universal parameter regime (independent of the size and the topology of the tree) where the log-likelihood function is strongly concave and smooth with high probability. On this high-probability likelihood landscape event, we show that the standard coordinate maximization algorithm converges exponentially fast to the maximum likelihood estimator, which is within $O(1/\sqrt{m})$ of the true parameter.
Paperid:2706
Authors:Ruofeng Yang · Bo Jiang · Cheng Chen · Shuai Li
Abstract:
Consistency models, a new class of one-step generative models, have shown competitive performance with multi-step diffusion models. The most challenging part of consistency models is the training process, which discretizes the continuous diffusion process into $K$ steps and trains a one-step mapping function on these discretized timepoints. Despite the empirical success, only a few works focus on the discretization complexity $K$, and their setting is far from that of empirical works. More specifically, current theoretical works analyze the variance preserving (VP) diffusion process with a uniform stepsize, while empirical works adopt a variance exploding (VE) process with a decaying discretization stepsize. As a result, these works suffer from large discretization complexity and fail to explain the empirical success of consistency models. To close the gap between theory and application, we analyze consistency models with (1) a VE process and (2) a decaying stepsize, and prove a state-of-the-art discretization complexity for consistency models. This result is competitive with the results for diffusion models and shows the potential of consistency models. To balance computation and performance, \citet{song2023consistency} further propose a $2$-step consistency algorithm, which is widely adopted by empirical works. In this work, we also analyze the role of $2$-step sampling and show that it improves the discretization complexity compared with one-step generation.
Paperid:2707
Authors:Haofei Yu · Zhaochen Hong · Zirui Cheng · Kunlun Zhu · Keyang Xuan · Jinwei Yao · Tao Feng · Jiaxuan You
Abstract:
Large Language Models (LLMs) have demonstrated remarkable potential in scientific domains, yet a fundamental question remains unanswered: Can we simulate human research communities with LLMs? Addressing this question can deepen our understanding of the processes behind idea brainstorming and inspire the automatic discovery of novel scientific insights. In this work, we propose ResearchTown, a multi-agent framework for research community simulation. Within this framework, the human research community is simplified as an agent-data graph, where researchers and papers are represented as agent-type and data-type nodes, respectively, and connected based on their collaboration relationships. We also introduce TextGNN, a text-based inference framework that models various research activities (e.g., paper reading, paper writing, and review writing) as special forms of a unified message-passing process on the agent-data graph. To evaluate the quality of the research simulation, we present ResearchBench, a benchmark that uses a node-masking prediction task for scalable and objective assessment based on similarity. Our experiments reveal three key findings: (1) ResearchTown can provide a realistic simulation of collaborative research activities, including paper writing and review writing; (2) ResearchTown can maintain robust simulation with multiple researchers and diverse papers; (3) ResearchTown can generate interdisciplinary research ideas that potentially inspire pioneering research directions.
Paperid:2708
Authors:Shuhai Zhang · Zeng You · Yaofo Chen · Zhiquan Wen · Qianyue Wang · Zhijie Qiu · Yuanqing Li · Mingkui Tan
Abstract:
Transformer-based large language models (LLMs) excel in natural language processing tasks by capturing long-range dependencies through self-attention mechanisms. However, long-context modeling faces significant computational inefficiencies due to redundant attention computations: while attention weights are often sparse, all tokens consume equal computational resources. In this paper, we reformulate traditional probabilistic sequence modeling as a supervised learning task, enabling the separation of relevant and irrelevant tokens and providing a clearer understanding of redundancy. Based on this reformulation, we theoretically analyze attention sparsity, revealing that only a few tokens significantly contribute to predictions. Building on this, we formulate attention optimization as a linear coding problem and propose a group coding strategy, theoretically showing its ability to improve robustness against random noise and enhance learning efficiency. Motivated by this, we propose Dynamic Group Attention (DGA), which leverages the group coding to explicitly reduce redundancy by aggregating less important tokens during attention computation. Empirical results show that our DGA significantly reduces computational costs while maintaining competitive performance.
Paperid:2709
Authors:Freddie Bickford Smith · Jannik Kossen · Eleanor Trollope · Mark van der Wilk · Adam Foster · Tom Rainforth
Abstract:
The ideas of aleatoric and epistemic uncertainty are widely used to reason about the probabilistic predictions of machine-learning models. We identify incoherence in existing discussions of these ideas and suggest this stems from the aleatoric-epistemic view being insufficiently expressive to capture all of the distinct quantities that researchers are interested in. To address this, we present a decision-theoretic view on prediction, from which we derive rigorous notions of model-based uncertainty and statistical dispersion in data. Using this in place of the aleatoric-epistemic view could produce clearer discourse as the field moves forward. On top of this, we provide insights into popular information-theoretic quantities, showing they can be poor estimators of what they are often purported to measure, while also explaining how they can still be useful in guiding data acquisition.
Paperid:2710
Authors:Peer Nagy · Sascha Frey · Kang Li · Bidipta Sarkar · Svitlana Vyetrenko · Stefan Zohren · Anisoara Calinescu · Jakob Foerster
Abstract:
While financial data presents one of the most challenging and interesting sequence modelling tasks due to high noise, heavy tails, and strategic interactions, progress in this area has been hindered by the lack of consensus on quantitative evaluation paradigms. To address this, we present LOB-Bench, a benchmark implemented in Python, designed to evaluate the quality and realism of generative message-by-order data for limit order books (LOB) in the LOBSTER format. Our framework measures distributional differences in conditional and unconditional statistics between generated and real LOB data, supporting flexible multivariate statistical evaluation. The benchmark also includes commonly used LOB statistics such as spread, order book volumes, order imbalance, and message inter-arrival times, along with scores from a trained discriminator network. Lastly, LOB-Bench contains "market impact metrics", i.e., the cross-correlations and price response functions for specific events in the data. We benchmark generative autoregressive state-space models, a (C)GAN, and a parametric LOB model, and find that the autoregressive GenAI approach beats the traditional model classes.
Paperid:2711
Authors:Jongmin Lee · Mario Bravo · Roberto Cominetti
Abstract:
We study a new model-free algorithm to compute $\varepsilon$-optimal policies for average reward Markov decision processes, in the weakly communicating case. Given a generative model, our procedure combines a recursive sampling technique with Halpern's anchored iteration, and computes an $\varepsilon$-optimal policy with sample and time complexity $\widetilde{O}(|\mathcal{S}||\mathcal{A}|||h^*||^{2}/\varepsilon^{2})$ both in high probability and in expectation. To our knowledge, this is the best complexity among model-free algorithms, matching the known lower bound up to a factor of $||h^*||$. Although the complexity bound involves the span seminorm $||h^*||$ of the unknown bias vector, the algorithm requires no prior knowledge and implements a stopping rule which guarantees with probability 1 that the procedure terminates in finite time. We also analyze how these techniques can be adapted to discounted MDPs.
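The anchoring idea can be illustrated on a generic fixed-point map: Halpern's iteration averages each update with the initial point using a vanishing weight, which gives nonasymptotic convergence for nonexpansive operators. Below is a one-dimensional sketch under that generic setting; the paper applies the scheme to Bellman-type operators with sampled transitions, which this sketch does not model.

```python
def halpern(T, x0, iters=500):
    """Halpern's anchored iteration: x_{k+1} = b_k * x0 + (1 - b_k) * T(x_k),
    with anchoring weight b_k = 1/(k+1) shrinking over time."""
    x = x0
    for k in range(1, iters + 1):
        beta = 1.0 / (k + 1)        # weight on the anchor x0
        x = beta * x0 + (1.0 - beta) * T(x)
    return x
```

For the contraction T(x) = 0.5x + 1, with fixed point 2, the iterates approach 2 from the anchor at an O(1/k) rate, the same worst-case rate that holds for merely nonexpansive maps.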
Paperid:2712
Authors:Łukasz Struski · Michal B Bednarczyk · Igor Podolak · Jacek Tabor
Abstract:
We present a novel technique for constructing differentiable order-type operations, including soft ranking, soft top-k selection, and soft permutations. Our approach leverages an efficient closed-form formula for the inverse of the function LapSum, defined as the sum of Laplace distributions. This formulation ensures low computational and memory complexity in selecting the highest activations, enabling losses and gradients to be computed in $O(n \log n)$ time. Through extensive experiments, we demonstrate that our method outperforms state-of-the-art techniques for high-dimensional vectors and large $k$ values. Furthermore, we provide efficient implementations for both CPU and CUDA environments, underscoring the practicality and scalability of our method for large-scale ranking and differentiable ordering problems.
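To give a flavor of the construction, a soft rank can be built by summing Laplace CDFs centered at the inputs; LapSum's contribution is a closed-form inverse of such a sum, which the illustrative sketch below does not implement.

```python
import math

def laplace_cdf(t, mu, b):
    """CDF of a Laplace distribution with location mu and scale b."""
    z = (t - mu) / b
    return 0.5 * math.exp(z) if z < 0 else 1.0 - 0.5 * math.exp(-z)

def soft_ranks(xs, b=0.1):
    """Soft rank of each entry: a smoothed count of entries lying below it.
    As b -> 0 this recovers the hard ranks (up to a 0.5 offset)."""
    return [sum(laplace_cdf(x, y, b) for y in xs) for x in xs]
```

The soft ranks preserve the ordering of the inputs while remaining differentiable in them, which is what makes ranking-based losses trainable by gradient descent.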
Paperid:2713
Authors:Xu Zhang · Haoye Qiu · Weixuan Liang · Hui LIU · Junhui Hou · Yuheng Jia
Abstract:
Ensemble clustering has demonstrated significant success in practical applications; however, its theoretical foundations remain underexplored. This paper examines the generalization performance of ensemble clustering, focusing on generalization error, excess risk and consistency. We derive convergence rates for both the generalization error bound and the excess risk bound of $\mathcal{O}(\sqrt{\frac{\log n}{m}}+\frac{1}{\sqrt{n}})$, with $n$ and $m$ being the numbers of samples and base clusterings. Based on this, we prove that when $m$ and $n$ approach infinity and $m$ is significantly larger than $\log n$, i.e., $m,n\to \infty, m\gg \log n$, ensemble clustering is consistent. Furthermore, recognizing that $n$ and $m$ are finite in practice, the generalization error cannot be reduced to zero. Thus, by assigning varying weights to these finite clusterings, we minimize the error between the empirical average of clusterings and the expectation of clustering. From this, we theoretically demonstrate that to achieve better clustering performance, one should minimize the deviation (bias) of each base clustering from its expectation and maximize the differences (diversity) among the base clusterings. Additionally, we show that maximizing diversity is nearly equivalent to a robust (min-max) optimization model. Finally, we instantiate our theory to develop a new ensemble clustering algorithm. Compared with SOTA methods, our approach achieves average improvements of 6.1%, 7.3%, and 6.0% on 10 datasets w.r.t. the NMI, ARI, and Purity metrics. **The code is included in the Supplementary Material**.
Paperid:2714
Authors:Bishoy Galoaa · Somaieh Amraee · Sarah Ostadabbas
Abstract:
This paper introduces MOTE (MOre Than meets the Eye), a novel multi-object tracking (MOT) algorithm designed to address the challenges of tracking occluded objects. By integrating deformable detection transformers with a custom disocclusion matrix, MOTE significantly enhances the ability to track objects even when they are temporarily hidden from view. The algorithm leverages optical flow to generate features that are processed through a softmax splatting layer, which aids in the creation of the disocclusion matrix. This matrix plays a crucial role in maintaining track consistency by estimating the motion of occluded objects. MOTE's architecture includes modifications to the enhanced track embedding module (ETEM), which allows it to incorporate these advanced features into the track query layer embeddings. This integration ensures that the model not only tracks visible objects but also accurately predicts the trajectories of occluded ones, much like the human visual system. The proposed method is evaluated on multiple datasets, including MOT17, MOT20, and DanceTrack, where it achieves impressive tracking metrics: 82.0 MOTA and 66.3 HOTA on MOT17, 81.7 MOTA and 65.8 HOTA on MOT20, and 93.2 MOTA and 74.2 HOTA on DanceTrack. Notably, MOTE excels in reducing identity switches and maintaining consistent tracking in complex real-world scenarios with frequent occlusions, outperforming existing state-of-the-art methods across all tested benchmarks.
Paperid:2715
Authors:Han Li · Fei Liu · Zhi Zheng · Yu Zhang · Zhenkun Wang
Abstract:
Vehicle routing problems (VRPs) are significant combinatorial optimization (CO) problems of substantial practical importance. Recently, neural combinatorial optimization (NCO), which involves training deep learning models on extensive data to learn vehicle routing heuristics, has emerged as a promising approach due to its efficiency and the reduced need for manual algorithm design. Current cross-problem NCO methods for VRPs typically employ a constraint-unaware model, limiting their cross-problem performance. Furthermore, they rely solely on global connectivity, which fails to focus on key nodes and leads to inefficient representation learning. This paper introduces a Constraint-Aware Dual-Attention Model (CaDA), designed to address these limitations. CaDA incorporates a constraint prompt that efficiently represents different problem variants. Additionally, it features a dual-attention mechanism with a global branch for capturing broader graph-wide information and a sparse branch that selectively focuses on key node connections. We comprehensively evaluate our model on 16 different VRPs and compare its performance against existing cross-problem VRP solvers. CaDA achieves state-of-the-art results across all tested VRPs. Our ablation study confirms that each component contributes to its cross-problem learning performance.
Paperid:2716
Authors:Ziyu Ye · Rishabh Agarwal · Tianqi Liu · Rishabh Joshi · Sarmishta Velury · Quoc Le · Qijun Tan · Yuan Liu
Abstract:
Existing RL post-training for large language models (LLMs) relies on a pre-curated, static prompt distribution, which bottlenecks scalability. Prior works have explored prompt evolving, but are often limited to the SFT stage, where prompts are sampled and evolved uniformly, without effective signals. This empirical work presents a paradigm shift: Evolving Alignment via Asymmetric Self-Play (eva), which casts post-training as an infinite game with reward-based regret objectives for 2 players: (i) a creator, who strategically samples and creates new informative prompts, and (ii) a solver, who learns to produce preferred responses. eva is the first technique that integrates seamlessly in both online and offline RL post-training, simply built with a prioritized generative buffer. The design is easy to use yet remarkably effective: eva sets a new SOTA on challenging benchmarks without any extra human prompts, e.g., it boosts the win rate of gemma-2-9b-it on Arena-Hard from 51.6% to 60.1% for DPO and from 52.6% to 62.4% for RLOO, surpassing claude-3-opus and catching up to gemini-1.5-pro, both of which are orders of magnitude larger. Extensive experiments show eva can create effective RL curricula and is robust across ablations. We believe adaptively evolving prompts are key to designing next-generation RL post-training at scale.
Paperid:2717
Authors:Chao Ma · Wenbo Gong · Meyer Scetbon · Edward Meeds
Abstract:
Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the success of large language models. However, they often require maintaining optimizer states throughout training, which can result in memory requirements several times greater than the model footprint. This overhead imposes constraints on scalability and computational efficiency. Stochastic Gradient Descent (SGD), in contrast, is a stateless optimizer, as it does not track state variables during training. Consequently, it achieves optimal memory efficiency. However, its capability in LLM training is limited (Zhao et al., 2024b). In this work, we show that preprocessing SGD in a stateless manner can achieve the same performance as the Adam optimizer for LLM training, while drastically reducing the memory cost. Specifically, we propose to pre-process the instantaneous stochastic gradients using normalization and whitening. We show that normalization stabilizes gradient distributions, and whitening counteracts the local curvature of the loss landscape. This results in SWAN (SGD with Whitening And Normalization), a stochastic optimizer that eliminates the need to store any optimizer states. Empirically, SWAN has the same memory footprint as SGD, achieving a ≈50% reduction in total end-to-end memory compared to Adam. In language modeling tasks, SWAN demonstrates comparable or even better performance than Adam: when pre-training LLaMA models with 350M and 1.3B parameters, SWAN reaches the same evaluation perplexity using half as many tokens (2x speedup).
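A minimal sketch of the two preprocessing steps, stated generically: row normalization followed by whitening via an eigendecomposition of the row Gram matrix. The paper's exact procedure may differ, for instance in how the inverse square root is computed.

```python
import numpy as np

def preprocess_gradient(grad, eps=1e-8):
    """Stateless gradient preprocessing in the spirit of SWAN:
    (1) normalize rows to stabilize the gradient scale,
    (2) whiten so the result has identity row covariance."""
    g = grad / (np.linalg.norm(grad, axis=1, keepdims=True) + eps)
    gram = g @ g.T                       # row Gram matrix G G^T
    w, v = np.linalg.eigh(gram)          # symmetric eigendecomposition
    inv_sqrt = v @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ v.T
    return inv_sqrt @ g                  # (G G^T)^{-1/2} G
```

Because both steps are recomputed from the current gradient alone, nothing persists between iterations: unlike Adam's first and second moment buffers, no optimizer state is stored.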
Paperid:2718
Authors:Joe Suk · Jung-hun Kim
Abstract:
We study an infinite-armed bandit problem where actions' mean rewards are initially sampled from a reservoir distribution. Most prior works in this setting focused on stationary rewards (Berry et al., 1997; Wang et al., 2008; Bonald and Proutiere, 2013; Carpentier and Valko, 2015), with the more challenging adversarial/non-stationary variant only recently studied in the context of rotting/decreasing rewards (Kim et al., 2022; 2024). Furthermore, optimal regret upper bounds were only achieved using parameter knowledge of the non-stationarity and were only known for certain regimes of regularity of the reservoir. This work shows the first parameter-free optimal regret bounds for all regimes while also relaxing distributional assumptions on the reservoir. We first introduce a black-box scheme to convert a finite-armed MAB algorithm designed for near-stationary environments into a parameter-free algorithm for the infinite-armed non-stationary problem with optimal regret guarantees. We next study a natural notion of significant shift for this problem, inspired by recent developments in finite-armed MAB (Suk & Kpotufe, 2022). We show that tighter regret bounds in terms of significant shifts can be adaptively attained by employing a randomized variant of elimination within our black-box scheme. Our enhanced rates only depend on the rotting non-stationarity and thus exhibit an interesting phenomenon for this problem where rising rewards do not factor into the difficulty of non-stationarity.
Paperid:2719
Authors:Alina Shutova · Vladimir Malinovskii · Ivan Ermakov · Surkov Nikita · Denis Mazur · Denis Kuznedelev · Vage Egiazarian · Dan Alistarh
Abstract:
Efficient real-world deployments of large language models (LLMs) rely on Key-Value (KV) caching for processing and generating long outputs, reducing the need for repetitive computation. For large contexts, Key-Value caches can take up tens of gigabytes of device memory, as they store vector representations for each token and layer. Recent work has shown that the cached vectors can be compressed through quantization, pruning or merging, but these techniques often compromise quality at higher compression rates. In this work, we aim to improve Key and Value compression by exploiting two observations: 1) the inherent dependencies between keys and values across different layers, and 2) high-compression mechanisms for internal network states. We propose AQUA-KV, an adaptive quantization scheme for Key-Value caches that relies on compact adapters to exploit existing dependencies between Keys and Values, and aims to ``optimally'' compress the information that cannot be predicted. AQUA-KV significantly improves compression rates while maintaining high accuracy on state-of-the-art LLM families. On Llama 3.2 LLMs, we achieve near-lossless inference at 2-2.5 bits per value with under $1\%$ relative error in perplexity and LongBench scores. AQUA-KV is one-shot, simple, and efficient: it can be calibrated on a single GPU within 1-6 hours, even for 70B models.
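For context, the generic per-channel quantize/dequantize baseline that such methods build on looks roughly as follows; this is a sketch of uniform affine quantization, not AQUA-KV's adapter-based scheme.

```python
import numpy as np

def quantize_dequantize(x, bits):
    """Uniform affine quantization per row (e.g., per channel of a cached
    Key or Value matrix), followed by dequantization."""
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    levels = 2 ** bits - 1
    scale = np.maximum((hi - lo) / levels, 1e-12)
    q = np.round((x - lo) / scale)       # integer codes in [0, levels]
    return q * scale + lo                # reconstructed values
```

At 2 bits per value, the reconstruction error of this naive baseline is substantial, which is why methods in this space additionally exploit structure, such as the inter-layer Key/Value dependencies described above, rather than quantizing each vector independently.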
Paperid:2720
Authors:Gianluigi Silvestri · Luca Ambrogioni · Chieh-Hsin Lai · Yuhta Takida · Yuki Mitsufuji
Abstract:
Consistency Training (CT) has recently emerged as a promising alternative to diffusion models, achieving competitive performance in image generation tasks. However, non-distillation consistency training often suffers from high variance and instability, and analyzing and improving its training dynamics is an active area of research. In this work, we propose a novel CT training approach based on the Flow Matching framework. Our main contribution is a trained noise-coupling scheme inspired by the architecture of Variational Autoencoders (VAEs). By training a data-dependent noise emission model, implemented as an encoder architecture, our method can indirectly learn the geometry of the noise-to-data mapping, which is instead fixed by the choice of the forward process in classical CT. Empirical results across diverse image datasets show significant generative improvements, with our model outperforming baselines and achieving the state-of-the-art (SoTA) non-distillation CT FID on CIFAR-10, and attaining an FID on par with SoTA on ImageNet at 64x64 resolution in 2-step generation.
Paperid:2721
Authors:Kuan Li · Liwen Zhang · Yong Jiang · Pengjun Xie · Fei Huang · Shuai Wang · Minhao Cheng
Abstract:
Effectively incorporating external knowledge into Large Language Models (LLMs) is crucial for enhancing their capabilities and addressing real-world needs. Retrieval-Augmented Generation (RAG) offers an effective method for achieving this by retrieving the most relevant fragments into LLMs. However, the advancements in context window size for LLMs offer an alternative approach, raising the question of whether RAG remains necessary for effectively handling external knowledge. Several existing studies provide inconclusive comparisons between RAG and long-context (LC) LLMs, largely due to limitations in benchmark designs. In this paper, we present LaRA, a novel benchmark specifically designed to rigorously compare RAG and LC LLMs. LaRA encompasses 2,326 test cases across four practical QA task categories and three types of naturally occurring long texts. Through systematic evaluation of seven open-source and four proprietary LLMs, we find that the optimal choice between RAG and LC depends on a complex interplay of factors, including the model's parameter size, long-text capabilities, context length, task type, and the characteristics of the retrieved chunks. Our findings provide actionable guidelines for practitioners to effectively leverage both RAG and LC approaches in developing and deploying LLM applications.
Paperid:2722
Authors:Zhongnian Li · Jinghao Xu · Peng Ying · Meng Wei · Xinzheng Xu
Abstract:
Pretrained Vision-Language Models (VLMs) exhibit strong zero-shot classification abilities, demonstrating great potential for generating weakly supervised labels. Unfortunately, existing weakly supervised learning methods lack the ability to generate reliable labels via VLMs. In this paper, we propose a novel weakly supervised labeling setting, namely True-False Labels (TFLs), which can achieve high accuracy when generated by VLMs. The TFL indicates whether an instance belongs to the label, which is randomly and uniformly sampled from the candidate label set. Specifically, we theoretically derive a risk-consistent estimator to explore and utilize the conditional probability distribution information of TFLs. Besides, we propose a convolution-based Multi-modal Prompt Retrieving (MRP) method to bridge the gap between the knowledge of VLMs and target learning tasks. Experimental results demonstrate the effectiveness of the proposed TFL setting and MRP learning method. Our code will be available soon.
Paperid:2723
Authors:Keyue Qiu · Yuxuan Song · Jie Yu · Hongbo Ma · Ziyao Cao · Zhilong Zhang · Yushuai Wu · Mingyue Zheng · Hao Zhou · Wei-Ying Ma
Abstract:
Structure-based molecule optimization (SBMO) aims to optimize molecules with both continuous coordinates and discrete types against protein targets. A promising direction is to exert gradient guidance on generative models given its remarkable success in images, but it is challenging to guide discrete data and risks inconsistencies between modalities. To this end, we leverage a continuous and differentiable space derived through Bayesian inference, presenting Molecule Joint Optimization (MolJO), a gradient-based SBMO framework that facilitates joint guidance signals across different modalities while preserving SE(3)-equivariance. We introduce a novel backward correction strategy that optimizes within a sliding window of past histories, allowing for a seamless trade-off between exploration and exploitation during optimization. MolJO achieves state-of-the-art performance on the CrossDocked2020 benchmark (Success Rate 51.3\%, Vina Dock -9.05 and SA 0.78), more than a 4x improvement in Success Rate compared to its gradient-based counterpart, and a 2x ``Me-Better'' Ratio relative to 3D baselines. Furthermore, we extend MolJO to a wide range of optimization settings, including multi-objective optimization and challenging tasks in drug design such as R-group optimization and scaffold hopping, further underscoring its versatility.
Paperid:2724
Authors:Xingpei Ma · Jiaran Cai · Yuansheng Guan · Shenneng Huang · Qiang Zhang · Shunsi Zhang
Abstract:
Recent diffusion-based talking face generation models have demonstrated impressive potential in synthesizing videos that accurately match a speech audio clip with a given reference identity. However, existing approaches still encounter significant challenges due to uncontrollable factors, such as inaccurate lip-sync, inappropriate head posture and the lack of fine-grained control over facial expressions. In order to introduce more face-guided conditions beyond speech audio clips, a novel two-stage training framework Playmate is proposed to generate more lifelike facial expressions and talking faces. In the first stage, we introduce a decoupled implicit 3D representation along with a meticulously designed motion-decoupled module to facilitate more accurate attribute disentanglement and generate expressive talking videos directly from audio cues. Then, in the second stage, we introduce an emotion-control module to encode emotion control information into the latent space, enabling fine-grained control over emotions and thereby achieving the ability to generate talking videos with desired emotion. Extensive experiments demonstrate that Playmate outperforms existing state-of-the-art methods in terms of video quality and lip-synchronization, and improves flexibility in controlling emotion and head pose. The code will be available at https://playmate111.github.io.
Paperid:2725
Authors:Xinyang Li · Siqi Liu · Bochao Zou · Jiansheng Chen · Huimin Ma
Abstract:
As large language models evolve, there is growing anticipation that they will emulate human-like Theory of Mind (ToM) to assist with routine tasks. However, existing methods for evaluating machine ToM focus primarily on unimodal models and largely treat these models as black boxes, lacking an interpretative exploration of their internal mechanisms. In response, this study adopts an approach based on internal mechanisms to provide an interpretability-driven assessment of ToM in multimodal large language models (MLLMs). Specifically, we first construct a multimodal ToM test dataset, GridToM, which incorporates diverse belief testing tasks and perceptual information from multiple perspectives. Next, our analysis shows that attention heads in multimodal large models can distinguish cognitive information across perspectives, providing evidence of ToM capabilities. Furthermore, we present a lightweight, training-free approach that significantly enhances the model's exhibited ToM by adjusting model activations along the directions identified by these attention heads.
Paperid:2726
Authors:Taehyun Cho · Seungyub Han · Seokhun Ju · Dohyeong Kim · Kyungjae Lee · Jungwoo Lee
Abstract:
Distributional reinforcement learning improves performance by capturing environmental stochasticity, but a comprehensive theoretical understanding of its effectiveness remains elusive. In addition, the intractable element of the infinite dimensionality of distributions has been overlooked. In this paper, we present a regret analysis of distributional reinforcement learning with general value function approximation in a finite episodic Markov decision process setting. We first introduce a key notion of Bellman unbiasedness which is essential for exactly learnable and provably efficient distributional updates in an online manner. Among all types of statistical functionals for representing infinite-dimensional return distributions, our theoretical results demonstrate that only moment functionals can exactly capture the statistical information. Secondly, we propose a provably efficient algorithm, SF-LSVI, that achieves a tight regret bound of $\tilde{O}(d_E H^{\frac{3}{2}}\sqrt{K})$ where $H$ is the horizon, $K$ is the number of episodes, and $d_E$ is the eluder dimension of a function class.
Paperid:2727
Authors:Zihao Wang · Yibo Jiang · Jiahao Yu · Heqing Huang
Abstract:
Large language models (LLMs) that integrate multiple input roles (e.g., system instructions, user queries, external tool outputs) are increasingly prevalent in practice. Ensuring that the model accurately distinguishes messages from each role—a concept we call role separation—is crucial for consistent multi-role behavior. Although recent work often targets state-of-the-art prompt injection defenses, it remains unclear whether such methods truly teach LLMs to differentiate roles or merely memorize known triggers. In this paper, we examine role-separation learning: the process of teaching LLMs to robustly distinguish system and user tokens. Through a simple, controlled experimental framework, we find that fine-tuned models often rely on two proxies for role identification: (1) task type exploitation, and (2) proximity to begin-of-text. Although data augmentation can partially mitigate these shortcuts, it generally leads to iterative patching rather than a deeper fix. To address this, we propose reinforcing invariant signals that mark role boundaries by adjusting token-wise cues in the model's input encoding. In particular, modifying position IDs helps the model learn clearer distinctions and reduces reliance on superficial proxies. By focusing on this mechanism-centered perspective, our work illuminates how LLMs can more reliably maintain consistent multi-role behavior without merely memorizing known prompts or triggers.
Paperid:2728
Authors:Omayma Mahjoub · Sasha Abramowitz · Ruan de Kock · Wiem Khlifi · Simon Du Toit · Jemma Daniel · Louay Nessir · Juan Formanek · Louise Beyers · Liam Clark · Arnu Pretorius
Abstract:
As multi-agent reinforcement learning (MARL) progresses towards solving larger and more complex problems, it becomes increasingly important that algorithms exhibit the key properties of (1) strong performance, (2) memory efficiency and (3) scalability. In this work, we introduce Sable, a performant, memory efficient and scalable sequence modeling approach to MARL. Sable works by adapting the retention mechanism in Retentive Networks (Sun et al., 2023) to achieve computationally efficient processing of multi-agent observations with long context memory for temporal reasoning. Through extensive evaluations across six diverse environments, we demonstrate how Sable is able to significantly outperform existing state-of-the-art methods in a large number of diverse tasks (34 out of 45 tested). Furthermore, Sable maintains performance as we scale the number of agents, handling environments with more than a thousand agents while exhibiting a linear increase in memory usage. Finally, we conduct ablation studies to isolate the source of Sable's performance gains and confirm its efficient computational memory usage. All experimental data and code is available at: https://sites.google.com/view/sable-marl.
Paperid:2729
Authors:Mohamad Al Ahdab · john leth · Zheng-Hua Tan
Abstract:
We study the Continuous-Discrete Kalman Filter (CD-KF) in Bayesian State-Space Models (SSMs) where continuous-time dynamics are observed via multiple sensors with discrete, irregularly timed measurements. Our focus extends to scenarios in which the measurement process is coupled with the states of an auxiliary state-space model. For instance, higher measurement rates may increase energy consumption or heat generation, while a sensor's accuracy can depend on its own spatial trajectory or that of the measured target. Each sensor thus carries distinct costs and constraints associated with its measurement rate, as well as additional constraints and costs on the auxiliary state. We model measurement occurrences as independent Poisson processes with sensor-specific rates and derive an upper bound on the mean posterior covariance matrix of the CD-KF, which is continuously differentiable with respect to these rates. This differentiability enables efficient gradient-based optimization. Exploiting this bound, we propose a finite-horizon optimal control framework to jointly optimize measurement rates, auxiliary-state dynamics, and covariance constraints. We further introduce a deterministic method for scheduling actual measurement times from the optimized rates. Empirical results in state-space filtering and dynamic temporal Gaussian process regression demonstrate that our approach achieves improved trade-offs between resource usage and estimation accuracy, underscoring its practical effectiveness.
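The abstract does not specify its deterministic scheduling rule, but one common heuristic for turning an optimized rate into concrete measurement times is to fire a measurement each time the integrated intensity reaches one, matching the expected count of a Poisson process with the same rate. A minimal sketch under that assumption (`schedule_times` and `rate_fn` are illustrative names, not the paper's API):

```python
def schedule_times(rate_fn, horizon, dt=1e-3):
    """Place measurement times deterministically so that the integrated
    intensity between consecutive measurements equals one."""
    times, acc, t = [], 0.0, 0.0
    while t < horizon:
        acc += rate_fn(t) * dt  # forward-Euler integral of the rate
        if acc >= 1.0:
            times.append(t)
            acc -= 1.0
        t += dt
    return times

# A constant rate of 2 Hz over 5 seconds yields 10 roughly even measurements
print(len(schedule_times(lambda t: 2.0, 5.0)))  # 10
```

A time-varying `rate_fn` concentrates measurements where the optimized rate is high, which is the behavior a rate-to-schedule conversion needs.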
Paperid:2730
Authors:Zhihan Zhu · Yanhao Zhang · Yong Xia
Abstract:
This paper introduces two novel criteria: one for feature selection and another for feature elimination in the context of best subset selection, which is a benchmark problem in statistics and machine learning. From the perspective of optimization, we revisit the classical selection and elimination criteria in traditional best subset selection algorithms, revealing that these classical criteria capture only partial variations of the objective function after the entry or exit of features. By formulating and solving optimization subproblems for feature entry and exit exactly, new selection and elimination criteria are proposed, proven to be the optimal decisions for the current entry-and-exit process compared to classical criteria. Replacing the classical selection and elimination criteria with the proposed ones generates a series of enhanced best subset selection algorithms. These generated algorithms not only preserve the theoretical properties of the original algorithms but also achieve significant meta-gains without increasing computational cost across various scenarios and evaluation metrics on multiple tasks such as compressed sensing and sparse regression.
Paperid:2731
Authors:Rafał Karczewski · Markus Heinonen · Vikas Garg
Abstract:
Diffusion models have emerged as a powerful class of generative models, capable of producing high-quality images by mapping noise to a data distribution. However, recent findings suggest that image likelihood does not align with perceptual quality: high-likelihood samples tend to be smooth, while lower-likelihood ones are more detailed. Controlling sample density is thus crucial for balancing realism and detail. In this paper, we analyze an existing technique, Prior Guidance, which scales the latent code to influence image detail. We introduce score alignment, a condition that explains why this method works and show that it can be tractably checked for any continuous normalizing flow model. We then propose Density Guidance, a principled modification of the generative ODE that enables exact log-density control during sampling. Finally, we extend Density Guidance to stochastic sampling, ensuring precise log-density control while allowing controlled variation in structure or fine details. Our experiments demonstrate that these techniques provide fine-grained control over image detail without compromising sample quality.
Paperid:2732
Authors:Yuecheng Li · Lele Fu · Tong Wang · Jian Lou · Bin Chen · Lei Yang · Jian Shen · Zibin Zheng · Chuan Chen
Abstract:
To defend against privacy leakage of user data, differential privacy is widely used in federated learning, but it is not free. The addition of noise randomly disrupts the semantic integrity of the model and this disturbance accumulates with increased communication rounds. In this paper, we introduce a novel federated learning framework with rigorous privacy guarantees, named **FedCEO**, designed to strike a trade-off between model utility and user privacy by letting clients "***C**ollaborate with **E**ach **O**ther*". Specifically, we perform efficient tensor low-rank proximal optimization on stacked local model parameters at the server, demonstrating its capability to flexibly truncate high-frequency components in spectral space. This capability implies that our FedCEO can effectively recover the disrupted semantic information by smoothing the global semantic space for different privacy settings and continuous training processes. Moreover, we improve the SOTA utility-privacy trade-off bound by an order of $\sqrt{d}$, where $d$ is the input dimension. We illustrate our theoretical results with experiments on representative image datasets and observe significant performance improvements and strict privacy guarantees under different privacy settings. Our **code** is available in the supplementary material.
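The low-rank proximal step described above can be illustrated, in simplified matrix (rather than tensor) form, by the standard proximal operator of the nuclear norm: stack the noisy client parameters, soft-threshold the singular values, and the small spectral components that DP noise tends to populate are truncated away. The names and the matrix simplification below are assumptions; the paper's exact tensor operator differs.

```python
import numpy as np

def low_rank_smooth(stacked_params, tau):
    """Proximal step of the nuclear norm: soft-threshold singular values,
    truncating the small (high-frequency) spectral components that
    additive noise tends to populate."""
    U, s, Vt = np.linalg.svd(stacked_params, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(0)
# Clients share one signal direction; per-client noise perturbs each copy
clean = np.outer(np.ones(8), rng.normal(size=32))
noisy = clean + 0.1 * rng.normal(size=clean.shape)
smoothed = low_rank_smooth(noisy, tau=1.0)
print(np.linalg.matrix_rank(smoothed) < np.linalg.matrix_rank(noisy))  # True
```

Larger `tau` (here a hypothetical smoothing strength) truncates more aggressively, mirroring the trade-off between noise removal and preserved client heterogeneity.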
Paperid:2733
Authors:Kayo Yin · Jacob Steinhardt
Abstract:
Large language models (LLMs) exhibit impressive in-context learning (ICL) capability, enabling them to generate relevant responses from a handful of task demonstrations in the prompt. Prior studies have suggested two different explanations for the mechanisms behind ICL: induction heads that find and copy relevant tokens, and function vector (FV) heads whose activations compute a latent encoding of the ICL task. To better understand which of the two distinct mechanisms drives ICL, we study and compare induction heads and FV heads in 12 language models. Through detailed ablations, we find that few-shot ICL is driven primarily by FV heads, especially in larger models. We also find that FV and induction heads are connected: many FV heads start as induction heads during training before transitioning to the FV mechanism. This leads us to speculate that induction facilitates learning the more complex FV mechanism for ICL.
Paperid:2734
Authors:Nicholas Goldowsky-Dill · Bilal Chughtai · Stefan Heimersheim · Marius Hobbhahn
Abstract:
AI models might use deceptive strategies as part of scheming or misaligned behaviour. Monitoring outputs alone is insufficient, since the AI might produce seemingly benign outputs while their internal reasoning is misaligned. We thus evaluate if linear probes can robustly detect deception by monitoring model activations. We test two probe-training datasets, one with contrasting instructions to be honest or deceptive (following Zou et al., 2023) and one of responses to simple roleplaying scenarios. We test whether these probes generalize to realistic settings where Llama-3.3-70B-Instruct behaves deceptively, such as concealing insider trading (Scheurer, 2023) and purposefully underperforming on safety evaluations (Benton, 2024). We find that our probe distinguishes honest and deceptive responses with AUROCs between 0.96 and 0.999 on our evaluation datasets. If we set the decision threshold to have a 1\% false positive rate on chat data not related to deception, our probe catches 95-99\% of the deceptive responses. Overall we think white-box probes are promising for future monitoring systems, but current performance is insufficient as a robust defence against deception. Our probes' outputs can be viewed at REDACTED and our code at REDACTED.
Paperid:2735
Authors:Qi Zhou · Tianlin Li · Qing Guo · Dongxia Wang · Yun Lin · Yang Liu · Jin Song Dong
Abstract:
Recent studies have raised significant concerns regarding the vulnerability of Large Vision Language Models (LVLMs) to maliciously injected or perturbed input images, which can mislead their responses. Existing defense methods show that such vision attacks are sensitive to image modifications especially cropping, using majority voting across responses of modified images as corrected responses. However, these modifications often result in partial images and distort the semantics, which reduces response quality on clean images after voting. Instead of directly using responses from partial images for voting, we investigate using them to supervise the LVLM's responses to the original images. We propose a black-box, training-free method called DPS (Defense through Partial-Perception Supervision). In this approach, the model is prompted using the responses generated by a model that perceives only a partial image. With DPS, the model can adjust its response based on partial image understanding when under attack, while confidently maintaining its original response for clean input. Our findings show that the weak model can supervise the strong model: when faced with an attacked input, the strong model becomes less confident and adjusts its response based on the weak model's partial understanding, effectively defending against the attack. With clean input, it confidently maintains its original response. Empirical experiments show our method outperforms the baseline, cutting the average attack success rate by 76.3% across six datasets on three popular models.
Paperid:2736
Authors:YUANYUAN YAO · Yuan Dong · Lu Chen · Kun Kuang · Ziquan Fang · Cheng Long · Yunjun Gao · TIANYI LI
Abstract:
Current causal discovery methods for time series data can effectively address a variety of scenarios; however, they remain constrained by inefficiencies. The significant inefficiencies arise primarily from the high computational costs associated with binning, the uncertainty in selecting appropriate time lags, and the extensive sets of candidate variables. To achieve both high efficiency and effectiveness of causal discovery, we introduce an accelerator termed Arrow. It incorporates an innovative concept termed "Time Weaving" that efficiently encodes time series data to well capture the dynamic trends, thereby mitigating computational complexity. We also propose a novel time lag discovery strategy utilizing XOR operations, deriving a theorem to obtain the optimal time lag and significantly enhancing efficiency. To optimize the search space for causal relationships, we design an efficient pruning strategy that intelligently identifies the most relevant candidate variables, enhancing the efficiency and accuracy of causal discovery. We applied Arrow to four different types of time series causal discovery algorithms and evaluated it on 25 synthetic and real-world datasets. The results demonstrate that, compared to the original algorithms, Arrow achieves up to 153x speedup while achieving higher accuracy in most cases. The code is available at https://anonymous.4open.science/r/arrow.
Paperid:2737
Authors:Weiyu Guo · Ziyue Qiao · Ying Sun · Yijie Xu · Hui Xiong
Abstract:
Gesture recognition based on surface electromyography (sEMG) has been gaining importance in many 3D Interactive Scenes. However, sEMG is easily influenced by various forms of noise in real-world environments, leading to challenges in providing long-term stable interactions through sEMG. Existing methods often struggle to enhance model noise resilience through various predefined data augmentation techniques. In this work, we revisit the problem from a short-term enhancement perspective to improve precision and robustness against various common noisy scenarios, with learnable denoising that uses sEMG's intrinsic pattern information and sliding-window attention. We propose a Short Term Enhancement Module (STEM), which can be easily integrated with various models. STEM offers several benefits: 1) Noise resistance, enhanced robustness against noise without manual data augmentation; 2) Adaptability, adaptable to various models; and 3) Inference efficiency, achieving short-term enhancement through minimal weight-sharing in an efficient attention mechanism. In particular, we incorporate STEM into a transformer, creating the Short-Term Enhanced Transformer (STET). Compared with the best-competing approaches, the impact of noise on STET is reduced by more than 20\%. We report promising results on classification and regression tasks and demonstrate that STEM generalizes across different gesture recognition tasks. The code is available at https://anonymous.4open.science/r/shorttermsemg.
Paperid:2738
Authors:Yu Zhang · Shanshan Zhao · Bokui Wan · Jinjuan Wang · Xiaodong Yan
Abstract:
Detecting a minor average treatment effect is a major challenge in large-scale applications, where even minimal improvements can have a significant economic impact. Traditional methods, reliant on normal distribution-based or expanded statistics, often fail to identify such minor effects because of their inability to handle small discrepancies with sufficient sensitivity. This work leverages a counterfactual outcome framework and proposes a maximum probability-driven two-armed bandit process by weighting the mean volatility statistic, which controls Type I error. The implementation of permutation methods further enhances the robustness and efficacy. The established strategic central limit theorem (SCLT) demonstrates that our approach yields a more concentrated distribution under the null hypothesis and a less concentrated one under the alternative hypothesis, greatly improving statistical power. The experimental results indicate a significant improvement in A/B testing, highlighting the potential to reduce experimental costs while maintaining high statistical power.
Paperid:2739
Authors:Adrià López Escoriza · Nicklas Hansen · Stone Tao · Tongzhou Mu · Hao Su
Abstract:
Long-horizon tasks in robotic manipulation present significant challenges in reinforcement learning (RL) due to the difficulty of designing dense reward functions and effectively exploring the expansive state-action space. However, despite a lack of dense rewards, these tasks often have a multi-stage structure, which can be leveraged to decompose the overall objective into manageable sub-goals. In this work, we propose DEMO³, a framework that exploits this structure for efficient learning from visual inputs. Specifically, our approach incorporates multi-stage dense reward learning, a bi-phasic training scheme, and world model learning into a carefully designed demonstration-augmented RL framework that strongly mitigates the challenge of exploration in long-horizon tasks. Our evaluations demonstrate that our method improves data-efficiency by an average of 40% and by 70% on particularly difficult tasks compared to state-of-the-art approaches. We validate this across 16 sparse-reward tasks spanning four domains, including challenging humanoid visual control tasks using as few as five demonstrations.
Paperid:2740
Authors:Muyang Cao · Jiajun Yu · Xin Du · Gang Pan · Wei Wang
Abstract:
The Metric Nearness Problem involves restoring a non-metric matrix to its closest metric-compliant form, addressing issues such as noise, missing values, and data inconsistencies. Ensuring metric properties, particularly the $O(N^3)$ triangle inequality constraints, presents significant computational challenges, especially in large-scale scenarios where traditional methods suffer from high time and space complexity. We propose a novel solution based on the tropical inner product (max-plus operation), which we prove satisfies the triangle inequality for non-negative real matrices. By transforming the problem into a continuous optimization task, our method directly minimizes the distance to the target matrix using the RMSProp algorithm. This approach not only restores metric properties but also generates metric-preserving embeddings, enabling real-time updates and reducing computational and storage overhead for downstream tasks. Experimental results demonstrate that our method achieves up to 1000× speed improvements over state-of-the-art approaches, and efficiently scales from $1e4 \times 1e4$ to $1e5 \times 1e5$ matrices with significantly lower memory usage.
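The core claim, that the tropical (max-plus) inner product of a non-negative matrix with itself satisfies the triangle inequality, can be checked numerically in a few lines of numpy. The sketch below is illustrative only (matrix size and names are arbitrary): with $D_{ij} = \max_k (A_{ik} + A_{jk})$ and $A \ge 0$, taking the maximizing index $m^*$ for $D_{ij}$ gives $D_{ik} + D_{kj} \ge A_{im^*} + A_{jm^*} = D_{ij}$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(0, 1, size=(6, 4))  # arbitrary non-negative matrix

# Tropical (max-plus) inner product: D[i, j] = max_k (A[i, k] + A[j, k])
D = (A[:, None, :] + A[None, :, :]).max(axis=2)

# Verify D[i, j] <= D[i, k] + D[k, j] for every triple (i, j, k)
n = D.shape[0]
ok = all(
    D[i, j] <= D[i, k] + D[k, j] + 1e-12
    for i in range(n) for j in range(n) for k in range(n)
)
print(ok)  # True
```

Because the check never fails for non-negative `A`, any matrix produced this way is automatically metric-compliant, which is what lets the method trade constrained projection for unconstrained continuous optimization.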
Paperid:2741
Authors:Ignavier Ng · Yan Li · Zijian Li · Yujia Zheng · Guangyi Chen · Kun Zhang
Abstract:
A central problem in unsupervised domain adaptation is determining what to transfer from labeled source domains to an unlabeled target domain. To handle high-dimensional observations (e.g., images), a line of approaches use deep learning to learn latent representations of the observations, which facilitate knowledge transfer in the latent space. However, existing approaches often rely on restrictive assumptions to establish identifiability of the joint distribution in the target domain, such as independent latent variables or invariant label distributions, limiting their real-world applicability. In this work, we propose a general domain adaptation framework that learns compact latent representations to capture distribution shifts relative to the prediction task and address the fundamental question of what representations should be learned and transferred. Notably, we first demonstrate that learning representations based on all the predictive information, i.e., the label's Markov blanket in terms of the learned representations, is often underspecified in general settings. Instead, we show that, interestingly, general domain adaptation can be achieved by partitioning the representations of the Markov blanket into those of the label's parents, children, and spouses. Moreover, its identifiability guarantee can be established. Building on these theoretical insights, we develop a practical, nonparametric approach for domain adaptation in a general setting, which can handle different types of distribution shifts.
Paperid:2742
Authors:Zhengrui Guo · Qichen Sun · Jiabo MA · Lishuang Feng · Jinzhuo Wang · Hao Chen
Abstract:
Whole slide image (WSI) analysis presents significant computational challenges due to the massive number of patches in gigapixel images. While transformer architectures excel at modeling long-range correlations through self-attention, their quadratic computational complexity makes them impractical for computational pathology applications. Existing solutions like local-global or linear self-attention reduce computational costs but compromise the strong modeling capabilities of full self-attention. In this work, we propose Querent, i.e., the query-aware long contextual dynamic modeling framework, which maintains the expressive power of full self-attention while achieving practical efficiency. Our method adaptively predicts which surrounding regions are most relevant for each patch, enabling focused yet unrestricted attention computation only with potentially important contexts. By using efficient region-wise metadata computation and importance estimation, our approach dramatically reduces computational overhead while preserving global perception to model fine-grained patch correlations. Through comprehensive experiments on biomarker prediction, gene mutation prediction, cancer subtyping, and survival analysis across over 10 WSI datasets, our method demonstrates superior performance compared to the state-of-the-art approaches.
Paperid:2743
Authors:Fanmeng Wang · Wentao Guo · Qi Ou · Hongshuai Wang · Haitao Lin · Hongteng Xu · Zhifeng Gao
Abstract:
Polymer conformation generation is a critical task that enables atomic-level studies of diverse polymer materials. While significant advances have been made in conformation generation methods for small molecules and proteins, these methods struggle to generate polymer conformations due to polymers' unique structural characteristics. The scarcity of polymer conformation datasets further limits the progress of this unexplored research area. In this work, we propose PolyConf, a pioneering tailored polymer conformation generation method that leverages hierarchical generative models to unlock new possibilities for this task. Specifically, we decompose the polymer conformation into a series of local conformations (i.e., the conformations of its repeating units), generating these local conformations through an autoregressive model. We then generate corresponding orientation transformations via a diffusion model to assemble these local conformations into the complete polymer conformation. Moreover, we develop the first benchmark with a high-quality polymer conformation dataset derived from molecular dynamics simulations to boost related research in this area. Comprehensive evaluation demonstrates that PolyConf consistently generates high-quality, physically reliable polymer conformations, facilitating advancements in polymer modeling and simulation.
Paperid:2744
Authors:Hongyi Zhou · Josiah Hanna · Jin Zhu · Ying Yang · Chengchun Shi
Abstract:
This paper studies off-policy evaluation (OPE) in reinforcement learning with a focus on behavior policy estimation for importance sampling. Prior work has shown empirically that estimating a history-dependent behavior policy can lead to lower mean squared error (MSE) even when the true behavior policy is Markovian. However, the question of why the use of history should lower MSE remains open. In this paper, we theoretically demystify this paradox by deriving a bias-variance decomposition of the MSE of ordinary importance sampling (IS) estimators, demonstrating that history-dependent behavior policy estimation decreases their asymptotic variances while increasing their finite-sample biases. Additionally, as the estimated behavior policy conditions on a longer history, we show a consistent decrease in variance. We extend these findings to a range of other OPE estimators, including the sequential IS estimator, the doubly robust estimator and the marginalized IS estimator, with the behavior policy estimated either parametrically or non-parametrically.
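For readers unfamiliar with the estimator being decomposed, the ordinary IS estimator weights each trajectory's return by the product of per-step target/behavior probability ratios; replacing the denominator with an estimated (possibly history-dependent) policy is what the paper analyzes. A minimal sketch with illustrative names (`pi_target`, `pi_behavior` are assumed to return action probabilities):

```python
import numpy as np

def ordinary_is(trajectories, pi_target, pi_behavior):
    """Ordinary importance sampling: weight each trajectory's total return
    by the product of per-step target/behavior probability ratios."""
    estimates = []
    for traj in trajectories:  # traj = [(state, action, reward), ...]
        weight, ret = 1.0, 0.0
        for state, action, reward in traj:
            weight *= pi_target(state, action) / pi_behavior(state, action)
            ret += reward
        estimates.append(weight * ret)
    return float(np.mean(estimates))

# Sanity check: with identical policies every weight is 1,
# so the estimate reduces to the mean observed return
trajs = [[(0, 0, 1.0), (1, 1, 2.0)], [(0, 1, 0.0)]]
uniform = lambda s, a: 0.5
print(ordinary_is(trajs, uniform, uniform))  # 1.5
```

The paper's finding concerns what to plug in for `pi_behavior`: an estimate conditioning on more history lowers the asymptotic variance of these weighted returns at the cost of finite-sample bias.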
Paperid:2745
Authors:Leif Döring · Benedikt Wille · Maximilian Birr · Mihail Bîrsan · Martin Slowik
Abstract:
Bias problems in the estimation of Q-values are a well-known obstacle that slows down convergence of Q-learning and actor-critic methods. Part of the success of modern RL algorithms stems from direct or indirect overestimation-reduction mechanisms. We introduce an easy-to-implement method built on top of distributional reinforcement learning (DRL) algorithms to deal with overestimation in a locally adaptive way. Our framework ADDQ is simple to implement; existing DRL implementations can be improved with a few lines of code. We provide theoretical backup and experimental results in tabular, Atari, and MuJoCo environments, comparisons with state-of-the-art methods, and a proof of convergence in the tabular case.
Paperid:2746
Authors:Mohammed Adnan · Rohan Jain · Ekansh Sharma · Rahul G. Krishnan · Yani Ioannou
Abstract:
The Lottery Ticket Hypothesis (LTH) suggests there exists a sparse LTH mask and weights that achieve the same generalization performance as the dense model while using significantly fewer parameters. However, finding an LTH solution is computationally expensive, and an LTH's sparsity mask does not generalize to other random weight initializations. Recent work has suggested that neural networks trained from random initialization find solutions within the same basin modulo permutation, and proposed a method to align trained models within the same loss basin. We hypothesize that misalignment of basins is the reason why LTH masks do not generalize to new random initializations, and propose permuting the LTH mask to align with the new optimization basin when performing sparse training from a different random initialization. We empirically show a significant increase in generalization when sparse training from random initialization with the permuted mask as compared to using the non-permuted LTH mask, on multiple datasets (CIFAR-10/100 & ImageNet) and models (VGG11 & ResNet20/50).
Paperid:2747
Authors:Xinsong Ma · Xin Zou · Weiwei Liu
Abstract:
Out-of-distribution (OOD) detection is significant in reliable and safety-critical applications. Existing approaches primarily focus on developing powerful score functions, but overlook the design of decision-making rules based on these score functions. In contrast to prior studies, we rethink the OOD detection task from the perspective of online multiple hypothesis testing. We then propose a novel generalized LOND (g-LOND) algorithm to solve the above problem. Theoretically, the g-LOND algorithm controls the false discovery rate (FDR) at a pre-specified level regardless of the dependence between the p-values. Furthermore, we prove that the false positive rate (FPR) of the g-LOND algorithm converges to zero in probability for the generalized Gaussian-like distribution family. Finally, extensive experimental results verify the effectiveness of the g-LOND algorithm for OOD detection.
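For context, the standard LOND procedure that g-LOND generalizes tests each incoming p-value at a level that grows with the number of discoveries made so far. A minimal sketch of plain LOND (not the paper's g-LOND variant), using a simple summable gamma sequence:

```python
def lond(p_values, alpha=0.05):
    """Plain LOND online multiple-testing procedure.

    At time t, test at level gamma_t * alpha * (discoveries so far + 1),
    where gamma_t = 1/(t*(t+1)) sums to 1 over t = 1, 2, ...
    Sketch for illustration; the paper's g-LOND generalizes this rule.
    """
    decisions = []
    d = 0  # number of discoveries (rejections) so far
    for t, p in enumerate(p_values, start=1):
        gamma_t = 1.0 / (t * (t + 1))
        level = gamma_t * alpha * (d + 1)
        reject = p <= level
        decisions.append(reject)
        d += reject
    return decisions

decs = lond([0.001, 0.9, 0.004, 0.5])
```

Each rejection raises the test levels for all later hypotheses, which is what lets the procedure control FDR while remaining fully online.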
Paperid:2748
Authors:Gaozheng Pei · Ke Ma · Yingfei Sun · Qianqian Xu · Qingming Huang
Abstract:
Diffusion-based adversarial purification methods attempt to drown adversarial perturbations in isotropic noise through the forward process, and then recover the clean images through the reverse process. Due to the lack of distribution information about adversarial perturbations in the pixel domain, it is often unavoidable to damage normal semantics. We turn to the frequency-domain perspective, decomposing the image into its amplitude spectrum and phase spectrum. We find that for both spectra, the damage caused by adversarial perturbations tends to increase monotonically with frequency. This means that we can extract the content and structural information of the original clean sample from the frequency components that are less damaged. Meanwhile, theoretical analysis indicates that existing purification methods indiscriminately damage all frequency components, leading to excessive damage to the image. Therefore, we propose a purification method that can eliminate adversarial perturbations while maximizing the preservation of the content and structure of the original image. Specifically, at each time step during the reverse process, for the amplitude spectrum, we replace the low-frequency components of the estimated image's amplitude spectrum with the corresponding parts of the adversarial image. For the phase spectrum, we project the phase of the estimated image into a designated range of the adversarial image's phase spectrum, focusing on the low frequencies. Empirical evidence from extensive experiments demonstrates that our method significantly outperforms most current defense methods. Code: \url{https://anonymous.4open.science/r/FreqPure}.
Paperid:2749
Authors:Ali Ebrahimpour-Boroojeny · Hari Sundaram · Varun Chandrasekaran
Abstract:
Machine unlearning, where users can request the deletion of a forget dataset, is becoming increasingly important because of numerous privacy regulations. Initial works on "exact'' unlearning (e.g., retraining) incur large computational overheads. However, while computationally inexpensive, "approximate'' methods have fallen short of reaching the effectiveness of exact unlearning: models produced fail to obtain comparable accuracy and prediction confidence on both the forget and test (i.e., unseen) datasets. Exploiting this observation, we propose a new unlearning method, Adversarial Machine UNlearning (AMUN), that outperforms prior state-of-the-art (SOTA) methods for image classification. AMUN lowers the confidence of the model on the forget samples by fine-tuning the model on their corresponding adversarial examples. Adversarial examples naturally belong to the distribution imposed by the model on the input space; fine-tuning the model on the adversarial examples closest to the corresponding forget samples (a) localizes the changes to the decision boundary of the model around each forget sample and (b) avoids drastic changes to the global behavior of the model, thereby preserving the model's accuracy on test samples. Using AMUN for unlearning a random 10% of CIFAR-10 samples, we observe that even SOTA membership inference attacks cannot do better than random guessing.
Paperid:2750
Authors:Wenjun Zhang · Liangxiao Jiang · Chaoqun Li
Abstract:
Label completion serves as a preprocessing approach to handling the sparse crowdsourced label matrix problem, significantly boosting the effectiveness of the downstream label aggregation. In recent advances, worker modeling has been proved to be a powerful method to further improve the performance of label completion. However, in real-world scenarios, workers typically annotate only a few instances, leading to insufficient worker modeling and thus limiting the improvement of label completion. To address this issue, we propose a novel transfer learning-based label completion (TLLC) algorithm. Specifically, we first identify all high-confidence instances from the whole crowdsourced data as a source domain and use it to pretrain a Siamese network. The abundant annotated instances in the source domain provide essential knowledge for worker modeling. Then, we transfer the pretrained network to the target domain with the instances annotated by each worker separately, ensuring worker modeling captures unique characteristics of each worker. Finally, we leverage the new embeddings learned by the transferred network to complete each worker's missing labels. Extensive experiments on different real-world datasets demonstrate the effectiveness of TLLC.
Paperid:2751
Authors:Cong Hua · Qianqian Xu · Zhiyong Yang · Zitai Wang · Shilong Bao · Qingming Huang
Abstract:
Prompt tuning adapts Vision-Language Models like CLIP to open-world tasks with minimal training costs. In this direction, one typical paradigm evaluates model performance **separately** on known classes (*i.e.*, base domain) and unseen classes (*i.e.*, new domain). However, real-world scenarios require models to handle inputs **without prior domain knowledge**. This practical challenge has spurred the development of **open-world prompt tuning**, which demands a unified evaluation of two stages: 1) detecting whether an input belongs to the base or new domain (**P1**), and 2) classifying the sample into its correct class (**P2**). What's more, as domain distributions are generally unknown, a proper metric should be insensitive to varying base/new sample ratios (**P3**). However, we find that current metrics, including HM, overall accuracy, and AUROC, fail to satisfy these three properties simultaneously. To bridge this gap, we propose $\mathsf{OpenworldAUC}$, a unified metric that jointly assesses detection and classification through pairwise instance comparisons. To optimize $\mathsf{OpenworldAUC}$ effectively, we introduce **Gated Mixture-of-Prompts (GMoP)**, which employs domain-specific prompts and a gating mechanism to dynamically balance detection and classification. Theoretical guarantees ensure generalization of GMoP under practical conditions. Experiments on 15 benchmarks in open-world scenarios show GMoP achieves SOTA performance on $\mathsf{OpenworldAUC}$ and other metrics.
Paperid:2752
Authors:Charlotte Peale · Vinod Raman · Omer Reingold
Abstract:
We introduce "representative generation," extending the theoretical framework for generation proposed by Kleinberg et al. (2024) and formalized by Li et al. (2024), to additionally address diversity and bias concerns in generative models. Our notion requires outputs of a generative model to proportionally represent groups of interest from the training data. We characterize representative uniform and non-uniform generation, introducing the ``group closure dimension'' as a key combinatorial quantity. For representative generation in the limit, we analyze both information-theoretic and computational aspects, demonstrating feasibility for countably infinite hypothesis classes and collections of groups under certain conditions, but proving a negative result for computability using only membership queries. This contrasts with Kleinberg et al.'s (2024) positive results for standard generation in the limit. Our findings provide a rigorous foundation for developing more diverse and representative generative models.
Paperid:2753
Authors:Jing Xiong · Jianghan Shen · Chuanyang Zheng · Zhongwei Wan · Chenyang Zhao · Chiwun Yang · Fanghua Ye · Hongxia Yang · Lingpeng Kong · Ngai Wong
Abstract:
Efficiently handling long contexts is crucial for large language models (LLMs). While rotary position embeddings (RoPEs) enhance length generalization, effective length extrapolation remains challenging and often requires costly fine-tuning. In contrast, recent training-free approaches suffer from the \textit{attention sink} phenomenon, leading to severe performance degradation. In this paper, we introduce ParallelComp, a novel training-free method for long-context extrapolation that extends LLMs' context length from 4K to 128K while maintaining high throughput and preserving perplexity, and integrates seamlessly with Flash Attention. Our analysis offers new insights into attention biases in parallel attention mechanisms and provides practical solutions to tackle these challenges. To mitigate the attention sink issue, we propose an attention calibration strategy that reduces biases, ensuring more stable long-range attention. Additionally, we introduce a chunk eviction strategy to efficiently manage ultra-long contexts on a single A100 80GB GPU. To further enhance efficiency, we propose a parallel KV cache eviction technique, which improves chunk throughput by 1.76×, thereby achieving a 23.50× acceleration in the prefilling stage with negligible performance loss due to attention calibration. Furthermore, ParallelComp achieves 91.17\% of GPT-4's performance on long-context tasks using an 8B model trained on 8K-length context, outperforming closed-source models such as Claude-2 and Kimi-Chat.
Paperid:2754
Authors:Meng-Hao Guo · Jiajun Xu · Yi Zhang · Jiaxi Song · Haoyang Peng · Yi-Xuan Deng · Xinzhi Dong · Kiyohiro Nakayama · Zhengyang Geng · Chen Wang · Bolin Ni · Guo-Wei Yang · Yongming Rao · Houwen Peng · Han Hu · Gordon Wetzstein · Shi-min Hu
Abstract:
Reasoning stands as a cornerstone of intelligence, enabling the synthesis of existing knowledge to solve complex problems. Despite remarkable progress, existing reasoning benchmarks often fail to rigorously evaluate the nuanced reasoning capabilities required for complex, real-world problem-solving, particularly in multi-disciplinary and multimodal contexts. In this paper, we introduce a graduate-level, multi-disciplinary, English-Chinese benchmark, dubbed Reasoning Bench (R-Bench), for assessing the reasoning capability of both language and multimodal models. R-Bench spans 1,094 questions across 108 subjects for language model evaluation and 665 questions across 83 subjects for multimodal model testing in both English and Chinese. These questions are meticulously curated to ensure rigorous difficulty calibration, subject balance, and cross-linguistic alignment, making R-Bench an olympiad-level, multi-disciplinary benchmark. We evaluate widely used models, including OpenAI o1, GPT-4o, DeepSeek-R1, etc. Experimental results indicate that advanced models perform poorly on complex reasoning, especially multimodal reasoning. Even the top-performing model OpenAI o1 achieves only 53.2% accuracy on our multimodal evaluation. Data and code will be made publicly available.
Paperid:2755
Authors:Zijian Liu · Zhengyuan Zhou
Abstract:
We study the convergence of the shuffling gradient method, a popular algorithm employed to minimize the finite-sum function with regularization, in which functions are passed to apply (Proximal) Gradient Descent (GD) one by one in an order determined by a permutation on the indices of the functions. Despite its easy implementation and effective performance in practice, its theoretical understanding remains limited. A recent advance by (Liu & Zhou, 2024b) establishes the first last-iterate convergence results under various settings, especially proving the optimal rates for smooth (strongly) convex optimization. However, their bounds for nonsmooth (strongly) convex functions are only as fast as Proximal GD. In this work, we provide the first improved last-iterate analysis for the nonsmooth case, demonstrating that the widely used Random Reshuffle ($\textsf{RR}$) and Single Shuffle ($\textsf{SS}$) strategies are both provably faster than Proximal GD, reflecting the benefit of randomness. As an important implication, we give the first (nearly) optimal convergence result for the suffix average under the $\textsf{RR}$ sampling scheme in the general convex case, matching the lower bound shown by (Koren et al., 2022).
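The Random Reshuffle scheme analyzed here is simple to state: each epoch draws a fresh permutation of the component functions and takes one gradient step per component in that order (Single Shuffle would reuse one fixed permutation). A minimal sketch on a toy one-dimensional finite sum, with illustrative names:

```python
import random

def random_reshuffle_gd(grads, x0, lr, epochs, seed=0):
    """Random Reshuffle (RR) shuffling gradient method, smooth toy case.

    grads is a list of per-component gradient functions. Each epoch shuffles
    the component order anew, then applies one step per component.
    Sketch only; the paper's setting also covers nonsmooth regularizers
    via proximal steps.
    """
    rng = random.Random(seed)
    x = x0
    n = len(grads)
    for _ in range(epochs):
        order = list(range(n))
        rng.shuffle(order)  # fresh permutation each epoch (RR, not SS)
        for i in order:
            x -= lr * grads[i](x)
    return x

# f_i(x) = (x - c_i)^2 / 2, so the finite-sum minimizer is the mean of c_i.
centers = [1.0, 2.0, 6.0]
grads = [lambda x, c=c: x - c for c in centers]
x_final = random_reshuffle_gd(grads, x0=0.0, lr=0.05, epochs=200)
```

With a constant stepsize the last iterate settles in a small neighborhood of the minimizer (here, the mean 3.0); the paper's contribution is quantifying how fast such last iterates converge in the nonsmooth case.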
Paperid:2756
Authors:Katharina Prasse · Patrick Knab · Sascha Marton · Christian Bartelt · Margret Keuper
Abstract:
Concept Bottleneck Models (CBMs) enhance the interpretability of neural networks by basing predictions on human-understandable concepts. However, current CBMs typically rely on concept sets extracted from large language models or extensive image corpora, limiting their effectiveness in data-sparse scenarios. We propose Data-efficient CBMs (DCBMs), which reduce the need for large sample sizes during concept generation while preserving interpretability. DCBMs define concepts as image regions detected by segmentation or detection foundation models, allowing each image to generate multiple concepts across different granularities. This removes reliance on textual descriptions and large-scale pre-training, making DCBMs applicable for fine-grained classification and out-of-distribution tasks. Attribution analysis using Grad-CAM demonstrates that DCBMs deliver visual concepts that can be localized in test images. By leveraging dataset-specific concepts instead of predefined ones, DCBMs enhance adaptability to new domains.
Paperid:2757
Authors:Zitong Shi · Guancheng Wan · Guibin Zhang · Wenke Huang · He Li · Carl Yang · Mang Ye
Abstract:
Federated Graph Learning (FGL) has gained significant attention as a privacy-preserving approach to collaborative learning, but the computational demands increase substantially as datasets grow and Graph Neural Network (GNN) layers deepen. To address these challenges, we propose $\textbf{EAGLES}$, a unified sparsification framework. EAGLES applies client-consensus parameter sparsification to generate multiple unbiased subnetworks at varying sparsity levels, reducing the need for iterative adjustments and mitigating performance degradation. In the graph structure domain, we introduce a dual-expert approach: a $\textit{graph sparsification expert}$ uses multi-criteria node-level sparsification, and a $\textit{graph synergy expert}$ integrates contextual node information to produce optimal sparse subgraphs. Furthermore, the framework introduces a novel distance metric that leverages node contextual information to measure structural similarity among clients, fostering effective knowledge sharing. We also introduce the $\textbf{Harmony Sparsification Principle}$, through which EAGLES balances model performance with lightweight graph and model structures. Extensive experiments demonstrate its superiority, achieving competitive performance on various datasets, such as reducing training FLOPS by 82\% $\downarrow$ and communication costs by 80\% $\downarrow$ on the ogbn-proteins dataset, while maintaining high performance.
Paperid:2758
Authors:Rahul Sharma · Sumantrak Mukherjee · Andrea Šipka · Eyke Hüllermeier · Sebastian Vollmer · Sergey Redyuk · David A Selby
Abstract:
Explainable AI (XAI) and interpretable machine learning methods help to build trust in model predictions and derived insights, yet also present a perverse incentive for analysts to manipulate XAI metrics to support pre-specified conclusions. This paper introduces the concept of X-hacking, a form of p-hacking applied to XAI metrics such as SHAP values. We show how easily an automated machine learning pipeline can be adapted to exploit model multiplicity at scale: searching a set of 'defensible' models with similar predictive performance to find a desired explanation. We formulate the trade-off between explanation and accuracy as a multi-objective optimisation problem, and illustrate empirically on familiar real-world datasets that, on average, Bayesian optimisation accelerates X-hacking 3-fold for features susceptible to it, versus random sampling. We show that the vulnerability of a dataset to X-hacking can be determined by information redundancy among features. Finally, we suggest possible methods for detection and prevention, and discuss ethical implications for the credibility and reproducibility of XAI.
Paperid:2759
Authors:Nicolas Lell · Ansgar Scherp
Abstract:
Shallow node embeddings like node2vec (N2V) can be used for nodes without features or to supplement existing features with structure-based information. Embedding methods like N2V are limited in their application on new nodes or graphs because their approach to embed nodes is based on random walks. This limits them to the transductive setting where the entire graph, including the test nodes, needs to be available during training. We propose inductive node2vec (iN2V), which combines a post-hoc procedure to compute embeddings for nodes unseen during training and modifications to the original N2V training procedure to prepare the embeddings for this post-hoc procedure. We conduct experiments on several benchmark datasets. Our results demonstrate that iN2V is an effective approach to bringing otherwise transductive embeddings to an inductive setting. Using these embeddings, we improve node classification by up to 5 points, depending on the dataset and the number of unseen nodes. Our iN2V is a plug-in approach to creating new or enriching existing embeddings. It can also be combined with other embedding methods, making it a versatile approach for inductive node representation learning. Code to reproduce the results is available on GitHub (supplementary material during review).
Paperid:2760
Authors:Chao Xu · Xijia Tang · Chenping Hou
Abstract:
In open-environment applications, data are often collected from heterogeneous modalities with distinct encodings, resulting in feature space heterogeneity. This heterogeneity inherently induces label shift, making cross-modal knowledge transfer particularly challenging when the source and target data exhibit simultaneous heterogeneous feature spaces and shifted label distributions. Existing studies address only partial aspects of this issue, leaving the broader problem unresolved. To bridge this gap, we introduce a new concept of Heterogeneous Label Shift (HLS), targeting this critical but underexplored challenge. We first analyze the impact of heterogeneous feature spaces and label distribution shifts on model generalization and introduce a novel error decomposition theorem. Based on these insights, we propose a bound minimization HLS framework that decouples and tackles feature heterogeneity and label shift accordingly. Extensive experiments on various benchmarks for cross-modal classification validate the effectiveness and practical relevance of the proposed approach.
Paperid:2761
Authors:Minh Vu · Geigh Zollicoffer · Huy Mai · Ben Nebgen · Boian S Alexandrov · Manish Bhattarai
Abstract:
Multimodal Machine Learning systems, particularly those aligning text and image data like CLIP/BLIP models, have become increasingly prevalent, yet remain susceptible to adversarial attacks. While substantial research has addressed adversarial robustness in unimodal contexts, defense strategies for multimodal systems are underexplored. This work investigates the topological signatures that arise between image and text embeddings and shows how adversarial attacks disrupt their alignment, introducing distinctive signatures. We specifically leverage persistent homology and introduce two novel Topological-Contrastive losses based on Total Persistence and Multi-scale kernel methods to analyze the topological signatures introduced by adversarial perturbations. We observe a pattern of monotonic changes in the proposed topological losses emerging in a wide range of attacks on image-text alignments, as more adversarial samples are introduced in the data. By designing an algorithm to back-propagate these signatures to input samples, we are able to integrate these signatures into Maximum Mean Discrepancy tests, creating a novel class of tests that leverage topological signatures for better adversarial detection.
Paperid:2762
Authors:Yujin Oh · Pengfei Jin · Sangjoon Park · Sekeun Kim · Siyeop yoon · Jin Kim · Kyungsang Kim · Xiang Li · Quanzheng Li
Abstract:
Ensuring fairness in medical image segmentation is critical due to biases in imbalanced clinical data acquisition caused by demographic attributes (e.g., age, sex, race) and clinical factors (e.g., disease severity). To address these challenges, we introduce Distribution-aware Mixture of Experts (dMoE), inspired by optimal control theory. We provide a comprehensive analysis of its underlying mechanisms and clarify dMoE's role in adapting to heterogeneous distributions in medical image segmentation. Furthermore, we integrate dMoE into multiple network architectures, demonstrating its broad applicability across diverse medical image analysis tasks. By incorporating demographic and clinical factors, dMoE achieves state-of-the-art performance on two 2D benchmark datasets and a 3D in-house dataset. Our results highlight the effectiveness of dMoE in mitigating biases from imbalanced distributions, offering a promising approach to bridging control theory and medical image segmentation within fairness learning paradigms. The source code will be made available.
Paperid:2763
Authors:Valentin De Bortoli · Alexandre Galashov · Arthur Gretton · Arnaud Doucet
Abstract:
Speculative sampling is a popular technique for accelerating inference in Large Language Models by generating candidate tokens using a fast draft model and accepting or rejecting them based on the target model's distribution. While speculative sampling was previously limited to discrete sequences, we extend it to diffusion models, which generate samples via continuous, vector-valued Markov chains. In this context, the target model is a high-quality but computationally expensive diffusion model. We propose various drafting strategies, including a simple and effective approach that does not require training a draft model and is applicable out of the box to any diffusion model. Our experiments demonstrate significant generation speedup on various diffusion models, halving the number of function evaluations, while generating exact samples from the target model.
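The discrete accept/reject rule that this work extends to diffusion chains is short: accept a draft token with probability min(1, p_target/p_draft), and on rejection resample from the normalized positive residual of the two distributions, which makes the output distribution exactly the target's. A toy sketch over a three-token vocabulary:

```python
import random

def speculative_accept(draft_probs, target_probs, token, rng):
    """One step of discrete speculative sampling (accept/reject + residual).

    Toy sketch of the classical discrete rule; the paper's contribution is
    extending this idea to continuous, vector-valued diffusion chains.
    """
    q, p = draft_probs[token], target_probs[token]
    if rng.random() < min(1.0, p / q):
        return token  # accept the draft token
    # Rejected: resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pt - qt) for pt, qt in zip(target_probs, draft_probs)]
    r = rng.random() * sum(residual)
    acc = 0.0
    for i, w in enumerate(residual):
        acc += w
        if r <= acc:
            return i
    return len(residual) - 1

# Exactness check: outputs should be distributed as the *target*, not the draft.
rng = random.Random(0)
draft, target = [0.7, 0.2, 0.1], [0.3, 0.3, 0.4]
n = 50000
counts = [0, 0, 0]
for _ in range(n):
    tok = rng.choices(range(3), weights=draft)[0]  # draft proposes a token
    counts[speculative_accept(draft, target, tok, rng)] += 1
freqs = [c / n for c in counts]
```

The empirical frequencies match the target distribution even though every proposal came from the draft, which is the "exact samples from the target model" guarantee the abstract refers to.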
Paperid:2764
Authors:M Ganesh Kumar · Blake Bordelon · Jacob A Zavatone-Veth · Cengiz Pehlevan
Abstract:
When rodents learn to navigate in a novel environment, a high density of place fields emerges at reward locations, fields elongate against the trajectory, and individual fields change spatial selectivity while demonstrating stable behavior. Why place fields demonstrate these characteristic phenomena during learning remains elusive. We develop a normative framework using a reward maximization objective, whereby the temporal difference (TD) error drives place field reorganization to improve policy learning. Place fields are modeled using Gaussian radial basis functions to represent states in an environment, and directly synapse to an actor-critic for policy learning. Each field's amplitude, center, and width, as well as downstream weights, are updated online at each time step to maximize cumulative reward. We demonstrate that this framework unifies three disparate phenomena observed in navigation experiments. Furthermore, we show that these place field phenomena improve policy convergence when learning to navigate to a single target and relearning multiple new targets. To conclude, we develop a normative model that recapitulates several aspects of hippocampal place field learning dynamics and unifies mechanisms to offer testable predictions for future experiments.
Paperid:2765
Authors:Zehong Wang · Zheyuan Zhang · Tianyi MA · Nitesh Chawla · Chuxu Zhang · Yanfang Ye
Abstract:
Foundation models aim to create general, cross-task, and cross-domain machine learning models by pretraining on large-scale datasets to capture shared patterns or concepts, such as contours, colors, textures, and edges in images, or tokens, words, and sentences in text. However, identifying generalities across graph-structured data remains a significant challenge, as different graph-based tasks necessitate distinct inductive biases, thereby impeding the development of graph foundation models. To address this challenge, we introduce a novel approach for learning cross-task generalities in graphs. Specifically, we propose task-trees as basic learning instances to align task spaces (node, link, graph) on graphs. Then, we conduct a theoretical analysis to examine their stability, transferability, and generalization. Our findings indicate that when a graph neural network (GNN) is pretrained on diverse task-trees using a reconstruction objective, it acquires transferable knowledge, enabling effective adaptation to downstream tasks with an appropriate set of fine-tuning samples. To empirically validate this approach, we develop a pretrained graph model based on task-trees, termed Graph Generality Identifier on Task-Trees (GIT). Extensive experiments demonstrate that a single pretrained GIT model can be effectively adapted to over 30 different graphs across five domains via fine-tuning, in-context learning, or zero-shot learning.
Paperid:2766
Authors:Jingruo Sun · Wenzhi Gao · Ellen Vitercik · Yinyu Ye
Abstract:
Online linear programming (OLP) has found broad applications in revenue management and resource allocation. State-of-the-art OLP algorithms achieve low regret by repeatedly solving linear programming (LP) subproblems that incorporate updated resource information. However, LP-based methods are computationally expensive and often inefficient for large-scale applications. By contrast, recent first-order OLP algorithms are more computationally efficient but typically suffer from worse regret guarantees. To address these shortcomings, we propose a new algorithm that combines the strengths of LP-based and first-order OLP methods. The algorithm re-solves the LP subproblems periodically at a predefined frequency $f$ and uses the latest dual prices to guide online decision-making. In parallel, a first-order method runs during each interval between LP re-solves and smooths resource consumption. Our algorithm achieves $\mathcal{O}(\log (T/f) + \sqrt{f})$ regret and delivers a "wait-less" online decision-making process that balances computational efficiency and superior regret. Extensive experiments demonstrate at least $20$-fold improvements in regret over pure first-order methods and $100$-fold improvements in runtime over pure LP-based methods.
Paperid:2767
Authors:Xintao Wang · Heng Wang · Yifei Zhang · Rui Xu · Jen-Tse Huang · Xinfeng Yuan · Haoran Guo · Siyu Yuan · Jiangjie Chen · Wei Wang · Yanghua Xiao · Shuchang Zhou
Abstract:
Role-playing language agents (RPLAs) have emerged as promising applications of large language models (LLMs). However, simulating established characters presents a challenging task for RPLAs, due to the lack of authentic character datasets and nuanced evaluation methods using such data. In this paper, we present CoSER, a collection of a high-quality dataset, open models, and an evaluation protocol towards effective RPLAs of established characters. The CoSER dataset covers 17,966 characters from 771 renowned books. It provides authentic dialogues with real-world intricacies, as well as diverse data types such as character experiences and internal thoughts. Drawing from acting methodology, we introduce given-circumstance acting for training and evaluating role-playing LLMs, where LLMs sequentially portray multiple characters in book scenes. Using our dataset, we develop CoSER 8B and CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models. Extensive experiments demonstrate the value of the CoSER dataset for RPLA training, evaluation and retrieval. Moreover, CoSER 70B exhibits state-of-the-art performance surpassing or matching GPT-4o on our evaluation and three existing benchmarks, i.e., achieving 75.80% and 93.47% accuracy on the InCharacter and LifeChoice benchmarks respectively. We release our code, dataset and models for future studies.
Paperid:2768
Authors:Yuchen Zhu · Daniel Augusto de Souza · Zhengyan Shi · Mengyue Yang · Pasquale Minervini · Alexander D'Amour · Matt Kusner
Abstract:
We address the problem of reward hacking, where maximising a proxy reward does not necessarily increase the true reward. This is a key concern for Large Language Models (LLMs), as they are often fine-tuned on human preferences that may not accurately reflect a true objective. Existing work uses various tricks such as regularisation, tweaks to the reward model, and reward hacking detectors, to limit the influence that such proxy preferences have on a model. Luckily, in many contexts such as medicine, education, and law, a small amount of expert data is often available. In these cases, it is often unclear whether the addition of proxy data can improve policy learning. We outline a set of sufficient conditions on proxy feedback that, if satisfied, indicate that proxy data can provably improve the sample complexity of learning the ground truth policy. These conditions can inform the data collection process for specific tasks. The result implies a parameterisation for LLMs that achieves this improved sample complexity. We detail how one can adapt existing architectures to yield this improved sample complexity.
Paperid:2769
Authors:Zhenni Bi · Kai Han · Chuanjian Liu · Yehui Tang · Yunhe Wang
Abstract:
Large Language Models (LLMs) have demonstrated remarkable abilities across various language tasks, but solving complex reasoning problems remains a significant challenge. While existing methods, such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT), enhance reasoning by decomposing problems or structuring prompts, they typically perform a single pass of reasoning and may fail to revisit flawed paths, compromising accuracy. To address this limitation, we propose a novel reasoning framework called Forest-of-Thought (FoT), which integrates multiple reasoning trees to leverage collective decision-making for solving complex logical problems. FoT employs sparse activation strategies to select the most relevant reasoning paths, improving both efficiency and accuracy. Additionally, we introduce a dynamic self-correction strategy that enables real-time error correction, along with consensus-guided decision-making strategies to optimize both correctness and computational resources. Experimental results demonstrate that the FoT framework, combined with these strategies, significantly enhances the reasoning capabilities of LLMs, enabling them to solve complex tasks with greater precision and efficiency.
Paperid:2770
Authors:Hanmo Liu · Shimin Di · Haoyang LI · Xun Jian · Yue Wang · Lei Chen
Abstract:
Node classification is a key task in temporal graph learning (TGL). Real-life temporal graphs often introduce new node classes over time, but existing TGL methods assume a fixed set of classes. This assumption brings limitations, as updating models with full data is costly, while focusing only on new classes results in forgetting old ones. Graph continual learning (GCL) methods mitigate forgetting using old-class subsets but fail to account for their evolution. We define this novel problem as temporal graph continual learning (TGCL), which focuses on efficiently maintaining up-to-date knowledge of old classes. To tackle TGCL, we propose Learning Towards the Future (LTF), a selective learning framework that substitutes the old-class data with its subsets. We derive an upper bound on the error caused by such replacement and transform it into objectives for selecting and learning subsets that minimize classification error while preserving the distribution of the full old-class data. Experiments on three real-world datasets show that LTF effectively addresses the TGCL challenge.
Paperid:2771
Authors:Kento Nishi · Rahul Ramesh · Maya Okawa · Mikail Khona · Hidenori Tanaka · Ekdeep Singh Lubana
Abstract:
Knowledge Editing (KE) algorithms alter models' weights to perform targeted updates to incorrect, outdated, or otherwise unwanted factual associations. However, recent work has shown that applying KE can adversely affect models' broader factual recall accuracy and diminish their reasoning abilities. Although these studies give insights into the potential harms of KE algorithms, e.g., performance evaluations on benchmarks, little is understood about why such destructive failures occur. Motivated by this, we define a novel synthetic task in which a Transformer is trained from scratch to internalize a "structured" knowledge graph. The structure enforces relationships between entities of the graph, such that editing a factual association has "trickling effects" on other entities (e.g., altering X's parent is Y to Z affects who X's siblings' parent is). Through evaluations of edited models on this task, we show that KE inadvertently affects representations of entities beyond the targeted one, distorting relevant structures that allow a model to infer unseen knowledge about an entity. We call this phenomenon representation shattering and demonstrate that it degrades models' factual recall and reasoning performance. We further corroborate our findings in naturalistic settings with pretrained Llama and Mamba models as well. Overall, our work yields a precise mechanistic hypothesis to explain why KE has adverse effects on model abilities.
Paperid:2772
Authors:David Hayden · Mao Ye · Timur Garipov · Gregory Meyer · Carl Vondrick · Zhao Chen · Yuning Chai · Eric M. Wolff · Siddhartha Srinivasa
Abstract:
It is difficult to anticipate the myriad challenges that a predictive model will encounter once deployed. Common practice entails a \emph{reactive}, cyclical approach: model deployment, data mining, and retraining. We instead develop a proactive long-tail discovery process by imagining additional data during training. In particular, we develop general model-based long-tail signals, including a differentiable, single forward pass formulation of epistemic uncertainty that does not impact model parameters or predictive performance but can flag rare or hard inputs. We leverage these signals as guidance to generate additional training data from a latent diffusion model in a process we call Longtail Guidance (LTG). Crucially, we can perform LTG without retraining the diffusion model or the predictive model, and we do not need to expose the predictive model to intermediate diffusion states. Data generated by LTG exhibit semantically meaningful variation, yield significant generalization improvements on image classification benchmarks, and can be analyzed to proactively discover, explain, and address conceptual gaps in a predictive model.
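One common single-forward-pass uncertainty signal of the kind described, though not necessarily the paper's exact formulation, is the negative log-sum-exp "energy" of the logits: it is differentiable, adds no parameters, and flags low-confidence (potentially long-tail) inputs.

```python
import numpy as np

def energy_uncertainty(logits, T=1.0):
    """Negative log-sum-exp 'energy' of the logits: an illustrative
    differentiable, single-forward-pass rarity signal. Higher energy
    suggests a rarer or harder input (an assumption of this sketch,
    not a claim about the paper's exact signal)."""
    logits = np.asarray(logits, dtype=float)
    return -T * np.log(np.sum(np.exp(logits / T), axis=-1))

confident = energy_uncertainty([10.0, 0.0, 0.0])   # peaked logits -> low energy
uncertain = energy_uncertainty([1.0, 1.0, 1.0])    # flat logits -> high energy
print(confident, uncertain)
```

Inputs whose energy is high could then be used to guide a generator toward producing more data like them, in the spirit of LTG.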
Paperid:2773
Authors:Haoyu Wang · Zeyu Qin · Li Shen · Xueqian Wang · Minhao Cheng · Dacheng Tao
Abstract:
Training safe LLMs is one of the most critical research challenges. However, the commonly used method, Refusal Training (RT), struggles to generalize against various OOD jailbreaking attacks. Many safety training methods have been proposed to address this issue. While they offer valuable insights, we aim to complement this line of research by investigating whether OOD attacks truly exceed the capability of the RT model. Evaluating with Best-of-N (BoN) sampling, we observe significant improvements in generalization as N increases. This underscores that the model possesses sufficient safety-related latent knowledge, but RT fails to consistently elicit this knowledge when addressing OOD attacks. Further analysis based on domain adaptation reveals that training with direct refusal causes the model to rely on superficial shortcuts, resulting in the learning of non-robust representation mappings. Based on our findings, we propose training the model to perform safety reasoning for each query. Reasoning supervision encourages the model to perform more computation, explicitly eliciting and using latent knowledge through reasoning. To achieve this, we synthesize reasoning supervision based on pre-guidelines, training the model to reason in alignment with them, thereby effectively eliciting and utilizing latent knowledge from diverse perspectives. Extensive experiments show that our method significantly improves generalization performance against OOD attacks.
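Best-of-N (BoN) evaluation, as used above to probe latent safety knowledge, can be sketched as follows; the sampler and safety scorer are hypothetical stand-ins for a base model and a safety reward model.

```python
from itertools import cycle

def best_of_n(prompt, sample, score, n=8):
    """Draw n candidate responses and keep the one the scorer ranks highest.
    As n grows, rarely elicited (latent) safe behavior surfaces more often."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Hypothetical stand-ins: the sampler sometimes refuses, and the
# safety scorer prefers the refusal.
pool = cycle(["comply (unsafe)", "refuse politely (safe)"])
sample = lambda prompt: next(pool)
score = lambda response: 1.0 if response.endswith("(safe)") else 0.0

print(best_of_n("adversarial prompt", sample, score, n=8))  # the safe refusal
```

The point of the diagnostic: if BoN with a modest N already recovers safe behavior, the knowledge is present and the failure is one of elicitation, not capability.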
Paperid:2774
Authors:Jonghyun Shin · Namjun Kim · Geonho Hwang · Sejun Park
Abstract:
The exact minimum width that allows for universal approximation of unbounded-depth networks is known only for ReLU and its variants. In this work, we study the minimum width of networks using general activation functions. Specifically, we focus on squashable functions that can approximate the identity function and binary step function by alternatively composing with affine transformations. We show that for networks using a squashable activation function to universally approximate $L^p$ functions from $[0,1]^{d_x}$ to $\mathbb R^{d_y}$, the minimum width is $\max\{d_x,d_y,2\}$ unless $d_x=d_y=1$; the same bound holds for $d_x=d_y=1$ if the activation function is monotone. We then provide sufficient conditions for squashability and show that all non-affine analytic functions and a class of piecewise functions are squashable, i.e., our minimum width result holds for those general classes of activation functions.
Paperid:2775
Authors:Xin Wang · Hanxiao Tao · Re-Bing Wu
Abstract:
Quantum machine learning models incorporating data re-uploading circuits have garnered significant attention due to their exceptional expressivity and trainability. However, their ability to generate accurate predictions on unseen data, referred to as the predictive performance, remains insufficiently investigated. This study reveals a fundamental limitation in predictive performance when deep encoding layers are employed within the data re-uploading model. Concretely, we theoretically demonstrate that when processing high-dimensional data with limited-qubit data re-uploading models, their predictive performance progressively degenerates to near random-guessing levels as the number of encoding layers increases. In this context, the repeated data uploading cannot mitigate the performance degradation. These findings are validated through experiments on both synthetic linearly separable datasets and real-world datasets. Our results demonstrate that when processing high-dimensional data, the quantum data re-uploading models should be designed with wider circuit architectures rather than deeper and narrower ones.
Paperid:2776
Authors:Changshuo Wang · Xiang Fang · Prayag Tiwari
Abstract:
Few-shot point cloud semantic segmentation effectively addresses data scarcity by identifying unlabeled query samples through semantic prototypes generated from a small set of labeled support samples. However, pre-training-based methods suffer from domain shifts and increased training time. Additionally, existing methods using DGCNN as the backbone have limited geometric structure modeling capabilities and struggle to bridge the categorical information gap between query and support sets. To address these challenges, we propose DyPolySeg, a pre-training-free Dynamic Polynomial fitting network for few-shot point cloud semantic segmentation. Specifically, we design a unified Dynamic Polynomial Convolution (DyPolyConv) that extracts flat and detailed features of local geometry through Low-order Convolution (LoConv) and Dynamic High-order Convolution (DyHoConv), complemented by a Mamba Block for capturing global context information. Furthermore, we propose a lightweight Prototype Completion Module (PCM) that reduces structural differences through self-enhancement and interactive enhancement between query and support sets. Experiments demonstrate that DyPolySeg achieves state-of-the-art performance on S3DIS and ScanNet datasets.
Paperid:2777
Authors:Idan Attias · Avrim Blum · Keziah Naggita · Donya Saless · Dravyansh Sharma · Matthew Walter
Abstract:
One of the most basic lower bounds in machine learning is that in nearly any nontrivial setting, it takes at least $1/\epsilon$ samples to learn to error $\epsilon$ (and more, if the classifier being learned is complex). However, suppose that data points are agents who have the ability to improve by a small amount if doing so will allow them to receive a (desired) positive classification. In that case, we may actually be able to achieve zero error by just being "close enough". For example, imagine a hiring test used to measure an agent's skill at some job such that for some threshold $\theta$, agents who score above $\theta$ will be successful and those who score below $\theta$ will not (i.e., learning a threshold on the line). Suppose also that by putting in effort, agents can improve their skill level by some small amount $r$. In that case, if we learn an approximation $\hat{\theta}$ of $\theta$ such that $\theta \leq \hat{\theta} \leq \theta + r$ and use it for hiring, we can actually achieve error zero, in the sense that (a) any agent classified as positive is truly qualified, and (b) any agent who truly is qualified can be classified as positive by putting in effort. Thus, the ability for agents to improve has the potential to allow for a goal one could not hope to achieve in standard models, namely zero error. In this paper, we explore this phenomenon more broadly, giving general results and examining under what conditions the ability of agents to improve can allow for a reduction in the sample complexity of learning, or alternatively, can make learning harder. We also examine both theoretically and empirically what kinds of improvement-aware algorithms can take into account agents who have the ability to improve to a limited extent when it is in their interest to do so.
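The threshold example admits a tiny worked check: any estimate in [theta, theta + r] gives zero error in both senses (a) and (b). The numbers below are made up for illustration.

```python
def classified_positive(score, theta_hat, r):
    """An agent improves by up to r exactly when doing so earns a positive label."""
    return score + r >= theta_hat

theta, r = 0.50, 0.10          # hypothetical true threshold and effort budget
theta_hat = 0.55               # any estimate in [theta, theta + r] works
scores = [i / 100 for i in range(101)]

# (a) Everyone classified positive is truly qualified after effort:
#     their improved score is at least max(score, theta_hat) >= theta.
assert all(max(s, theta_hat) >= theta
           for s in scores if classified_positive(s, theta_hat, r))

# (b) Everyone already at or above theta can earn a positive label,
#     since theta_hat <= theta + r <= score + r.
assert all(classified_positive(s, theta_hat, r) for s in scores if s >= theta)
print("zero error for every estimate in [theta, theta + r]")
```

Note the asymmetry with the classical setting: the learner never needs to locate theta exactly, only to land inside a window of width r, which is what breaks the $1/\epsilon$ barrier.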
Paperid:2778
Authors:Bhargav Ganguly · Yang Xu · Vaneet Aggarwal
Abstract:
This paper investigates the potential of quantum acceleration in addressing infinite horizon Markov Decision Processes (MDPs) to enhance average reward outcomes. We introduce an innovative quantum framework for the agent's engagement with an unknown MDP, extending the conventional interaction paradigm. Our approach involves the design of an optimism-driven tabular Reinforcement Learning algorithm that harnesses quantum signals acquired by the agent through efficient quantum mean estimation techniques. Through thorough theoretical analysis, we demonstrate that the quantum advantage in mean estimation leads to exponential advancements in regret guarantees for infinite horizon Reinforcement Learning. Specifically, the proposed Quantum algorithm achieves a regret bound of $\tilde{\mathcal{O}}(1)$\footnote{$\tilde{\mathcal{O}}(\cdot)$ conceals logarithmic terms of $T$.}, a significant improvement over the $\tilde{\mathcal{O}}(\sqrt{T})$ bound exhibited by classical counterparts, where $T$ is the length of the time horizon.
Paperid:2779
Authors:Lily Zhang · Hamid Dadkhahi · Mara Finkelstein · Firas Trabelsi · Jiaming Luo · Markus Freitag
Abstract:
Despite growing interest in incorporating feedback to improve language models, most efforts focus only on sequence-level annotations. In this work, we explore the potential of utilizing fine-grained span-level annotations from offline datasets to improve model quality. We develop a simple fine-tuning algorithm, called Training with Annotations (TWA), to directly train machine translation models on such annotated data. TWA utilizes targeted span-level error information while also flexibly learning what to penalize within a span. Moreover, TWA considers the overall trajectory of a sequence when deciding which non-error spans to utilize as positive signals. Experiments on English-German and Chinese-English machine translation show that TWA outperforms baselines such as supervised fine-tuning on sequences filtered for quality and Direct Preference Optimization on pairs constructed from the same data.
Paperid:2780
Authors:Qingyu Yin · Chak Tou Leong · Hongbo Zhang · Minjun Zhu · Hanqi Yan · Qiang Zhang · Yulan He · Wenjie Li · Jun Wang · Yue Zhang · Linyi Yang
Abstract:
The alignment of large language models (LLMs) with human preferences remains a key challenge. While post-training techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have achieved notable success, they often experience computational inefficiencies and training instability. In this paper, we propose \textbf{F}eature-level constrained \textbf{P}reference \textbf{O}ptimization (FPO), a novel method designed to simplify the alignment process while ensuring stability. FPO leverages pre-trained Sparse Autoencoders (SAEs) and introduces feature-level constraints, allowing for efficient, sparsity-enforced alignment. Our approach gains efficiency by using sparse features activated in a well-trained sparse autoencoder, and gains the quality of sequential KL divergence by using the feature-level offline reference. Experimental results on benchmark datasets demonstrate that FPO achieves an above 5\% absolute improvement in win rate with much lower computational cost compared to state-of-the-art baselines, making it a promising solution for efficient and controllable LLM alignment.
Paperid:2781
Authors:Zahra Babaiee · Peyman Kiasari · Daniela Rus · Radu Grosu
Abstract:
Recent advancements in multimodal large language models have driven breakthroughs in visual question answering. Yet, a critical gap persists: `conceptualization', the ability to recognize and reason about the same concept despite variations in visual form, a basic ability of human reasoning. To address this challenge, we introduce the Visual Graph Arena (VGA), a dataset featuring six graph-based tasks designed to evaluate and improve AI systems’ capacity for visual abstraction. VGA uses graphs rendered in diverse layouts (e.g., Kamada-Kawai vs. planar) to test reasoning independent of visual form. Experiments with state-of-the-art vision models (ViT, Swin Transformers, ConvNeXt) and multimodal LLMs (GPT-o1, Claude 3.5 Sonnet) reveal a striking divide: human participants achieved near-perfect accuracy (88–100\%) across tasks, while models failed outright on isomorphism detection and showed limited success in path/cycle tasks. We further identify behavioral anomalies suggesting pseudo-intelligent pattern matching rather than genuine conceptual understanding. These findings underscore fundamental limitations in current AI models for visual understanding. By isolating the challenge of representation-invariant reasoning, the Visual Graph Arena provides a framework to drive progress toward human-like conceptualization in AI visual models.
Paperid:2782
Authors:Zhen Liu · Yicheng Luo · Boyuan Li · Emadeldeen Eldele · Min Wu · Qianli Ma
Abstract:
Shapelets are discriminative subsequences (or shapes) with high interpretability in time series classification. Due to the time-intensive nature of shapelet discovery, existing shapelet-based methods mainly focus on selecting discriminative shapes while discarding others to achieve candidate subsequence sparsification. However, this approach may exclude beneficial shapes and overlook the varying contributions of shapelets to classification performance. To this end, we propose a Soft sparse Shapes (SoftShape) model for efficient time series classification. Our approach mainly introduces soft shape sparsification and soft shape learning blocks. The former transforms shapes into soft representations based on classification contribution scores, merging lower-scored ones into a single shape to retain and differentiate all subsequence information. The latter facilitates intra- and inter-shape temporal pattern learning, improving model efficiency by using sparsified soft shapes as inputs. Specifically, we employ a learnable router to activate a subset of class-specific expert networks for intra-shape pattern learning. Meanwhile, a shared expert network learns inter-shape patterns by converting sparsified soft shapes into sequences. Extensive experiments demonstrate that SoftShape outperforms the state-of-the-art methods and trains efficiently.
Paperid:2783
Authors:Luca Masserano · Abdul Fatir Ansari · Boran Han · Xiyuan Zhang · Christos Faloutsos · Michael Mahoney · Andrew Wilson · Youngsuk Park · Syama Sundar Yadav Rangapuram · Danielle Maddix · Yuyang Wang
Abstract:
How to best develop foundational models for time series forecasting remains an important open question. Tokenization is a crucial consideration in this effort: what is an effective discrete vocabulary for a real-valued sequential input? To address this question, we develop WaveToken, a wavelet-based tokenizer that allows models to learn complex representations directly in the space of time-localized frequencies. Our method first scales and decomposes the input time series, then thresholds and quantizes the wavelet coefficients, and finally pre-trains an autoregressive model to forecast coefficients for the forecast horizon. By decomposing coarse and fine structures in the inputs, wavelets provide an eloquent and compact language for time series forecasting that simplifies learning. Empirical results on a comprehensive benchmark, including 42 datasets for both in-domain and zero-shot settings, show that WaveToken: i) performs on par or better than recently proposed foundation models for forecasting while using a much smaller vocabulary (1024 tokens), and is competitive with modern deep learning models trained specifically on each dataset; ii) exhibits superior generalization capabilities, achieving the best average rank across all datasets for three complementary metrics; and iii) easily captures complex temporal patterns of practical relevance that are challenging for other recent pre-trained models, including trends, sparse spikes, and non-stationary time series with varying frequencies evolving over time.
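The scale/decompose/threshold/quantize pipeline can be sketched with a one-level Haar transform; the threshold, quantization grid, and vocabulary size below are illustrative stand-ins, not the paper's choices.

```python
import numpy as np

def haar_dwt(x):
    """One level of the Haar wavelet transform: coarse averages + details."""
    x = np.asarray(x, dtype=float)
    return (x[0::2] + x[1::2]) / np.sqrt(2), (x[0::2] - x[1::2]) / np.sqrt(2)

def wavelet_tokenize(series, threshold=0.1, n_tokens=1024):
    """Sketch in the spirit of WaveToken: scale the input, take a wavelet
    decomposition, zero out small coefficients, then quantize the rest
    to a finite vocabulary of token ids. Constants are illustrative."""
    x = np.asarray(series, dtype=float)
    x = x / (np.abs(x).max() + 1e-8)            # scale
    coarse, detail = haar_dwt(x)                # decompose
    coeffs = np.concatenate([coarse, detail])
    coeffs[np.abs(coeffs) < threshold] = 0.0    # threshold small coefficients
    bins = np.linspace(-2.0, 2.0, n_tokens - 1) # quantize to token ids
    return np.digitize(coeffs, bins)

tokens = wavelet_tokenize([1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0])
print(tokens)
```

On this toy series of constant pairs, all detail coefficients vanish after thresholding and map to a single repeated token, illustrating how the wavelet view compresses smooth structure into a small vocabulary.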
Paperid:2784
Authors:Zachary Roch · George Atia · Yue Wang
Abstract:
Robust reinforcement learning (RL) under the average reward criterion, which seeks to optimize long-term system performance in uncertain environments, remains a largely unexplored area. To address this challenge, we propose a reduction-based framework that transforms robust average reward optimization into the more extensively studied robust discounted reward optimization by employing a specific discount factor. Our framework provides two key advantages. Data Efficiency: we design a model-based reduction algorithm that achieves near-optimal sample complexity, enabling efficient identification of optimal robust policies. Scalability: by bypassing the inherent challenges of scaling up average reward optimization, our framework facilitates the design of scalable, convergent algorithms for robust average reward optimization leveraging function approximation. Our algorithmic design, supported by theoretical and empirical analyses, provides a concrete solution to robust average reward RL with the first data efficiency and scalability guarantees, highlighting the framework’s potential to optimize long-term performance under model uncertainty in practical problems.
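The reduction rests on the classical fact that for a discount factor gamma near 1, the discounted value satisfies (1 - gamma) V_gamma ≈ g, the long-run average reward. A small numerical illustration on a made-up two-state Markov reward process (not the paper's algorithm or its robust setting):

```python
import numpy as np

# Illustrative two-state Markov reward process.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])

# Discounted evaluation with gamma near 1: solve V = r + gamma P V.
gamma = 0.999
V = np.linalg.solve(np.eye(2) - gamma * P, r)
g_approx = (1 - gamma) * V            # reduction: (1 - gamma) V_gamma ~ g

# Ground-truth average reward via the stationary distribution of P.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()
g_true = float(pi @ r)                # equals 2/3 for this chain

print(g_true, g_approx)
```

The residual gap is of order (1 - gamma) times the bias function, which is why the framework's choice of a specific discount factor controls the approximation error of the reduction.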
Paperid:2785
Authors:Waïss Azizian · Franck Iutzeler · Jérôme Malick · Panayotis Mertikopoulos
Abstract:
In this paper, we examine the time it takes for stochastic gradient descent (SGD) to reach the global minimum of a general, nonconvex loss function. We approach this question through the lens of large deviations theory and randomly perturbed dynamical systems, and we provide a tight characterization of the associated hitting times of SGD with matching upper and lower bounds. Our analysis reveals that the global convergence time of SGD is dominated by the most "costly" set of obstacles that the algorithm may need to overcome in order to reach a global minimizer, coupling in this way the geometry of the underlying loss landscape with the statistics of the noise entering the process. Finally, motivated by applications to the training of deep neural networks, we provide a series of refinements and extensions of our analysis to, among others, loss functions with no spurious local minima or ones with bounded depths.
Paperid:2786
Authors:Filipp Zmushko · Aleksandr Beznosikov · Martin Takac · Samuel Horváth
Abstract:
With the increase in the number of parameters in large language models, the training process increasingly demands larger volumes of GPU memory. A significant portion of this memory is typically consumed by the optimizer state. To overcome this challenge, recent approaches such as low-rank adaptation (LoRA), low-rank gradient projection (GaLore), and blockwise optimization (BAdam) have been proposed. However, in all these algorithms, the effective rank of the weight updates remains low, which can lead to a substantial loss of information from the gradient. This loss can be critically important, especially during the pre-training stage. In this paper, we introduce FRUGAL (Full-Rank Updates with GrAdient spLitting), a new memory-efficient optimization framework. FRUGAL leverages gradient splitting to perform low-dimensional updates using advanced algorithms (such as Adam), while updates along the remaining directions are executed via state-free methods like SGD or signSGD. Our framework can be integrated with various low-rank update selection techniques, including GaLore and BAdam. We provide theoretical convergence guarantees for our framework when using SGDM for low-dimensional updates and SGD for state-free updates. Additionally, our method consistently outperforms concurrent approaches, achieving state-of-the-art results in pre-training and fine-tuning tasks while balancing memory efficiency and performance metrics.
Paperid:2787
Authors:Mingyu Xing · Lechao Cheng · Shengeng Tang · Yaxiong Wang · Zhun Zhong · Meng Wang
Abstract:
We introduce \textbf{Knowledge Swapping}, a novel task designed to selectively regulate knowledge of a pretrained model by enabling the forgetting of user-specified information, retaining essential knowledge, and acquiring new knowledge simultaneously. By delving into the analysis of the knock-on feature hierarchy, we find that incremental learning typically progresses from low-level representations to higher-level semantics, whereas forgetting tends to occur in the opposite direction, starting from high-level semantics and moving down to low-level features. Building upon this, we propose to benchmark the knowledge swapping task with the strategy of \textit{Learning Before Forgetting}. Comprehensive experiments on various tasks like image classification, object detection, and semantic segmentation validate the effectiveness of the proposed strategy.
Paperid:2788
Authors:William de Vazelhes · Xiaotong Yuan · Bin Gu
Abstract:
In sparse optimization, enforcing hard constraints using the $\ell_0$ pseudo-norm offers advantages like controlled sparsity compared to convex relaxations. However, many real-world applications demand not only sparsity constraints but also some extra constraints. While prior algorithms have been developed to address this complex scenario with mixed combinatorial and convex constraints, they typically require a closed-form projection onto the mixed constraints, which might not exist, and/or only provide local guarantees of convergence, which is different from the global guarantees commonly sought in sparse optimization. To fill this gap, in this paper, we study the problem of sparse optimization with extra *support-preserving* constraints commonly encountered in the literature. We present a new variant of the iterative hard-thresholding algorithm equipped with a two-step consecutive projection operator customized for these mixed constraints, serving as a simple alternative to the Euclidean projection onto the mixed constraint. By introducing a novel trade-off between sparsity relaxation and sub-optimality, we provide global guarantees in objective value for the output of our algorithm, in the deterministic, stochastic, and zeroth-order settings, under the conventional restricted strong-convexity/smoothness assumptions. As a fundamental contribution in proof techniques, we develop a novel extension of the classic three-point lemma to the considered two-step non-convex projection operator, which allows us to analyze the convergence in objective value in an elegant way that has not been possible with existing techniques. In the zeroth-order case, such a technique also improves upon the state-of-the-art result from (de Vazelhes et al., 2022), even in the case without additional constraints, by allowing us to remove a non-vanishing system error present in their work.
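A toy sketch of the two-step consecutive projection idea: a gradient step, then the hard-thresholding projection onto the sparsity constraint, then projection onto an extra support-preserving constraint. Nonnegativity is used here as an illustrative support-preserving constraint, and all problem sizes and step sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sparse nonnegative ground truth and a random sensing matrix (toy sizes).
n, d, k = 80, 30, 3
x_true = np.zeros(d)
x_true[[2, 7, 11]] = [1.5, 2.0, 0.5]
A = rng.standard_normal((n, d)) / np.sqrt(n)
y = A @ x_true

def iht_two_step(A, y, k, steps=500, lr=0.5):
    """Iterative hard thresholding with a two-step consecutive projection:
    (1) keep the k largest-magnitude entries (the l0 constraint),
    (2) project onto the extra support-preserving constraint
        (here: clip to the nonnegative orthant, which never grows the support)."""
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        x = x - lr * A.T @ (A @ x - y)          # gradient step on 0.5||Ax - y||^2
        keep = np.argsort(np.abs(x))[-k:]       # step 1: hard threshold (top-k)
        mask = np.zeros_like(x)
        mask[keep] = 1.0
        x *= mask
        x = np.maximum(x, 0.0)                  # step 2: extra projection
    return x

x_hat = iht_two_step(A, y, k)
print(np.linalg.norm(x_hat - x_true))
```

The key property being illustrated is that the second projection does not enlarge the support chosen by the first, so the composed operator stays a valid (if non-Euclidean) projection onto the mixed constraint set.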
Paperid:2789
Authors:Yuhao Mao · Stefan Balauca · Martin Vechev
Abstract:
Training certifiably robust neural networks is an important but challenging task. While many algorithms for (deterministic) certified training have been proposed, they are often evaluated on different training schedules, certification methods, and systematically undertuned hyperparameters, making it difficult to compare their performance. To address this challenge, we introduce CTBench, a unified library and a high-quality benchmark for certified training that evaluates all algorithms under fair settings and systematically tuned hyperparameters. We show that (1) almost all algorithms in CTBench surpass their performance reported in the literature, by margins on the order of typical algorithmic improvements, thus establishing a new state-of-the-art, and (2) the claimed advantage of recent algorithms drops significantly when we enhance the outdated baselines with a fair training schedule, a fair certification method and well-tuned hyperparameters. Based on CTBench, we provide new insights into the current state of certified training, including (1) certified models have a less fragmented loss surface, (2) certified models share many mistakes, (3) certified models have more sparse activations, (4) reducing regularization cleverly is crucial for certified training, especially for large radii, and (5) certified training has the potential to improve out-of-distribution generalization. We are confident that CTBench will serve as a benchmark and testbed for future research in certified training.
Paperid:2790
Authors:Ruida Wang · Rui Pan · Yuxin Li · Jipeng Zhang · Yizhen Jia · shizhe diao · Renjie Pi · Junjie Hu · Tong Zhang
Abstract:
Solving mathematical problems using computer-verifiable languages like Lean has significantly impacted the mathematical and computer science communities. State-of-the-art methods utilize single Large Language Models (LLMs) as agents or provers to either generate complete proofs or perform tree searches. However, single-agent methods inherently lack a structured way to combine high-level reasoning in Natural Language (NL) with Formal Language (FL) verification feedback. To solve these issues, we propose MA-LoT: a Multi-Agent Lean-based Long Chain-of-Thought framework, to the best of our knowledge the first multi-agent framework for Lean4 theorem proving that balances high-level NL reasoning and FL verification in Long CoT. Using this structured interaction, our approach enables deeper insights and long-term coherence in proof generation, where past methods struggle. We achieve this by leveraging the emergent formal reasoning ability in Long CoT using our novel LoT-Transfer Learning training-inference pipeline. Extensive experiments show that our framework achieves a \textbf{54.92%} accuracy rate on the Lean4 version of the MiniF2F-Test dataset, largely outperforming GPT-4 (22.95%), single-agent tree search (InternLM-Step-Prover, 49.18%), and whole-proof generation (DeepSeek-Prover-v1.5, 48.30%) baselines. Furthermore, our findings highlight the potential of combining Long CoT with formal verification for more insightful generation in a broader perspective.
Paperid:2791
Authors:Weiguo Wang · Andy Nie · Wenrui Zhou · Yi Kai · Chengchen Hu
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities in text and multimodal processing, yet they fundamentally lack physical awareness: understanding of real-world physical phenomena. We explore the feasibility of teaching LLMs physical awareness through sound, focusing on fundamental physical phenomena like the Doppler effect, multipath effect, and spatial relationships. To overcome data scarcity, we introduce a physics-based simulator combining real-world sound sources with controlled physical channels to generate diverse training data. Using this simulator, we build AQA-PHY, a comprehensive Audio Question-Answer dataset, and propose an audio encoder that processes both magnitude and phase information. By connecting our audio encoder to state-of-the-art LLMs, we demonstrate reasonable results in both simulated and real-world tasks, such as line-of-sight detection, Doppler effect estimation, and Direction-of-Arrival estimation, paving the way for enabling LLMs to understand physical phenomena.
Paperid:2792
Authors:Taejin Park · Ivan Medennikov · Kunal Dhawan · Weiqing Wang · He Huang · Nithin Koluguri · Krishna Puvvada · Jagadeesh Balam · Boris Ginsburg
Abstract:
Sortformer is an encoder-based speaker diarization model designed to supervise speaker tagging in speech-to-text models. Instead of relying solely on permutation invariant loss (PIL), Sortformer introduces Sort Loss to resolve the permutation problem, either independently or in tandem with PIL. In addition, we propose a streamlined multi-speaker speech-to-text architecture that leverages Sortformer for speaker supervision, embedding speaker labels into the encoder using sinusoidal kernel functions. This design addresses the speaker permutation problem through sorted objectives, effectively bridging timestamps and tokens to supervise speaker labels in the output transcriptions. Experiments demonstrate that Sort Loss can boost speaker diarization performance, and incorporating the speaker supervision from Sortformer improves multi-speaker transcription accuracy. We anticipate that the proposed Sortformer and multi-speaker architecture will enable the seamless integration of speaker tagging capabilities into foundational speech-to-text systems and multimodal LLMs, offering an easily adoptable and user-friendly mechanism to enhance their versatility and performance in speaker-aware tasks. The code and trained models will be made publicly available through an open-source framework.
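A sketch of the sorted-objective idea: ordering speaker-activity targets by each speaker's first active frame removes the permutation ambiguity that PIL otherwise has to search over, so an ordinary loss can be applied row by row. The toy targets below are illustrative, not from the paper.

```python
import numpy as np

def sort_by_arrival(targets):
    """Order speaker-activity rows by each speaker's first active frame,
    so target rows have a canonical order (Sort-Loss-style supervision)."""
    onsets = [int(np.argmax(row > 0)) if row.any() else row.size
              for row in targets]
    order = np.argsort(onsets, kind="stable")
    return targets[order]

# 3 hypothetical speakers x 6 frames; speaker 2 talks first, then 0, then 1.
t = np.array([[0, 0, 1, 1, 0, 0],
              [0, 0, 0, 0, 1, 1],
              [1, 1, 0, 0, 0, 0]])
print(sort_by_arrival(t))
```

After sorting, output slot 0 always corresponds to the earliest speaker, which is what lets timestamps and tokens be bridged without a permutation search at training time.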
Paperid:2793
Authors:Deheng Yuan · Tao Guo · Zhongyi Huang
Abstract:
Consider the communication-constrained problem of nonparametric function estimation, in which each distributed terminal holds multiple i.i.d. samples. Under certain regularity assumptions, we characterize the minimax optimal rates for all regimes, and identify phase transitions of the optimal rates as the samples per terminal vary from sparse to dense. This fully solves the problem left open by previous works, whose scopes are limited to regimes with either dense samples or a single sample per terminal. To achieve the optimal rates, we design a layered estimation protocol by exploiting protocols for the parametric density estimation problem. We show the optimality of the protocol using information-theoretic methods and strong data processing inequalities, and incorporating the classic balls and bins model. The optimal rates are immediate for various special cases such as density estimation, Gaussian, binary, Poisson and heteroskedastic regression models.
Paperid:2794
Authors:Austin Nguyen · Anri Gu · Michael Wellman
Abstract:
Iterative extension of empirical game models through deep reinforcement learning (RL) has proved an effective approach for finding equilibria in complex games. When multiple equilibria exist, we may also be interested in finding solutions with particular characteristics. We address this issue of equilibrium selection in the context of Policy Space Response Oracles (PSRO), a flexible game-solving framework based on deep RL, by skewing the strategy exploration process towards higher-welfare solutions. At each iteration, we create an exploration policy that imitates high welfare-yielding behavior and train a response to the current solution, regularized to be similar to the exploration policy. With no additional simulation expense, our approach, named Ex$^2$PSRO, tends to find higher welfare equilibria than vanilla PSRO in two benchmarks: a sequential bargaining game and a social dilemma game. Further experiments demonstrate Ex$^2$PSRO's composability with other PSRO variants and illuminate the relationship between exploration policy choice and algorithmic performance.
Paperid:2795
Authors:Soumyabrata Kundu · Risi Kondor
Abstract:
In this work we introduce Steerable Transformers, an extension of the Vision Transformer mechanism that maintains equivariance to the special Euclidean group $\mathrm{SE}(d)$. We propose an equivariant attention mechanism that operates on features extracted by steerable convolutions. Operating in Fourier space, our network utilizes Fourier-space nonlinearities. Our experiments in both two and three dimensions show that adding a steerable transformer encoder layer to a steerable convolution network enhances performance.
Paperid:2796
Authors:Onur Celik · Zechu Li · Denis Blessing · Ge Li · Daniel Palenicek · Jan Peters · Georgia Chalvatzaki · Gerhard Neumann
Abstract:
Maximum entropy reinforcement learning (MaxEnt-RL) has become the standard approach to RL due to its beneficial exploration properties. Traditionally, policies are parameterized using Gaussian distributions, which significantly limits their representational capacity. Diffusion-based policies offer a more expressive alternative, yet integrating them into MaxEnt-RL poses challenges, primarily due to the intractability of computing their marginal entropy. To overcome this, we propose Diffusion-Based Maximum Entropy RL (DIME). DIME leverages recent advances in approximate inference with diffusion models to derive a lower bound on the maximum entropy objective. Additionally, we propose a policy iteration scheme that provably converges to the optimal diffusion policy. Our method enables the use of expressive diffusion-based policies while retaining the principled exploration benefits of MaxEnt-RL, significantly outperforming other diffusion-based methods on challenging high-dimensional control benchmarks. It is also competitive with state-of-the-art non-diffusion-based RL methods while requiring fewer algorithmic design choices and smaller update-to-data ratios, reducing computational complexity.
Paperid:2797
Authors:Xiang Li · Neil Chowdhury · Daniel D. Johnson · Tatsunori Hashimoto · Percy Liang · Sarah Schwettmann · Jacob Steinhardt
Abstract:
Language models exhibit complex, diverse behaviors when prompted with free-form text, making it hard to characterize the space of possible outputs. We study the problem of behavioral elicitation, where the goal is to search for prompts that induce specific target behaviors (e.g., hallucinations, harmful responses) from a target language model. To navigate the exponentially large space of possible prompts, we train amortized investigator models to emulate the posterior distribution over the prompts, conditioned on the target behavior. Specifically, we first fit a reverse model and then use reinforcement learning to optimize the likelihood of generating the target behavior. To improve the diversity of the prompt distribution, we further propose a novel iterative training objective based on the Frank-Wolfe algorithm that encourages each iteration to discover different sets of prompts not captured by previous iterations. Our investigator models produce prompts that exhibit a variety of effective and human-interpretable strategies for behavior elicitation, obtaining a 100% attack success rate on AdvBench (Harmful Behaviors) and an 85% hallucination rate.
Paperid:2798
Authors:Leello Dadi · Volkan Cevher
Abstract:
We study the generalization of iterative noisy gradient schemes on smooth non-convex losses. Formally, we establish time-independent information-theoretic generalization bounds for Stochastic Gradient Langevin Dynamics (SGLD) that do not diverge as the iteration count increases. Our bounds are obtained through a stability argument: we analyze the difference between two SGLD sequences run in parallel on two datasets sampled from the same distribution. Our result only requires an isoperimetric inequality to hold, which is merely a restriction on the tails of the loss. Our work relaxes the assumptions of prior work to establish that the iterates stay within a bounded KL divergence from each other. Under an additional dissipativity assumption, we show that the stronger Renyi divergence also stays bounded by establishing a uniform log-Sobolev constant of the iterates. Without dissipativity, we sidestep the need for local log-Sobolev inequalities and instead exploit the regularizing properties of Gaussian convolution. These techniques allow us to show that strong convexity is not necessary for finite stability bounds. Our work shows that noisy SGD can have finite, iteration-independent generalization and differential privacy bounds in unbounded non-convex settings.
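The stability argument above couples two SGLD chains run in parallel on neighboring datasets. A minimal sketch of that coupling (toy quadratic losses and all variable names are ours, not the paper's): sharing the injected Gaussian noise between the two chains makes their gap contract to a small, iteration-independent value.

```python
import numpy as np

def sgld_step(w, grad_fn, step, noise):
    # One SGLD update: a gradient step plus injected Gaussian noise
    # whose variance is 2 * step (the Langevin term).
    return w - step * grad_fn(w) + np.sqrt(2.0 * step) * noise

# Two chains on "neighboring" losses (one sample perturbed), driven by
# the SAME noise, as in stability-style couplings.
rng = np.random.default_rng(0)
grad_a = lambda w: w            # gradient of 0.5 * ||w||^2
grad_b = lambda w: w - 0.01     # gradient of the perturbed loss
wa = wb = np.zeros(3)
for _ in range(1000):
    z = rng.normal(size=3)
    wa = sgld_step(wa, grad_a, 0.1, z)
    wb = sgld_step(wb, grad_b, 0.1, z)
gap = np.abs(wa - wb).max()
print(gap)  # contracts toward 0.01, independent of iteration count
```

Because the noise cancels in the difference of the chains, the gap obeys a deterministic contraction and stays bounded no matter how long the chains run, mirroring the time-independent bounds claimed above.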
Paperid:2799
Authors:Jiarui Lu · Xiaoyin Chen · Stephen Lu · Aurelie Lozano · Vijil Chenthamarakshan · Payel Das · Jian Tang
Abstract:
Protein dynamics play a crucial role in protein biological functions and properties, and their traditional study typically relies on time-consuming molecular dynamics (MD) simulations conducted in silico. Recent advances in generative modeling, particularly denoising diffusion models, have enabled efficient and accurate protein structure prediction and conformation sampling by learning distributions over crystallographic structures. However, effectively integrating physical supervision into these data-driven approaches remains challenging, as standard energy-based objectives often lead to intractable optimization. In this paper, we introduce Energy-based Alignment (EBA), a method that aligns generative models with feedback from physical models, efficiently calibrating them to appropriately balance conformational states based on their energy differences. Experimental results on the MD ensemble benchmark demonstrate that EBA achieves state-of-the-art performance in generating high-quality protein ensembles. By improving the physical plausibility of generated structures, our approach enhances model predictions and holds promise for applications in structural biology and drug discovery.
Paperid:2800
Authors:Abdulrahman Diaa · Toluwani Aremu · Nils Lukas
Abstract:
Large Language Models (LLMs) can be misused to spread online spam and misinformation. Content watermarking deters misuse by hiding a message in generated outputs, enabling detection using a secret watermarking key. Robustness is a core security property, stating that evading detection requires (significant) degradation of the content’s quality. Many LLM watermarking methods have been proposed, but robustness is tested only against non-adaptive attackers who lack knowledge of the provider’s watermarking method and can find only suboptimal attacks. We formulate the robustness of LLM watermarking as an objective function and use preference-based optimization to tune adaptive attacks against the specific watermarking method. Our evaluation shows that (i) adaptive attacks evade detection against all surveyed watermarking methods; (ii) even in a non-adaptive setting, attacks optimized adaptively against known watermarks remain effective when tested on unseen watermarks; and (iii) optimization-based attacks are scalable, requiring limited computational resources of less than seven GPU hours. Our findings underscore the need to test robustness against adaptive attacks.
Paperid:2801
Authors:Yafu Li · Xuyang Hu · Xiaoye Qu · Linjie Li · Yu Cheng
Abstract:
Large language models (LLMs) have demonstrated impressive performance but often lack the flexibility to adapt to human preferences quickly without retraining. Inspired by recent efforts on test-time scaling, we make the first attempt to propose Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference, eliminating the need to update model parameters. Instead of relying on purely numerical rewards, TPO translates reward signals into \emph{textual} critiques and uses them as textual rewards to iteratively refine its response. Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics reveal that TPO progressively improves alignment with human preferences. Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass the aligned counterpart, Llama-3.1-70B-Instruct. Furthermore, TPO scales efficiently with both the search width and depth of the inference process. Through case studies, we illustrate how TPO exploits the innate capacity of LLMs to interpret and act upon reward signals. Our findings establish TPO as a practical, lightweight alternative for test-time preference optimization, achieving alignment on the fly.
Paperid:2802
Authors:Tiansheng Huang · Gautam Bhattacharya · Pratik Joshi · Joshua Kimball · Ling Liu
Abstract:
Safety-aligned Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks \cite{qi2023fine} -- a few harmful data points mixed into the fine-tuning dataset can break the LLM's safety alignment. While several defenses have been proposed, our evaluation shows that existing defenses fail \textit{when some specific training hyper-parameters are chosen} -- a large learning rate or a large number of training epochs in the fine-tuning stage can easily invalidate the defense. To this end, we propose Antidote, a post-fine-tuning-stage solution, which remains \textbf{\textit{agnostic to the training hyper-parameters in the fine-tuning stage}}. Antidote relies on the philosophy that by removing the harmful parameters, the harmful model can be recovered from the harmful behaviors, regardless of how those harmful parameters are formed in the fine-tuning stage. With this philosophy, we introduce a one-shot pruning stage after harmful fine-tuning to remove the harmful weights that are responsible for the generation of harmful content. Despite its embarrassing simplicity, empirical results show that Antidote can reduce the harmful score while maintaining accuracy on downstream tasks. Our code is attached in the supplementary materials.
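As a rough illustration of the one-shot, hyper-parameter-agnostic pruning idea, the sketch below zeros out the weights that moved most during harmful fine-tuning. The selection criterion here is entirely hypothetical; the abstract does not specify how Antidote identifies harmful weights, and all names are ours.

```python
import numpy as np

def one_shot_prune(w_aligned, w_harmful, prune_frac):
    # Hypothetical criterion (not the paper's exact rule): zero out the
    # prune_frac fraction of weights with the largest movement between
    # the aligned and the harmfully fine-tuned model. The rule depends
    # only on the two checkpoints, not on the fine-tuning learning rate
    # or epoch count.
    delta = np.abs(w_harmful - w_aligned).ravel()
    k = max(1, int(prune_frac * delta.size))
    idx = np.argpartition(delta, -k)[-k:]   # indices of the k largest moves
    w_out = w_harmful.ravel().copy()
    w_out[idx] = 0.0
    return w_out.reshape(w_harmful.shape)

w0 = np.zeros(10)                          # toy "aligned" weights
w1 = np.array([0.0] * 8 + [5.0, -4.0])     # two weights drifted far
w2 = one_shot_prune(w0, w1, 0.2)           # prunes exactly those two
print(w2)
```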
Paperid:2803
Authors:Kei Sen Fong · Mehul Motani
Abstract:
In this work, we propose a novel approach that combines the strengths of FEAT (Feature Engineering Automation Tool) and TabNet (a deep learning model for tabular data) through knowledge distillation (KD), which we term FEAT-KD. FEAT is an intrinsically interpretable machine learning algorithm that constructs a weighted linear combination of concisely-represented features discovered via genetic programming. FEAT is a first-class algorithm for real-world applications, such as healthcare, that value intrinsic `white-box' explainability over post-hoc explanations. However, genetic programming in FEAT can be grossly inefficient, motivating the integration of deep learning techniques to enhance its optimization capabilities. To address this limitation, FEAT-KD leverages TabNet’s deep-learning-based optimization and feature selection mechanisms. FEAT-KD finds a weighted linear combination of concisely-represented, symbolic features that are derived from piece-wise distillation of a trained TabNet model. We analyze FEAT-KD on regression tasks from two perspectives: (i) compared to TabNet, FEAT-KD significantly reduces model complexity while retaining competitive predictive performance, effectively converting a black-box deep learning model into a more interpretable white-box representation; (ii) compared to FEAT, our method consistently outperforms in prediction accuracy, produces more compact models, and reduces the complexity of learned symbolic expressions. In addition, we demonstrate that FEAT-KD easily supports multi-target regression, in which the shared features contribute to the interpretability of the system. Our results suggest that FEAT-KD is a promising direction for interpretable machine learning, bridging the gap between deep learning’s predictive power and the intrinsic transparency of symbolic models.
Paperid:2804
Authors:Yikang Chen · Dehui du
Abstract:
This paper investigates $\sim_{\mathcal{L}_3}$-identifiability, a form of complete counterfactual identifiability within the Pearl Causal Hierarchy (PCH) framework, ensuring that all Structural Causal Models (SCMs) satisfying the given assumptions provide consistent answers to all causal questions. To simplify this problem, we introduce exogenous isomorphism and propose $\sim_{\mathrm{EI}}$-identifiability based on this concept, demonstrating that full recovery of exogenous variables in SCMs is not required to achieve $\sim_{\mathcal{L}_3}$-consistency. We explore sufficient assumptions for achieving $\sim_{\mathrm{EI}}$-identifiability in two special classes of SCMs: Bijective SCMs (BSCMs), based on counterfactual transport, and Triangular Monotonic SCMs (TM-SCMs), which extend $\sim_{\mathcal{L}_2}$-identifiability. Our results unify and generalize existing theories, providing theoretical guarantees for practical applications. Finally, we leverage neural TM-SCMs to address the consistency problem in counterfactual reasoning, with experiments validating both the effectiveness of our method and the correctness of the theory.
Paperid:2805
Authors:Wenshuai Zhao · Zhiyuan Li · Joni Pajarinen
Abstract:
The number of agents can be an effective curriculum variable for controlling the difficulty of multi-agent reinforcement learning (MARL) tasks. Existing work typically uses manually defined curricula such as linear schemes. We identify two potential flaws when applying existing reward-based automatic curriculum learning methods in MARL: (1) the expected episode return used to measure task difficulty has high variance; (2) credit assignment difficulty can be exacerbated in tasks where increasing the number of agents yields higher returns, which is common in many MARL tasks. To address these issues, we propose to control the curriculum by using a TD-error-based learning progress measure and by letting the curriculum proceed from an initial context distribution to the final task-specific one. Since our approach maintains a distribution over the number of agents and measures learning progress rather than absolute performance, which often increases with the number of agents, we alleviate problem (2). Moreover, the learning progress measure naturally alleviates problem (1) by aggregating returns. In three challenging sparse-reward MARL benchmarks, our approach outperforms state-of-the-art baselines.
Paperid:2806
Authors:Weijian Luo · colin zhang · Debing Zhang · Zhengyang Geng
Abstract:
We propose Diff-Instruct (DI), a data-efficient post-training approach for one-step text-to-image generative models that improves their alignment with human preferences without requiring image data. Our method frames alignment as online reinforcement learning from human feedback (RLHF), which optimizes a human reward function while regularizing the generator to stay close to a reference diffusion process. Unlike traditional RLHF approaches, which rely on the KL divergence for regularization, we introduce a novel score-based divergence regularization that substantially improves performance. Although such a score-based RLHF objective seems intractable to optimize, we derive a strictly equivalent and tractable loss function whose gradient can be computed efficiently. Building upon this framework, we train DI-SDXL-1step, a one-step text-to-image model based on Stable Diffusion-XL (2.6B parameters), capable of generating 1024x1024-resolution images in a single step. The 2.6B DI-SDXL-1step model outperforms the 12B FLUX-dev model in ImageReward, PickScore, and CLIP score on the Parti prompts benchmark while using only 1.88% of the inference time. This result strongly supports the claim that, with proper post-training, a small one-step model can beat huge multi-step models. We will open-source our industry-ready model to the community.
Paperid:2807
Authors:Kevin Rojas · Yuchen Zhu · Sichen Zhu · Felix Ye · Molei Tao
Abstract:
Diffusion models have shown remarkable performance in generating unimodal data in various tasks, such as image, video, and text generation. By contrast, the joint generation of multimodal data with diffusion models is still in the early stages of exploration. Existing approaches rely heavily on external preprocessing protocols, such as VAEs, to harmonize varied data representations into a unified unimodal format. This process demands high accuracy from the encoders and decoders, which can be problematic for applications with little data. To lift this restriction, we propose a novel framework for building multimodal diffusion models on arbitrary state spaces, enabling the generation of coupled data across different modalities. By introducing an innovative decoupled noise schedule for each modality, we enable both unconditional and modality-conditioned generation within a single model. We empirically validate our approach on text-image generation and mixed-type tabular data synthesis, demonstrating that it achieves competitive performance.
Paperid:2808
Authors:Jade Garcia Bourrée · Augustin Godinot · Martijn de Vos · Milos Vujasinovic · Sayan Biswas · Gilles Tredan · Erwan Le Merrer · Anne-Marie Kermarrec
Abstract:
The rapid adoption of ML decision-making systems across products and services has led to a set of regulations on how such systems should behave and be built. Among all the technical challenges to enforcing these regulations, one crucial yet under-explored problem is the risk of manipulation while these systems are being audited for fairness. This manipulation occurs when a platform deliberately alters its answers to a regulator to pass an audit without modifying its answers to other users. In this paper, we introduce a novel approach to manipulation-proof auditing by taking into account the auditor's prior knowledge of the task solved by the platform. We first demonstrate that regulators must not rely on public priors (e.g., a public dataset), as platforms could easily fool the auditor in such cases. We then formally establish the conditions under which an auditor can prevent audit manipulations using prior knowledge about the ground truth. Finally, our experiments with two standard datasets exemplify the maximum level of unfairness a platform can hide before being detected as malicious. Our formalization and generalization of manipulation-proof auditing with a prior opens up new research directions for more robust fairness audits.
Paperid:2809
Authors:Mohammadreza Daneshvaramoli · Helia Karisani · Adam Lechowicz · Bo Sun · Cameron Musco · Mohammad Hajiesmaili
Abstract:
This paper introduces a family of learning-augmented algorithms for online knapsack problems that achieve near Pareto-optimal consistency-robustness trade-offs through a simple combination of trusted learning-augmented and worst-case algorithms. Our approach relies on succinct, practical predictions: single values or intervals estimating the minimum value of any item in an offline solution. Additionally, we propose a novel fractional-to-integral conversion procedure, offering new insights for online algorithm design.
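For intuition, the worst-case component that such learning-augmented schemes fall back on is often a threshold rule. The sketch below shows the classical density threshold for online knapsack with value densities known to lie in [L, U]; the paper's actual combination with a prediction of the minimum offline item value is not reproduced here, and the function names are ours.

```python
import math

def threshold(z, L, U):
    # Classical worst-case threshold for online knapsack with value
    # densities in [L, U]: accept an item only when its density exceeds
    # phi(z), where z is the fraction of capacity already committed.
    # phi(0) = L/e and phi(1) = U, so the bar rises as capacity fills.
    return (U * math.e / L) ** z * (L / math.e)

def online_knapsack(items, L, U):
    # items: (value, weight) pairs arriving online, unit total capacity.
    # Greedy sketch of the worst-case algorithm only.
    z, total = 0.0, 0.0
    for v, w in items:
        if z + w <= 1.0 and v / w >= threshold(z, L, U):
            z += w
            total += v
    return total

# A dense item clears the low initial bar; a sparse one later does not.
print(online_knapsack([(0.5, 0.5), (0.1, 0.5)], L=1.0, U=10.0))
```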
Paperid:2810
Authors:Aniket Vashishtha · Abhinav Kumar · Atharva Pandey · Abbavaram Gowtham Reddy · Kabir Ahuja · Vineeth N Balasubramanian · Amit Sharma
Abstract:
For text-based AI systems to interact in the real world, causal reasoning is an essential skill. Since interventional data is costly to generate, we study to what extent an agent can learn causal reasoning from passive data. Specifically, we consider an axiomatic training setup where an agent learns from multiple demonstrations of a causal axiom (or rule), rather than incorporating the axiom as an inductive bias or inferring it from data values. A key question is whether the agent would learn to generalize from the axiom demonstrations to new scenarios. For example, if a transformer model is trained on demonstrations of the causal transitivity axiom over small graphs, would it generalize to applying the transitivity axiom over large graphs? Our results, based on a novel axiomatic training scheme, indicate that such generalization is possible. We consider the task of inferring whether a variable causes another variable, given a causal graph structure. We find that a 67 million parameter transformer model, when trained on linear causal chains (along with some noisy variations), can generalize well to new kinds of graphs, including longer causal chains, causal chains with reversed order, and graphs with branching, even when it is not explicitly trained for such settings. Our model performs on par with (or even better than) many larger language models such as GPT-4, Gemini Pro, and Phi-3. Overall, our axiomatic training framework provides a new paradigm of learning causal reasoning from passive data that can be used to learn arbitrary axioms, as long as sufficient demonstrations can be generated.
Paperid:2811
Authors:Yuhuan Yang · Chaofan Ma · Zhenjie Mao · Jiangchao Yao · Ya Zhang · Yanfeng Wang
Abstract:
Video understanding is a complex challenge that requires effective modeling of spatial-temporal dynamics. With the success of image foundation models (IFMs) in image understanding, recent approaches have explored parameter-efficient fine-tuning (PEFT) to adapt IFMs for video. However, most of these methods tend to process spatial and temporal information separately, which may fail to capture the full intricacy of video dynamics. In this paper, we propose MoMa, an efficient adapter framework that achieves full spatial-temporal modeling by integrating Mamba's selective state space modeling into IFMs. We propose a novel SeqMod operation to inject spatial-temporal information into pre-trained IFMs, without disrupting their original features. By incorporating SeqMod into a Divide-and-Modulate architecture, MoMa enhances video understanding while maintaining computational efficiency. Extensive experiments on multiple video benchmarks demonstrate the effectiveness of MoMa, achieving superior performance with reduced computational cost. Codes will be released upon publication.
Paperid:2812
Authors:Ruizhe Chen · Dongyu Xue · Xiangxin Zhou · Zaixiang Zheng · xiangxiang Zeng · Quanquan Gu
Abstract:
Research on single-chain protein modeling has been extensively and deeply explored, with advancements seen in models like the ESM series and AlphaFold. Despite these developments, the study and modeling of multi-chain proteins remain largely uncharted, though they are vital for understanding biological functions. Proteins typically exist in complexes, interacting with other proteins to perform their specific biological roles. Recognizing the importance of these interactions, we introduce APM (All-Atom Protein Generative Model), a model specifically designed for modeling multi-chain proteins. By integrating atom-level information and leveraging data on multi-chain proteins, APM is capable of precisely modeling inter-chain interactions and designing protein complexes with binding capabilities from scratch. It also performs folding and inverse-folding tasks for multi-chain proteins. Moreover, APM demonstrates versatility in downstream applications: it can be supervised fine-tuned (SFT) for enhanced performance while also supporting zero-shot sampling in certain tasks, achieving state-of-the-art results.
Paperid:2813
Authors:Junlong Ke · Yazhou Ren · Zichen Wen · Tianyi Wu · Yang Yang · Xiaorong Pu · Lifang He
Abstract:
Multi-view clustering has gained significant attention for integrating multi-view information in multimedia applications. With the growing complexity of graph data, multi-view graph clustering (MVGC) has become increasingly important. Existing methods primarily use Graph Neural Networks (GNNs) to encode structural and feature information, but applying GNNs within contrastive learning poses specific challenges, such as integrating graph data with node features and handling both homophilic and heterophilic graphs. To address these challenges, this paper introduces Node-Guided Contrastive Encoding (NGCE), a novel MVGC approach that leverages node features to guide embedding generation. NGCE enhances compatibility with GNN filtering, effectively integrates homophilic and heterophilic information, and strengthens contrastive learning across views. Extensive experiments demonstrate its robust performance on six homophilic and heterophilic multi-view benchmark datasets.
Paperid:2814
Authors:Yujun Huang · Bin Chen · Niu Lian · Xin Wang · Baoyi An · Tao Dai · Shutao Xia
Abstract:
Existing multi-view image compression methods often rely on 2D projection-based similarities between views to estimate disparities. While effective for small disparities, such as those in stereo images, these methods struggle with the more complex disparities encountered in wide-baseline multi-camera systems, commonly found in virtual reality and autonomous driving applications. To address this limitation, we propose 3D-LMVIC, a novel learning-based multi-view image compression framework that leverages 3D Gaussian Splatting to derive geometric priors for accurate disparity estimation. Furthermore, we introduce a depth map compression model to minimize geometric redundancy across views, along with a multi-view sequence ordering strategy based on a defined distance measure between views to enhance correlations between adjacent views. Experimental results demonstrate that 3D-LMVIC achieves superior performance compared to both traditional and learning-based methods. Additionally, it significantly improves disparity estimation accuracy over existing two-view approaches.
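The sequence-ordering idea can be sketched as a greedy nearest-neighbor pass over pairwise view distances, so that adjacent views in the coding order are maximally correlated. The actual distance measure defined in the paper is not reproduced here; this is a generic illustration with hypothetical names.

```python
def order_views(dist, start=0):
    # Greedy nearest-neighbor ordering: repeatedly append the remaining
    # view closest to the most recently added one, so consecutive views
    # in the coding order are similar and easy to predict from each other.
    n = len(dist)
    order = [start]
    remaining = set(range(n)) - {start}
    while remaining:
        last = order[-1]
        nxt = min(remaining, key=lambda j: dist[last][j])
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Toy symmetric distance matrix for four camera views.
D = [[0, 1, 4, 6],
     [1, 0, 2, 5],
     [4, 2, 0, 3],
     [6, 5, 3, 0]]
print(order_views(D))  # -> [0, 1, 2, 3]
```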
Paperid:2815
Authors:Guangzheng Hu · Feng Liu · Mingming Gong · Guanghui Wang · Liuhua Peng
Abstract:
Data imbalance is a common factor hindering classifier performance. Data-level approaches for imbalanced learning, such as resampling, often lead to information loss or generative errors. Building on theoretical studies of the imbalance ratio in binary classification, we find that adding suitable label noise can adjust biased decision boundaries and improve classifier performance. This paper proposes the Label-Noise-based Re-balancing (LNR) approach to imbalanced learning, employing a novel design of an asymmetric label noise model. In contrast to other data-level methods, LNR alleviates the issues of information loss and generative errors and can be integrated seamlessly with any classifier or algorithm-level method. We validated the superiority of LNR on synthetic and real-world datasets. Our work opens a new avenue for imbalanced learning, highlighting the potential of beneficial label noise.
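A toy sketch of the underlying idea (our simplification, not the paper's exact asymmetric noise model): flipping a fraction of majority-class labels toward the minority class shifts a classifier's decision boundary back toward the majority region, without discarding or synthesizing samples.

```python
import numpy as np

def asymmetric_label_noise(y, flip_rate, majority=0, minority=1, rng=None):
    # Illustrative asymmetric noise: flip flip_rate of the majority-class
    # labels to the minority class, leaving minority labels untouched.
    # Unlike resampling, every feature vector is kept (no information
    # loss) and nothing is generated (no generative errors).
    rng = np.random.default_rng(rng)
    y = np.asarray(y).copy()
    maj_idx = np.flatnonzero(y == majority)
    n_flip = int(flip_rate * maj_idx.size)
    flipped = rng.choice(maj_idx, size=n_flip, replace=False)
    y[flipped] = minority
    return y

y = np.array([0] * 90 + [1] * 10)            # 9:1 imbalance
y_noisy = asymmetric_label_noise(y, 0.2, rng=0)
print((y_noisy == 1).sum())                   # 10 + 18 flipped = 28
```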
Paperid:2816
Authors:Shiming Chen · Dingjie Fu · Salman Khan · Fahad Khan
Abstract:
Remarkable progress in zero-shot learning (ZSL) has been achieved using generative models. However, existing generative ZSL methods merely generate (imagine) the visual features from scratch guided by the strong class semantic vectors annotated by experts, resulting in suboptimal generative performance and limited scene generalization. To address these and advance ZSL, we propose an inductive variational autoencoder for generative zero-shot learning, dubbed GenZSL. Mimicking human-level concept learning, GenZSL operates by inducting new class samples from similar seen classes using weak class semantic vectors derived from target class names (i.e., CLIP text embedding). To ensure the generation of informative samples for training an effective ZSL classifier, our GenZSL incorporates two key strategies. Firstly, it employs class diversity promotion to enhance the diversity of class semantic vectors. Secondly, it utilizes target class-guided information boosting criteria to optimize the model. Extensive experiments conducted on three popular benchmark datasets showcase the superiority and potential of our GenZSL with significant efficacy and efficiency over f-VAEGAN, e.g., 24.7\% performance gains and more than $60\times$ faster training speed on AWA2.
Paperid:2817
Authors:Faisal Hamman · Sachindra P Dissanayake · Saumitra Mishra · Freddy Lecue · Sanghamitra Dutta
Abstract:
Fine-tuning LLMs on tabular classification tasks can lead to the phenomenon of \emph{fine-tuning multiplicity}, where equally well-performing models make conflicting predictions on the same input. Fine-tuning multiplicity can arise due to variations in the training process, e.g., seed, weight initialization, or minor changes to the training data, raising concerns about the reliability of Tabular LLMs in high-stakes applications such as finance, hiring, education, and healthcare. Our work formalizes this unique challenge of fine-tuning multiplicity in Tabular LLMs and proposes a novel measure to quantify the stability of individual predictions without expensive model retraining. Our measure quantifies a prediction's stability by analyzing (sampling) the model's local behavior around that input in the embedding space. Interestingly, we show that sampling in the local neighborhood can be leveraged to provide probabilistic guarantees on prediction robustness under a broad class of fine-tuned models, i.e., inputs with sufficiently high stability (as defined by our measure) also remain robust across several fine-tuned models with high probability. We perform experiments on multiple real-world datasets to show that our stability measure preemptively captures robustness under actual multiplicity across several fine-tuned models, outperforming competing measures.
Paperid:2818
Authors:Baohao Liao · Yuhui Xu · Hanze Dong · Junnan Li · Christof Monz · Silvio Savarese · Doyen Sahoo · Caiming Xiong
Abstract:
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD synergistically combines a lightweight draft model with a more powerful target model, incorporating a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD employs a process reward model to evaluate intermediate decoding steps and dynamically decide whether to invoke the target model, optimizing the trade-off between computational cost and output quality. We theoretically demonstrate that a threshold-based mixture strategy achieves an optimal balance between resource utilization and performance. Extensive evaluations on challenging reasoning benchmarks, including Olympiad-level tasks, show that RSD delivers significant efficiency gains over decoding with the target model only (up to 4.4X fewer FLOPs), while achieving significantly better accuracy than the parallel decoding method on average (up to +3.5). These results highlight RSD as a robust and cost-effective approach for deploying LLMs in resource-intensive scenarios.
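The threshold-based mixture described above can be sketched in a few lines: keep a draft step whenever the process reward model scores it above a threshold, and fall back to the expensive target model otherwise. All names below are ours, and the toy reward/target functions only stand in for the real models.

```python
def rsd_decode(draft_steps, reward_model, target_model, tau):
    # Threshold-based mixture sketch: accept a draft step when the
    # process reward model scores it at least tau; otherwise invoke
    # the target model for that step. Counting target calls shows the
    # compute saved relative to always decoding with the target.
    out, target_calls = [], 0
    for step in draft_steps:
        if reward_model(step) >= tau:
            out.append(step)            # cheap draft step accepted
        else:
            out.append(target_model(step))
            target_calls += 1           # expensive fallback
    return out, target_calls

steps = ["a", "bb", "c", "dddd"]
reward = len                # toy reward: longer step scores higher
target = str.upper          # toy stand-in for the target model
out, calls = rsd_decode(steps, reward, target, tau=2)
print(out, calls)           # ['A', 'bb', 'C', 'dddd'] 2
```

Raising tau trades compute for quality: at tau = 0 the target model is never called, while a very large tau recovers target-only decoding.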
Paperid:2819
Authors:Jerome Garnier-Brun · Marc Mezard · Emanuele Moscato · Luca Saglietti
Abstract:
Understanding the learning process and the embedded computation in transformers is becoming a central goal for the development of interpretable AI. In the present study, we introduce a hierarchical filtering procedure for data models of sequences on trees, allowing us to hand-tune the range of positional correlations in the data. Leveraging this controlled setting, we provide evidence that vanilla encoder-only transformers can approximate the exact inference algorithm when trained on root classification and masked language modeling tasks, and study how this computation is discovered and implemented. We find that correlations at larger distances, corresponding to increasing layers of the hierarchy, are sequentially included by the network during training. By comparing attention maps from models trained with varying degrees of filtering and by probing the different encoder levels, we find clear evidence of a reconstruction of correlations on successive length scales corresponding to the various levels of the hierarchy, which we relate to a plausible implementation of the exact inference algorithm within the same architecture.
Paperid:2820
Authors:Guiliang Liu · Yueci Deng · Runyi Zhao · Huayi Zhou · Jian Chen · Jietao Chen · Ruiyan Xu · Yunxin Tai · Kui Jia
Abstract:
A critical prerequisite for achieving generalizable robot control is the availability of a large-scale robot training dataset. Due to the expense of collecting realistic robotic data, recent studies explored simulating and recording robot skills in virtual environments. While simulated data can be generated at higher speeds, lower costs, and larger scales, the applicability of such simulated data remains questionable due to the gap between simulated and realistic environments. To advance the Sim2Real generalization, in this study, we present DexSim, a data engine designed to perform automatic skills simulation and scaling for learning deployable robot manipulation policies. Specifically, DexSim ensures the usability of simulated skills by integrating diverse forms of realistic data into the simulated environment, preserving semantic alignment with the target tasks. For each simulated skill in the environment, DexSim facilitates effective Sim2Real data scaling by automating the process of domain randomization and adaptation. Tuned by the scaled dataset, the control policy achieves zero-shot Sim2Real generalization across diverse tasks, multiple robot embodiments, and widely studied policy model architectures, highlighting its importance in advancing Sim2Real embodied intelligence.
Paperid:2821
Authors:Annie Marsden · Evan Dogariu · Naman Agarwal · Xinyi Chen · Daniel Suo · Elad Hazan
Abstract:
We consider the problem of length generalization in sequence prediction. We define a new metric of performance in this setting – the Asymmetric-Regret – which measures regret against a benchmark predictor with longer context length than available to the learner. We continue by studying this concept through the lens of the spectral filtering algorithm. We present a gradient-based learning algorithm that provably achieves length generalization for linear dynamical systems. We conclude with proof-of-concept experiments which are consistent with our theory.
Paperid:2822
Authors:Shangbin Feng · Zifeng Wang · Yike Wang · Sayna Ebrahimi · Hamid Palangi · Lesly Miculicich · Achin Kulshrestha · Nathalie Rauschmayr · Yejin Choi · Yulia Tsvetkov · Chen-Yu Lee · Tomas Pfister
Abstract:
We propose Model Swarms, a collaborative search algorithm to adapt LLMs via swarm intelligence, the collective behavior guiding individual systems. Specifically, Model Swarms starts with a pool of LLM experts and a utility function. Guided by the best-found checkpoints across models, diverse LLM experts collaboratively move in the weight space and optimize a utility function representing model adaptation objectives. Compared to existing model composition approaches, Model Swarms offers tuning-free model adaptation, works in low-data regimes with as few as 200 examples, and does not require assumptions about specific experts in the swarm or how they should be composed. Extensive experiments demonstrate that Model Swarms could flexibly adapt LLM experts to a single task, multi-task domains, reward models, as well as diverse human interests, improving over 12 model composition baselines by up to 21.0% across tasks and contexts. Further analysis reveals that LLM experts discover previously unseen capabilities in initial checkpoints and that Model Swarms enable the weak-to-strong transition of experts through the collaborative search process.
Paperid:2823
Authors:Ruoxin Yuan · Guanhua Fang
Abstract:
This work introduces Residual TPP, a novel, unified, and lightweight approach for analyzing event stream data. It leverages the strengths of both simple statistical TPPs and expressive neural TPPs to achieve superior performance. Specifically, we propose the Residual Events Decomposition (RED) technique in temporal point processes, which defines a weight function to quantify how well the intensity function captures the event characteristics. The RED serves as a flexible, plug-and-play module that can be integrated with any TPP model in a wide range of tasks. It enables the identification of events for which the intensity function provides a poor fit, referred to as residual events. By combining RED with a Hawkes process, we capture the self-exciting nature of the data and identify residual events. Then an arbitrary neural TPP is employed to take care of residual events. Extensive experimental results demonstrate that Residual TPP consistently achieves state-of-the-art goodness-of-fit and prediction performance in multiple domains and offers significant computational advantages as well.
Paperid:2824
Authors:Vincent-Pierre Berges · Barlas Oğuz · Daniel HAZIZA · Scott Yih · Luke Zettlemoyer · Gargi Ghosh
Abstract:
Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparsely activated memory layers complement compute-heavy dense feed-forward layers, providing dedicated capacity to store and retrieve information cheaply. This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale. On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the computation budget, as well as mixture-of-expert models when matched for both compute and parameters. We find gains are especially pronounced for factual tasks. We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, comparing to base models with up to 8B parameters.
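A minimal, non-trainable sketch of the key-value lookup such a layer performs: score all keys against the input, keep the top-k, and mix their values. The paper's layer learns the keys and values and runs fully parallelized; the shapes and interface here are illustrative assumptions:

```python
import numpy as np

def memory_lookup(x, keys, values, k=2):
    """Sparse memory lookup: only the top-k keys are activated, so the
    number of memory parameters can grow without increasing per-token FLOPs
    proportionally."""
    scores = keys @ x                       # (num_keys,) similarity scores
    top = np.argpartition(scores, -k)[-k:]  # indices of the k best keys
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                            # softmax over the selected keys
    return w @ values[top]                  # (value_dim,) retrieved content
```

Only k rows of `values` are touched per query, which is the source of the cheap storage/retrieval the abstract describes.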
Paperid:2825
Authors:Yifei Xu · Tusher Chakraborty · Emre Kiciman · Srinagesh Sharma · Bibek Aryal · Songwu Lu · Ranveer Chandra
Abstract:
Fine-tuning large language models (LLMs) to align with user preferences is challenging due to the high cost of quality human annotations in Reinforcement Learning from Human Feedback (RLHF) and the generalizability limitations of AI Feedback. To address these challenges, we propose Sargy, a human-AI hybrid framework that combines LLM-based initial alignment with selective human annotations to achieve near-human annotation quality with minimal effort. Sargy identifies hard-to-annotate samples mislabeled by LLMs using a reward model’s reward distribution and iteratively enhances data quality by integrating strategic human corrections while leveraging the LLM's correctly labeled samples. Evaluations on HH-RLHF and TL;DR datasets show that Sargy reaches oracle-level alignment with only 15–20\% of the human annotation effort. Moreover, models trained on Sargy's filtered datasets for downstream tasks achieve performance on par with oracle models.
Paperid:2826
Authors:Qiwei Di · Jiafan He · Quanquan Gu
Abstract:
Learning from human feedback plays an important role in aligning generative models, such as large language models (LLMs). However, the effectiveness of this approach can be influenced by adversaries, who may intentionally provide misleading preferences to manipulate the output in an undesirable or harmful direction. To tackle this challenge, we study a specific model within this problem domain: contextual dueling bandits with adversarial feedback, where the true preference label can be flipped by an adversary. We propose an algorithm, robust contextual dueling bandits (\algo), which is based on uncertainty-weighted maximum likelihood estimation. Our algorithm achieves an $\tilde O(d\sqrt{T}/\kappa+dC/\kappa)$ regret bound, where $T$ is the number of rounds, $d$ is the dimension of the context, $\kappa$ is the lower bound of the derivative of the link function, and $0 \le C \le T$ is the total amount of adversarial feedback. We also prove a lower bound to show that our regret bound is nearly optimal, both in scenarios with and without ($C=0$) adversarial feedback. Our work is the first to achieve nearly minimax optimal regret for dueling bandits in the presence of adversarial preference feedback. Additionally, for the sigmoid link function, we develop a novel algorithm that takes the effect of local derivatives into account in the maximum likelihood estimation (MLE) analysis through a refined method for estimating the link function's derivative. This method helps us eliminate the $\kappa$ dependence in the leading term with respect to $T$, which reduces the exponential dependence on the parameter radius $B$ to a polynomial dependence. We conduct experiments to evaluate our proposed algorithm \algo\ against various types of adversarial feedback. Experimental results demonstrate its superiority over the state-of-the-art dueling bandit algorithms in the presence of adversarial feedback.
Paperid:2827
Authors:Tian Bian · Yifan Niu · Chaohao Yuan · Chengzhi Piao · Bingzhe Wu · Long-Kai Huang · Yu Rong · Tingyang Xu · Hong Cheng · Jia Li
Abstract:
Circuit discovery has recently attracted attention as a potential research direction to explain the nontrivial behaviors of language models. It aims to find the computational subgraphs, also known as circuits, within the model that are responsible for solving specific tasks. However, most existing studies overlook the holistic nature of these circuits and require designing specific corrupted activations for different tasks, which is inaccurate and inefficient. In this work, we propose an end-to-end approach based on the principle of Information Bottleneck, called IBCircuit, to holistically identify informative circuits. In contrast to traditional causal interventions, IBCircuit is an optimization framework for holistic circuit discovery and can be applied to any given task without tediously corrupted activation design. In both the Indirect Object Identification (IOI) and Greater-Than tasks, IBCircuit identifies more faithful and minimal circuits in terms of critical node components and edge components compared to recent related work.
Paperid:2828
Authors:Haoming Jing · Yorie Nakahira
Abstract:
Many systems contain latent variables that make their dynamics partially unidentifiable or cause distribution shifts in the observed statistics between offline and online data. However, existing control techniques often assume access to complete dynamics or perfect simulators with fully observable states, which are necessary to verify whether the system remains within a safe set (forward invariance) or safe actions are consistently feasible at all times. To address this limitation, we propose a technique for designing probabilistic safety certificates for systems with latent variables. A key technical enabler is the formulation of invariance conditions in probability space, which can be constructed using observed statistics in the presence of distribution shifts due to latent variables. We use this invariance condition to construct a safety certificate that can be implemented efficiently in real-time control. The proposed safety certificate can continuously find feasible actions that control long-term risk to stay within tolerance. Stochastic safe control and (causal) reinforcement learning have been studied in isolation until now. To the best of our knowledge, the proposed work is the first to use causal reinforcement learning to quantify long-term risk for the design of safety certificates. This integration enables safety certificates to efficiently ensure long-term safety in the presence of latent variables. The effectiveness of the proposed safety certificate is demonstrated in numerical simulations.
Paperid:2829
Authors:HUAKUN LUO · Haixu Wu · Hang Zhou · Lanxiang Xing · Yichen Di · Jianmin Wang · Mingsheng Long
Abstract:
Although deep models have been widely explored in solving partial differential equations (PDEs), previous works are primarily limited to data with only up to tens of thousands of mesh points, far from the million-point scale required by industrial simulations that involve complex geometries. In the spirit of advancing neural PDE solvers to real industrial applications, we present Transolver++, a highly parallel and efficient neural solver that can accurately solve PDEs on million-scale geometries. Building upon previous advancements in solving PDEs by learning physical states via Transolver, Transolver++ is further equipped with an extremely optimized parallelism framework and a local adaptive mechanism to efficiently capture eidetic physical states from massive mesh points, successfully tackling the thorny challenges in computation and physics learning when scaling up input mesh size. Transolver++ increases the single-GPU input capacity to million-scale points for the first time and is capable of continuously scaling input size in linear complexity by increasing GPUs. Experimentally, Transolver++ yields a 13\% relative improvement across six standard PDE benchmarks and achieves over 20\% performance gain in million-scale high-fidelity industrial simulations, whose sizes are 100$\times$ larger than previous benchmarks, covering car and 3D aircraft designs.
Paperid:2830
Authors:Jusheng Zhang · Zimeng Huang · Yijia Fan · Ningyuan Liu · Mingyan Li · Zhuojie Yang · Jiawei Yao · Jian Wang · Keze Wang
Abstract:
As scaling large language models faces prohibitive costs, multi-agent systems emerge as a promising alternative, though challenged by static knowledge assumptions and coordination inefficiencies. We introduce Knowledge-Aware Bayesian Bandits (KABB), a novel framework that enhances multi-agent system coordination through semantic understanding and dynamic adaptation. The framework features three key innovations: a three-dimensional knowledge distance model for deep semantic understanding, a dual-adaptation mechanism for continuous expert optimization, and a knowledge-aware Thompson Sampling strategy for efficient expert selection. Extensive evaluation demonstrates KABB achieves an optimal cost-performance balance, maintaining high performance while keeping computational demands relatively low in multi-agent coordination.
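The expert-selection step can be illustrated with plain Beta-Bernoulli Thompson sampling; the paper's knowledge-aware version additionally biases the posteriors with a semantic knowledge-distance model, which is replaced here by a flat prior:

```python
import numpy as np

def thompson_select(success, failure, rng):
    """Sample a plausible success rate for each expert from its Beta
    posterior and pick the argmax (flat Beta(1,1) prior; KABB would
    shift these priors using knowledge distance)."""
    samples = rng.beta(success + 1, failure + 1)
    return int(np.argmax(samples))

def run_bandit(true_probs, rounds, seed=0):
    """Simulate expert selection against fixed per-expert success rates."""
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    success, failure = np.zeros(k), np.zeros(k)
    pulls = np.zeros(k, dtype=int)
    for _ in range(rounds):
        i = thompson_select(success, failure, rng)
        pulls[i] += 1
        if rng.random() < true_probs[i]:
            success[i] += 1
        else:
            failure[i] += 1
    return pulls
```

Over time the sampler concentrates its pulls on the strongest expert while still occasionally exploring the others.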
Paperid:2831
Authors:Willem Diepeveen · Georgios Batzolis · Zakhar Shumaylov · Carola-Bibiane Schönlieb
Abstract:
Data-driven Riemannian geometry has emerged as a powerful tool for interpretable representation learning, offering improved efficiency in downstream tasks. Moving forward, it is crucial to balance cheap manifold mappings with efficient training algorithms. In this work, we integrate concepts from pullback Riemannian geometry and generative models to propose a framework for data-driven Riemannian geometry that is scalable in both geometry and learning: score-based pullback Riemannian geometry. Focusing on unimodal distributions as a first step, we propose a score-based Riemannian structure with closed-form geodesics that pass through the data probability density. With this structure, we construct a Riemannian autoencoder (RAE) with error bounds for discovering the correct data manifold dimension. This framework can naturally be used with anisotropic normalizing flows by adopting isometry regularization during training. Through numerical experiments on diverse datasets, including image data, we demonstrate that the proposed framework produces high-quality geodesics passing through the data support, reliably estimates the intrinsic dimension of the data manifold, and provides a global chart of the manifold. To the best of our knowledge, this is the first scalable framework for extracting the complete geometry of the data manifold.
Paperid:2832
Authors:Shu Wei · Yanjie Li · Lina Yu · Min Wu · Linjun Sun · Weijun Li · Jingyi Liu · Hong Qin · Deng Yusong · Jufeng Han · Yan Pang
Abstract:
The pursuit of analytical solutions for differential equations has historically been limited by the need for extensive prior knowledge and mathematical prowess; however, machine learning methods like genetic algorithms have recently been applied to this end, albeit with issues of significant time consumption and complexity. This paper presents a novel machine learning-based solver, SSDE (Symbolic Solver for Differential Equations), which employs reinforcement learning to derive symbolic closed-form solutions for various differential equations. Our evaluations on a range of ordinary and partial differential equations demonstrate that SSDE provides superior performance in achieving analytical solutions compared to other machine learning approaches.
Paperid:2833
Authors:Dhruv Rohatgi · Tanya Marwah · Zachary Lipton · Jianfeng Lu · Ankur Moitra · Andrej Risteski
Abstract:
Graph neural networks (GNNs) are the dominant approach to solving machine learning problems defined over graphs. Despite much theoretical and empirical work in recent years, our understanding of finer-grained aspects of architectural design for GNNs remains impoverished. In this paper, we consider the benefits of architectures that maintain and update edge embeddings. On the theoretical front, under a suitable computational abstraction for a layer in the model, as well as memory constraints on the embeddings, we show that there are natural tasks on graphical models for which architectures leveraging edge embeddings can be much shallower. Our techniques are inspired by results on time-space tradeoffs in theoretical computer science. Empirically, we show architectures that maintain edge embeddings almost always improve on their node-based counterparts---frequently significantly so in topologies that have "hub" nodes.
Paperid:2834
Authors:Varun Babbar · Hayden McTavish · Cynthia Rudin · Margo Seltzer
Abstract:
Decision tree optimization is fundamental to interpretable machine learning. The most popular approach is to greedily search for the best feature at every decision point, which is fast but provably suboptimal. Recent approaches find the global optimum using branch and bound with dynamic programming, showing substantial improvements in accuracy and sparsity at great cost to scalability. An ideal solution would have the accuracy of an optimal method and the scalability of a greedy method. We introduce a family of algorithms called SPLIT (SParse Lookahead for Interpretable Trees) that moves us significantly forward in achieving this ideal balance. We demonstrate that not all subproblems need to be solved to optimality to find high quality trees; greediness suffices near the leaves. Since each depth adds an exponential number of possible trees, this change makes our algorithms orders of magnitude faster than existing optimal methods, with negligible loss in performance. We extend this algorithm to allow scalable computation of sets of near-optimal trees (i.e., the Rashomon set).
Paperid:2835
Authors:Jasper Dekoninck · Maximilian Baader · Martin Vechev
Abstract:
The availability of a wide range of large language models (LLMs) embedded in various agentic systems has significantly increased the potential of model selection strategies to improve the cost-performance tradeoff. Existing strategies involve either routing, where a single model is chosen per query, or cascading, which sequentially runs increasingly larger models until a satisfactory answer is found. However, current approaches face three key limitations: they (1) lack formal proofs of optimality, (2) fail to identify the conditions under which these strategies are most effective to improve the cost-performance tradeoff, and (3) are unable to combine both paradigms for further improvements. To address these issues, we first derive a novel optimal strategy for cascading and prove the optimality of an existing routing strategy. Further, we propose cascade routing, a unified framework that integrates routing and cascading into a theoretically optimal strategy. Through our analysis, we identify good quality estimators as the critical factor for the success of model selection paradigms. Finally, in our experiments, we show that cascade routing consistently outperforms the individual approaches by a large margin and we analyze quality estimators to determine when routing and/or cascading are useful paradigms for model selection.
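The plain cascading baseline the abstract contrasts against can be sketched in a few lines; `models` and `quality` are hypothetical interfaces for illustration, and cascade routing generalizes this by also choosing, per query, which models enter the cascade:

```python
def cascade(query, models, quality, threshold):
    """Run models cheapest-first and stop as soon as the estimated
    quality of the current answer clears the threshold. `models` is a
    list of (name, cost) pairs; `quality` is a per-model quality
    estimator for the query."""
    total_cost, chosen = 0.0, None
    for name, cost in models:
        total_cost += cost
        chosen = name
        if quality(name, query) >= threshold:
            break               # estimator says this answer is good enough
    return chosen, total_cost
```

This makes concrete why the quality estimator is the critical component: a miscalibrated `quality` either stops too early (cheap but wrong) or never stops (accurate but expensive).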
Paperid:2836
Authors:Shanchuan Lin · Xin Xia · Yuxi Ren · Ceyuan Yang · Xuefeng Xiao · Lu Jiang
Abstract:
Diffusion models are widely used for image and video generation, but their iterative generation process is slow and expensive. While existing distillation approaches have demonstrated the potential for one-step generation in the image domain, they still suffer from significant quality degradation. In this work, we propose Adversarial Post-Training (APT) against real data following diffusion pre-training for one-step video generation. To improve the training stability and quality, we introduce several improvements to the model architecture and training procedures, along with an approximated R1 regularization objective. Empirically, our experiments show that our adversarial post-trained model can generate two-second, 1280x720, 24fps videos in real-time using a single forward evaluation step. Additionally, our model is capable of generating 1024px images in a single step, achieving quality comparable to state-of-the-art methods.
Paperid:2837
Authors:Qinyu Zhao · Ming Xu · Kartik Gupta · Akshay Asthana · Liang Zheng · Stephen Gould
Abstract:
Evaluating large vision-language models (LVLMs) is very expensive, due to the high computational costs and the wide variety of tasks. The good news is that if we already have some observed performance scores, we may be able to infer unknown ones. In this study, we propose a new framework for predicting unknown performance scores based on observed ones from other LVLMs or tasks. We first formulate the performance prediction as a matrix completion task. Specifically, we construct a sparse performance matrix $\boldsymbol{R}$, where each entry $R_{mn}$ represents the performance score of the $m$-th model on the $n$-th dataset. By applying probabilistic matrix factorization (PMF) with Markov chain Monte Carlo (MCMC), we can complete the performance matrix, that is, predict unknown scores. Additionally, we estimate the uncertainty of performance prediction based on MCMC. Practitioners can evaluate their models on untested tasks with higher uncertainty first, quickly reducing errors in performance prediction. We further introduce several improvements to enhance PMF for scenarios with sparse observed performance scores. Our experiments demonstrate the accuracy of PMF in predicting unknown scores, the reliability of uncertainty estimates in ordering evaluations, and the effectiveness of our enhancements for handling sparse data.
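The matrix-completion step can be sketched with a MAP (gradient-descent) variant of PMF; the paper uses full MCMC, which additionally yields the uncertainty estimates. The rank, learning rate, and regularization below are illustrative assumptions:

```python
import numpy as np

def pmf_complete(R, mask, rank=2, lam=0.1, lr=0.02, iters=3000, seed=0):
    """Complete a sparse score matrix R (models x tasks) by fitting
    low-rank factors U, V with gradient descent on the observed entries
    only. `mask` is 1 where a score was observed, 0 otherwise."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = 0.1 * rng.standard_normal((m, rank))
    V = 0.1 * rng.standard_normal((n, rank))
    for _ in range(iters):
        E = mask * (U @ V.T - R)           # error on observed entries only
        U -= lr * (E @ V + lam * U)
        V -= lr * (E.T @ U + lam * V)
    return U @ V.T                          # predicted full score matrix
```

The unobserved entries of the returned matrix are the predicted scores, recovered from the low-rank structure shared across models and tasks.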
Paperid:2838
Authors:Muskan Dosi · Chiranjeev Chiranjeev · Kartik Thakral · Mayank Vatsa · Richa Singh
Abstract:
Do contemporary diffusion models preserve the class geometry of hyperspherical data? Standard diffusion models rely on isotropic Gaussian noise in the forward process, inherently favoring Euclidean spaces. However, many real-world problems involve non-Euclidean distributions, such as hyperspherical manifolds, where class-specific patterns are governed by angular geometry within hypercones. When modeled in Euclidean space, these angular subtleties are lost, leading to suboptimal generative performance. To address this limitation, we introduce directional noise aligned with hyperspherical structures, preserving class geometry and effectively capturing angular uncertainty. Both theoretically and empirically, we demonstrate that this approach aligns the generative process with the intrinsic geometry of hyperspherical data, resulting in more accurate and geometry-aware generative models. We evaluate our framework on four object datasets and two face datasets, showing that incorporating angular uncertainty better preserves the underlying hyperspherical manifold. Resources are available at: \href{https://anonymousillusion.github.io/anon-icml/}{Anonymous Link.}
Paperid:2839
Authors:Yunbei Zhang · Akshay Mehra · Shuaicheng Niu · Jihun Hamm
Abstract:
Continual Test-Time Adaptation (CTTA) seeks to adapt source pre-trained models to continually changing, unseen target domains. While existing CTTA methods assume structured domain changes with uniform durations, real-world environments often exhibit dynamic patterns where domains recur with varying frequencies and durations. Current approaches, which adapt the same parameters across different domains, struggle in such dynamic conditions—they face convergence issues with brief domain exposures, risk forgetting previously learned knowledge, or misapplying it to irrelevant domains. To remedy this, we propose DPCore, a method designed for robust performance across diverse domain change patterns while ensuring computational efficiency. DPCore integrates three key components: Visual Prompt Adaptation for efficient domain alignment, a Prompt Coreset for knowledge preservation, and a Dynamic Update mechanism that intelligently adjusts existing prompts for similar domains while creating new ones for substantially different domains. Extensive experiments on four benchmarks demonstrate that DPCore consistently outperforms various CTTA methods, achieving state-of-the-art performance in both structured and dynamic settings while reducing trainable parameters by 99\% and computation time by 64\% compared to previous approaches.
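The dynamic-update logic (reuse a stored prompt for a similar domain, create a new one otherwise) resembles online clustering and can be sketched as follows. `stats` as a batch-statistics vector, the 0.9/0.1 refresh rate, and the prompt shape are all assumed interfaces, not the paper's:

```python
import numpy as np

def dynamic_update(coreset, stats, tau):
    """Match the current domain statistics against stored domain
    centroids: if the nearest one is within tau, refresh it and reuse
    its prompt; otherwise append a fresh (centroid, prompt) entry."""
    if coreset:
        dists = [np.linalg.norm(stats - c) for c, _ in coreset]
        i = int(np.argmin(dists))
        if dists[i] < tau:
            c, prompt = coreset[i]
            coreset[i] = (0.9 * c + 0.1 * stats, prompt)  # refresh the match
            return i
    coreset.append((stats.copy(), np.zeros(4)))           # new domain prompt
    return len(coreset) - 1
```

Because recurring domains hit the distance test instead of spawning new parameters, adaptation cost stays bounded even when domains revisit with varying frequency.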
Paperid:2840
Authors:Hao Li · Hao Wan · Yuzhou Chen · Dongsheng Ye · Yulia Gel · Hao Jiang
Abstract:
Dynamic graphs evolve continuously, presenting challenges for traditional graph learning due to their changing structures and temporal dependencies. Recent advancements have shown potential in addressing these challenges by developing suitable meta-learning-based dynamic graph neural network models. However, most meta-learning approaches for dynamic graphs rely on fixed weight update parameters, neglecting the essential intrinsic complex high-order topological information of dynamically evolving graphs. We have designed Dowker Zigzag Persistence (DZP), an efficient and stable dynamic graph persistent homology representation method based on Dowker complex and zigzag persistence, to capture the high-order features of dynamic graphs. Armed with the DZP ideas, we propose TMetaNet, a new meta-learning parameter update model based on dynamic topological features. By utilizing the distances between high-order topological features, TMetaNet enables more effective adaptation across snapshots. Experiments on real-world datasets demonstrate TMetaNet's state-of-the-art performance and resilience to graph noise, illustrating its high potential for meta-learning and dynamic graph analysis.
Paperid:2841
Authors:Kuan Po Huang · Shu-wen Yang · Huy Phan · Bo-Ru Lu · Byeonggeun Kim · Sashank Macha · Qingming Tang · Shalini Ghosh · Hung-yi Lee · Chieh-Chi Kao · Chao Wang
Abstract:
Text-to-audio generation synthesizes realistic sounds or music given a natural language prompt. Diffusion-based frameworks, including the Tango and the AudioLDM series, represent the state-of-the-art in text-to-audio generation. Despite achieving high audio fidelity, they incur significant inference latency due to the slow diffusion sampling process. MAGNET, a mask-based model operating on discrete tokens, addresses slow inference through iterative mask-based parallel decoding. However, its audio quality still lags behind that of diffusion-based models. In this work, we introduce IMPACT, a text-to-audio generation framework that achieves high performance in audio quality and fidelity while ensuring fast inference. IMPACT utilizes iterative mask-based parallel decoding in a continuous latent space powered by diffusion modeling. This approach eliminates the fidelity constraints of discrete tokens while maintaining competitive inference speed. Results on AudioCaps demonstrate that IMPACT achieves state-of-the-art performance on key metrics including Fréchet Distance (FD) and Fréchet Audio Distance (FAD) while significantly reducing latency compared to prior models.
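The iterative mask-based parallel decoding that MAGNET-style models use (and that IMPACT lifts into a continuous latent space) follows a MaskGIT-like schedule: start fully masked and commit the most confident predictions each step. A discrete-token sketch, with `logits_fn` as a hypothetical stand-in for the token predictor:

```python
import numpy as np

def iterative_mask_decode(logits_fn, length, steps=4):
    """Start with all positions masked (-1); at each step, commit the
    highest-confidence predictions so that after `steps` rounds every
    position is filled. `logits_fn` maps the partial sequence to
    per-position probabilities over the vocabulary."""
    tokens = np.full(length, -1)            # -1 marks a masked position
    for s in range(steps):
        probs = logits_fn(tokens)           # (length, vocab)
        pred = probs.argmax(axis=1)
        conf = probs.max(axis=1)
        conf[tokens != -1] = -np.inf        # never revisit committed tokens
        n_commit = int(np.ceil(length * (s + 1) / steps)) - (tokens != -1).sum()
        for i in np.argsort(conf)[::-1][:n_commit]:
            tokens[i] = pred[i]
    return tokens
```

Decoding takes `steps` model calls regardless of sequence length, which is the source of the latency advantage over token-by-token or many-step diffusion sampling.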
Paperid:2842
Authors:Taeckyung Lee · Sorn Chottananurak · Junsu Kim · Jinwoo Shin · Taesik Gong · Sung-Ju Lee
Abstract:
Deep learning models perform poorly when domain shifts exist between training and test data. Test-time adaptation (TTA) is a paradigm to mitigate this issue by adapting pre-trained models using only unlabeled test samples. However, existing TTA methods can fail under severe domain shifts, while recent active TTA approaches requiring full-class labels are impractical due to high labeling costs. To address this issue, we introduce a new setting of TTA with binary feedback, which uses a few binary feedbacks from annotators to indicate whether model predictions are correct, thereby significantly reducing the labeling burden of annotators. Under the setting, we propose BiTTA, a novel dual-path optimization framework that leverages reinforcement learning to balance binary feedback-guided adaptation on uncertain samples with agreement-based self-adaptation on confident predictions. Experiments show BiTTA achieves substantial accuracy improvements over state-of-the-art baselines, demonstrating its effectiveness in handling severe distribution shifts with minimal labeling effort.
Paperid:2843
Authors:Yinghui Li · Jiayi Kuang · Haojing Huang · Zhikun Xu · Xinnian Liang · Yi Yu · Wenlian Lu · Yangning Li · Xiaoyu Tan · Chao Qu · Ying Shen · Hai-Tao Zheng · Philip Yu
Abstract:
Leveraging mathematical Large Language Models (LLMs) for proof generation is a fundamental topic in LLMs research. We argue that the ability of current LLMs to prove statements largely depends on whether they have encountered the relevant proof process during training. This reliance limits their deeper understanding of mathematical theorems and related concepts. Inspired by the pedagogical method of "proof by counterexamples" commonly used in human mathematics education, our work aims to enhance LLMs’ ability to conduct mathematical reasoning and proof through counterexamples. Specifically, we manually create a high-quality, university-level mathematical benchmark, COUNTERMATH, which requires LLMs to prove mathematical statements by providing counterexamples, thereby assessing their grasp of mathematical concepts. Additionally, we develop a data engineering framework to automatically obtain training data for further model improvement. Extensive experiments and detailed analyses demonstrate that COUNTERMATH is challenging, indicating that LLMs, such as OpenAI o1, have insufficient counterexample-driven proof capabilities. Moreover, our exploration into model training reveals that strengthening LLMs' counterexample-driven conceptual reasoning abilities is crucial for improving their overall mathematical capabilities. We believe that our work offers new perspectives on the community of mathematical LLMs.
Paperid:2844
Authors:Yuan Gao · Hao Wu · Ruiqi Shu · huanshuo dong · Fan Xu · Rui Chen · Yibo Yan · Qingsong Wen · Xuming Hu · Kun Wang · Jiahao Wu · Qing Li · Hui Xiong · Xiaomeng Huang
Abstract:
Accurate weather forecasts are important for disaster prevention, agricultural planning, and water resource management. Traditional numerical weather prediction (NWP) methods offer physically interpretable high-accuracy predictions but are computationally expensive and fail to fully leverage rapidly growing historical data. In recent years, deep learning methods have made significant progress in weather forecasting, but challenges remain, such as balancing global and regional high-resolution forecasts, excessive smoothing in extreme event predictions, and insufficient dynamic system modeling. To address these issues, this paper proposes a global-regional nested weather forecasting framework based on graph neural networks (GNNs). By combining a dynamic system perspective with multi-grid theory, we construct a multi-scale graph structure and densify the target region to capture local high-frequency features. We introduce an adaptive information propagation mechanism, using dynamic gating units to deeply integrate node and edge features for more accurate extreme event forecasting. For high-resolution regional forecasts, we propose a neural nested grid method to mitigate boundary information loss. Experimental results show that the proposed method performs excellently across global to regional scales and short-term to long-term forecasts, especially in extreme event predictions (e.g., typhoons), significantly improving forecast accuracy. Our codes are available at \url{https://anonymous.4open.science/r/Anonymous-EE6B}.
Paperid:2845
Authors:Deng · Ruidi Chang · Hanjie Chen
Abstract:
Interventions in language models (LMs) are frequently employed in interpretability research to modify model behavior during the forward pass. Learnable interventions, also known as representation fine-tuning, aim to apply pointwise control within the concept subspace and have proven effective in altering high-level behaviors. In this work, we extend this approach to the distribution level, enabling the model to learn not only pointwise transformations but also the surrounding regions of the concept subspace. We demonstrate that these methods perform optimally in early layers, with larger standard deviations correlating strongly with improved performance. Across eight commonsense reasoning and seven arithmetic reasoning benchmarks, our distribution-level interventions consistently outperform pointwise interventions in controllability and robustness. These results illustrate that distribution-level interventions provide a more comprehensive method for steering model behavior, advancing interpretability research, and enabling finer-grained control over language models.
Paperid:2846
Authors:Guiomar Pescador Barrios · Sarah Filippi · Mark van der Wilk
Abstract:
Many machine learning models require setting a parameter that controls their size before training, e.g. the number of neurons in DNNs, or inducing points in GPs. Increasing capacity typically improves performance until all the information from the dataset is captured. After this point, computational cost keeps increasing without improved performance. This leads to the question "How big is big enough?" We investigate this problem for Gaussian processes (single-layer neural networks) in continual learning. Here, data becomes available incrementally, and the final dataset size will therefore not be known before training, preventing the use of heuristics for setting a fixed model size. We develop a method to automatically adjust model size while maintaining near-optimal performance. Our experimental procedure follows the constraint that any hyperparameters must be set without seeing dataset properties, and we show that our method performs well across diverse datasets without the need to adjust its hyperparameter, showing it requires less tuning than others.
Paperid:2847
Authors:Thomas Fel · Ekdeep Singh Lubana · Jacob Prince · Matthew Kowal · Victor Boutin · Isabel Papadimitriou · Binxu Wang · Martin Wattenberg · Demba Ba · Talia Konkle
Abstract:
Sparse Autoencoders (SAEs) have emerged as a powerful framework for machine learning interpretability, enabling the unsupervised decomposition of model representations into a dictionary of abstract, human-interpretable concepts. However, we reveal a fundamental limitation: SAEs exhibit severe instability, as identical models trained on similar datasets can produce sharply different dictionaries, undermining their reliability as an interpretability tool. To address this issue, we draw inspiration from the Archetypal Analysis framework introduced by Cutler & Breiman (1994) and present Archetypal SAEs (A-SAE), wherein dictionary atoms are constrained to the data’s convex hull. This geometric anchoring significantly enhances the stability and plausibility of inferred dictionaries, and their mildly relaxed variants, RA-SAEs, further match state-of-the-art reconstruction abilities. To rigorously assess the quality of dictionaries learned by SAEs, we introduce two new benchmarks that test (i) plausibility, i.e., whether dictionaries recover “true” classification directions, and (ii) identifiability, i.e., whether dictionaries disentangle synthetic concept mixtures. Across all evaluations, RA-SAEs consistently yield more structured representations while uncovering novel, semantically meaningful concepts in large-scale vision models.
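The geometric anchoring described in this abstract can be illustrated in a few lines: if each dictionary atom is a softmax-weighted (hence convex) combination of data points, it necessarily lies inside the data's convex hull. The sketch below is only a hedged illustration of that idea, not the A-SAE training code; the function and variable names are invented for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def archetypal_atoms(logits, data):
    """Each dictionary atom is a convex combination of data points,
    so it is guaranteed to lie inside the data's convex hull."""
    atoms = []
    for row in logits:          # one row of logits per atom
        w = softmax(row)        # convex weights: non-negative, sum to 1
        atom = [sum(w[i] * data[i][d] for i in range(len(data)))
                for d in range(len(data[0]))]
        atoms.append(atom)
    return atoms

data = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # toy 2-D points
atoms = archetypal_atoms([[0.0, 0.0, 0.0], [0.0, 5.0, 0.0]], data)
# uniform weights give the centroid; the peaked second row lands near data[1]
```

Training would then optimize the logits (and an encoder) under a reconstruction loss, while the convex-hull constraint is satisfied by construction.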
Paperid:2848
Authors:Roy Uziel · Irit Chelly · Oren Freifeld · Ari Pakman
Abstract:
Diffusion models, widely recognized for their success in generative tasks, have not yet been applied to clustering. We introduce Clustering via Diffusion (CLUDI), a self-supervised framework that combines the generative power of diffusion models with pre-trained Vision Transformer features to achieve robust and accurate clustering. CLUDI is trained via a teacher–student paradigm: the teacher uses stochastic diffusion-based sampling to produce diverse cluster assignments, which the student refines into stable predictions. This stochasticity acts as a novel data augmentation strategy, enabling CLUDI to uncover intricate structures in high-dimensional data. Extensive evaluations on challenging datasets demonstrate that CLUDI achieves state-of-the-art performance in unsupervised classification, setting new benchmarks in clustering robustness and adaptability to complex data distributions. Our code will be released upon acceptance.
Paperid:2849
Authors:Yalan Qin · Nan Pu · Guorui Feng · Nicu Sebe
Abstract:
As a leading unsupervised classification algorithm in artificial intelligence, multi-view subspace clustering segments unlabeled data drawn from different subspaces. Recent anchor-based methods have been proposed to decrease the computational complexity on large-scale datasets in multi-view clustering. The major differences among these methods lie in the objective functions they define. Despite considerable success, these works pay little attention to guaranteeing the robustness of the learned consensus anchors in an effective manner for efficient multi-view clustering, or to investigating the specific local distribution of clusters in the affine subspace. Besides, the robust consensus anchors and the common cluster structure shared by different views cannot be learned simultaneously. In this paper, we propose Robust Consensus anchors learning for efficient multi-view Subspace Clustering (RCSC). We first show that if the data are sufficiently sampled from independent subspaces and the objective function meets some conditions, the achieved anchor graph has a block-diagonal structure. As a special case, we provide a model based on a Frobenius norm with non-negative and affine constraints in consensus anchors learning, which guarantees the robustness of the learned consensus anchors for efficient multi-view clustering and investigates the specific local distribution of clusters in the affine subspace. Although simple, we give a theoretical geometric analysis of the formulated RCSC. The union of these three constraints restricts how each data point is described in the affine subspace with a specific local cluster distribution, guaranteeing the robustness of the learned consensus anchors. RCSC takes full advantage of the correlation among consensus anchors, which encourages the grouping effect and groups highly correlated consensus anchors together under the guidance of the view-specific projection. The anchor graph construction, partition, and robust anchor learning are jointly integrated into a unified framework. This ensures mutual enhancement among these procedures and leads to more discriminative consensus anchors as well as the cluster indicator. We then adopt an alternating optimization strategy to solve the formulated problem. Experiments performed on eight multi-view datasets confirm the superiority of RCSC in terms of both effectiveness and efficiency.
Paperid:2850
Authors:Haotian Wu · Gongpu Chen · Pier Luigi Dragotti · Deniz Gunduz
Abstract:
We introduce and validate the lottery codec hypothesis, which states that untrained subnetworks within randomly initialized networks can serve as synthesis networks for overfitted image compression, achieving rate-distortion (RD) performance comparable to trained networks. This hypothesis leads to a new paradigm for image compression by encoding image statistics into the network substructure. Building on this hypothesis, we propose LotteryCodec, which overfits a binary mask to an individual image, leveraging an over-parameterized and randomly initialized network shared by the encoder and the decoder. To address over-parametrization challenges and streamline subnetwork search, we develop a rewind modulation mechanism that improves the RD performance. LotteryCodec outperforms VTM and sets a new state-of-the-art in single-image compression. LotteryCodec also enables adaptive decoding complexity through adjustable mask ratios, offering flexible compression solutions for diverse device constraints and application requirements.
Paperid:2851
Authors:Saumya Gaurang Shah · Abishek Sankararaman · Balakrishnan Narayanaswamy · Vikramank Singh
Abstract:
Can we efficiently choose the best Anomaly Detection (AD) algorithm for a data-stream without requiring anomaly labels? Streaming anomaly detection is hard. SOTA AD algorithms are sensitive to their hyperparameters and no single method works well on all datasets. The best algorithm/hyper-parameter combination for a given data-stream can change over time with data drift. 'What is an anomaly?' is often application, context and dataset dependent. We propose \mymethod (Streaming Ensemble of Anomaly Detectors), the first model selection algorithm for streaming, unsupervised AD. All prior AD model selection algorithms are either supervised, or only work in the offline setting when all data from the test set is available upfront. We show that \mymethod is {\em(i)} unsupervised, i.e., requires no true anomaly labels, {\em(ii)} efficiently implementable in a streaming setting, {\em (iii)} agnostic to the choice of base algorithms from which it chooses, and {\em (iv)} adaptive to non-stationarity in the data-stream. Experiments on 14 non-trivial public datasets and an internal dataset corroborate our claims.
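As a toy illustration of label-free model selection over a stream, the sketch below scores each candidate detector by its running disagreement with the ensemble's median anomaly score and picks the best one. This surrogate criterion and all names are assumptions made for illustration only; they are not the paper's actual selection rule.

```python
def select_detector(score_streams):
    """Pick the detector whose anomaly scores agree best with the
    ensemble consensus (the median), a label-free surrogate criterion."""
    n_det = len(score_streams)
    T = len(score_streams[0])
    loss = [0.0] * n_det
    for t in range(T):
        col = sorted(s[t] for s in score_streams)
        consensus = col[len(col) // 2]          # median score at time t
        for j in range(n_det):
            loss[j] += abs(score_streams[j][t] - consensus)
    return min(range(n_det), key=lambda j: loss[j])

# three detectors scoring the same 4-step stream; detector 1 tracks the consensus
streams = [[0.9, 0.1, 0.8, 0.2],
           [0.5, 0.4, 0.6, 0.5],
           [0.5, 0.5, 0.6, 0.4]]
best = select_detector(streams)
```

In a true streaming setting the per-detector losses would be updated online (possibly with exponential forgetting to track drift) rather than recomputed over the full history.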
Paperid:2852
Authors:KA HIM WONG · Jicheng Zhou · Jiantao Zhou · Yain-Whar Si
Abstract:
The rise of LLMs has increased concerns over source tracing and copyright protection for AIGC, highlighting the need for advanced detection technologies. Passive detection methods usually face high false positives, while active watermarking techniques using logits or sampling manipulation offer more effective protection. Existing LLM watermarking methods, though effective on unaltered content, suffer significant performance drops when the text is modified and could introduce biases that degrade LLM performance in downstream tasks. These methods fail to achieve an optimal tradeoff between text quality and robustness, particularly due to the lack of end-to-end optimization of the encoder and decoder. In this paper, we introduce a novel end-to-end logits perturbation method for watermarking LLM-generated text. Through joint optimization, our approach achieves a better balance between quality and robustness. To address non-differentiable operations in the end-to-end training pipeline, we introduce an online prompting technique that leverages the on-the-fly LLM as a differentiable surrogate. Our method achieves superior robustness, outperforming distortion-free methods by 37–39% under paraphrasing and 17.2% on average, while maintaining text quality on par with these distortion-free methods in terms of text perplexity and downstream tasks. Our method can be easily generalized to different LLMs. The code is provided in the supplementary material.
Paperid:2853
Authors:Yibo Xu · Dawei Zhou · Decheng Liu · Nannan Wang
Abstract:
Deep neural networks are found to be vulnerable to adversarial noise. The prompt-based defense has been increasingly studied due to its high efficiency. However, existing prompt-based defenses mainly exploited mixed prompt patterns, where critical patterns closely related to object semantics lack sufficient focus. The phase and amplitude spectra have been proven to be highly related to specific semantic patterns and crucial for robustness. To this end, in this paper, we propose a Phase and Amplitude-aware Prompting (PAP) defense. Specifically, we construct phase-level and amplitude-level prompts for each class, and adjust weights for prompting according to the model's robust performance under these prompts during training. During testing, we select prompts for each image using its predicted label to obtain the prompted image, which is inputted to the model to get the final prediction. Experimental results demonstrate the effectiveness of our method.
Paperid:2854
Authors:Yeseul Cho · Baekrok Shin · Changmin Kang · Chulhee Yun
Abstract:
Recent advances in deep learning rely heavily on massive datasets, leading to substantial storage and training costs. Dataset pruning aims to alleviate this demand by discarding redundant examples. However, many existing methods require training a model on the full dataset for a large number of epochs before being able to prune the dataset, which ironically makes the pruning process more expensive than just training the model on the entire dataset. To overcome this limitation, we introduce a Difficulty and Uncertainty-Aware Lightweight (DUAL) score, which aims to identify important samples from the early training stage by considering both example difficulty and prediction uncertainty. To address the catastrophic accuracy drop at extreme pruning ratios, we further propose pruning-ratio-adaptive sampling using a Beta distribution. Experiments on various datasets and learning scenarios, such as image classification with label noise and image corruption, and model-architecture generalization, demonstrate the superiority of our method over previous state-of-the-art (SOTA) approaches. Specifically, on ImageNet-1k, our method reduces the time cost for pruning to 66\% compared to previous methods while achieving a SOTA 60\% test accuracy at a 90\% pruning ratio. On CIFAR datasets, the time cost is reduced to just 15\% while maintaining SOTA performance.
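As a hedged illustration of scoring samples by difficulty and uncertainty from early-training signals, the sketch below combines the average shortfall of the true-label probability (difficulty) with its fluctuation across epochs (uncertainty). The exact combination used by the DUAL score may differ; the names here are illustrative.

```python
import math

def dual_style_score(p_true_history):
    """Illustrative difficulty-and-uncertainty score from a sample's
    predicted probability of its true label over early training epochs.
    difficulty  : how far predictions stay from the true label on average
    uncertainty : how much the prediction fluctuates between epochs"""
    n = len(p_true_history)
    mean_p = sum(p_true_history) / n
    difficulty = 1.0 - mean_p
    var = sum((p - mean_p) ** 2 for p in p_true_history) / n
    uncertainty = math.sqrt(var)
    return difficulty * uncertainty

easy = dual_style_score([0.95, 0.97, 0.98])   # confident and stable -> low score
hard = dual_style_score([0.2, 0.8, 0.3])      # wavering prediction  -> high score
```

Pruning would then keep the high-scoring (informative) samples, with the keep-ratio adapted to the target pruning ratio, e.g. via Beta-distribution sampling as the abstract describes.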
Paperid:2855
Authors:Tianci Bu · Le Zhou · Wenchuan Yang · Jianhong Mou · Kang Yang · Suoyi Tan · Feng Yao · Jingyuan Wang · Xin Lu
Abstract:
Trajectory data is crucial for various applications but often suffers from incompleteness due to device limitations and diverse collection scenarios. Existing imputation methods rely on sparse trajectory or travel information, such as velocity, to infer missing points. However, these approaches assume that sparse trajectories retain essential behavioral patterns, which places significant demands on data acquisition and overlooks the potential of large-scale human trajectory embeddings. To address this, we propose ProDiff, a trajectory imputation framework that uses only two endpoints as minimal information. It integrates prototype learning to embed human movement patterns and a denoising diffusion probabilistic model for robust spatiotemporal reconstruction. Joint training with a tailored loss function ensures effective imputation. ProDiff outperforms state-of-the-art methods, improving accuracy by 6.28\% on FourSquare and 2.52\% on WuXi. Further analysis shows a 0.927 correlation between generated and real trajectories, demonstrating the effectiveness of our approach.
Paperid:2856
Authors:Masatoshi Uehara · su · Yulai Zhao · Xiner Li · Aviv Regev · Shuiwang Ji · Sergey Levine · Tommaso Biancalani
Abstract:
To fully leverage the capabilities of diffusion models, we are often interested in optimizing downstream reward functions during inference. While numerous algorithms for reward-guided generation have been recently proposed due to their significance, current approaches predominantly focus on single-shot generation, transitioning from fully noised to denoised states. We propose a novel framework for inference-time reward optimization with diffusion models. Our approach employs an iterative refinement process consisting of two steps in each iteration: noising and reward-guided denoising. This sequential refinement allows for the gradual correction of errors introduced during reward optimization. Finally, we provide a theoretical guarantee for our framework and demonstrate its superior empirical performance in protein and DNA design.
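The iterative noising / reward-guided-denoising loop can be caricatured in one dimension: each iteration perturbs the sample, then nudges it toward higher reward. This is a minimal sketch under toy assumptions (a quadratic reward and a finite-difference stand-in for the guided denoiser), not the proposed algorithm itself.

```python
import random

def refine(x, reward, steps=50, noise=0.3, lr=0.5, seed=0):
    """Toy sketch of iterative noising + reward-guided denoising in 1-D.
    Each iteration perturbs the sample (noising), then moves it toward
    higher reward (a stand-in for reward-guided denoising)."""
    rng = random.Random(seed)
    eps = 1e-3
    for _ in range(steps):
        x = x + rng.gauss(0.0, noise)                 # noising step
        grad = (reward(x + eps) - reward(x - eps)) / (2 * eps)
        x = x + lr * grad                             # reward-guided denoising
    return x

reward = lambda x: -(x - 2.0) ** 2    # toy reward peaked at x = 2
x_final = refine(x=-5.0, reward=reward)
```

The repeated noise injections are what distinguish this from single-shot guidance: errors made in one denoising step can be corrected in later iterations.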
Paperid:2857
Authors:Zhaoyi Zhou · Yuda Song · Andrea Zanette
Abstract:
Accurately evaluating large language models (LLMs) is crucial for ensuring their reliability and safety. Among various approaches, human evaluation is the gold standard, especially when we need to capture nuanced qualities like coherence, readability and alignment with human expectations. However, such evaluations are often resource-intensive and costly, posing challenges for scalability. To reduce the cost, previous academic benchmarks employ other LLMs to create synthetic feedback, introducing bias into the evaluation. Towards accurate and economical LLM evaluation, in this work, we introduce a statistically-inspired estimator that requires a reduced number of human annotations via synthetic feedback. Our experiments demonstrate a reduction in human annotations by up to $12.2$\% with an off-the-shelf synthetic evaluator, and up to $24.8$\% with a fine-tuned variant. Apart from being general and scalable, our proposed method enjoys predictable saving performance.
Paperid:2858
Authors:Edoardo Botta · Yuchen Li · Aashay Mehta · Jordan Ash · Cyril Zhang · Andrej Risteski
Abstract:
Recently, a plethora of works have proposed inference-time algorithms (e.g. best-of-n), which incorporate verifiers to assist the generation process. Their quality-efficiency trade-offs have been empirically benchmarked on a variety of constrained generation tasks, but the algorithmic design landscape is still largely poorly understood. In this paper, we develop a mathematical framework for reasoning about constrained generation using a pre-trained language model generator oracle and a process verifier, which can decide whether a prefix can be extended to a string which satisfies the constraints of choice. We show that even in very simple settings, access to a verifier can render an intractable problem (information-theoretically or computationally) tractable. In fact, we show even simple algorithms, like tokenwise rejection sampling, can enjoy significant benefits from access to a verifier. Empirically, we show that a natural modification of tokenwise rejection sampling, in which the sampler is allowed to "backtrack" (i.e., erase the final few generated tokens), has robust and substantive benefits over natural baselines (e.g. (blockwise) rejection sampling, nucleus sampling), both in terms of computational efficiency, accuracy and diversity.
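The backtracking variant of tokenwise rejection sampling is simple enough to sketch directly: sample a token, ask the process verifier whether the prefix is still acceptable, and on rejection erase the last few tokens rather than only the newest one. The toy constraint, parameter names, and budget below are illustrative assumptions, not the paper's exact setup.

```python
import random

def backtracking_sampler(propose, verifier, max_len,
                         backtrack=2, seed=0, max_tries=1000):
    """Sketch of tokenwise rejection sampling with backtracking: sample a
    token, check the extended prefix with the process verifier; on
    rejection, erase the last few tokens instead of only the newest one."""
    rng = random.Random(seed)
    prefix, tries = [], 0
    while len(prefix) < max_len and tries < max_tries:
        tries += 1
        cand = prefix + [propose(prefix, rng)]
        if verifier(cand):
            prefix = cand
        else:
            prefix = prefix[:max(0, len(prefix) - backtrack)]  # backtrack
    return prefix

# toy constraint: binary strings with no two consecutive 1s
verifier = lambda s: all(not (a == 1 and b == 1) for a, b in zip(s, s[1:]))
propose = lambda prefix, rng: rng.choice([0, 1])
out = backtracking_sampler(propose, verifier, max_len=8)
```

Since the prefix only ever grows through verifier-accepted candidates, any returned string satisfies the constraint by construction; backtracking lets the sampler escape prefixes from which many continuations would be rejected.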
Paperid:2859
Authors:Shaobin Zhuang · Yiwei Guo · Yanbo Ding · Kunchang Li · Xinyuan Chen · Yaohui Wang · Fangyikang Wang · Ying Zhang · Chen Li · Yali Wang
Abstract:
Diffusion models have driven the advancement of vision generation over the past years. However, it is often difficult to apply these large models in downstream tasks, due to massive fine-tuning cost. Recently, Low-Rank Adaptation (LoRA) has been applied for efficient tuning of diffusion models. Unfortunately, the capabilities of LoRA-tuned diffusion models are limited, since the same LoRA is used for different timesteps of the diffusion process. To tackle this problem, we introduce a general and concise TimeStep Master (TSM) paradigm with two key fine-tuning stages. In the fostering stage (1-stage), we apply different LoRAs to fine-tune the diffusion model at different timestep intervals. This results in different TimeStep LoRA experts that can effectively capture different noise levels. In the assembling stage (2-stage), we design a novel asymmetrical mixture of TimeStep LoRA experts, via core-context collaboration of experts at multi-scale intervals. For each timestep, we leverage TimeStep LoRA expert within the smallest interval as the core expert without gating, and use experts within the bigger intervals as the context experts with time-dependent gating. Consequently, our TSM can effectively model the noise level via the expert in the finest interval, and adaptively integrate contexts from the experts of other scales, boosting the versatility of diffusion models. To show the effectiveness of our TSM paradigm, we conduct extensive experiments on three typical and popular LoRA-related tasks of diffusion models, including domain adaptation, post-pretraining, and model distillation. Our TSM achieves the state-of-the-art results on all these tasks, throughout various model structures (UNet, DiT and MM-DiT) and visual data modalities (Image, Video), showing its remarkable generalization capacity.
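The interval-based expert routing can be sketched as follows: each scale partitions the T diffusion timesteps into equal intervals, the finest scale supplies the core expert, and coarser scales supply context experts. The id scheme below is a hypothetical simplification (the time-dependent gating is not shown).

```python
def route_experts(t, T, scales):
    """For diffusion timestep t in [0, T), return one expert id per scale:
    the finest scale gives the core expert, coarser scales give context
    experts.  An expert id is (num_intervals, interval_index)."""
    ids = []
    for n in scales:                      # e.g. scales = (8, 4, 1)
        idx = min(t * n // T, n - 1)      # which of the n intervals holds t
        ids.append((n, idx))
    return ids

# timestep 300 of 1000, with scales of 8, 4, and 1 intervals
core, *context = route_experts(t=300, T=1000, scales=(8, 4, 1))
```

During inference the core expert's LoRA would be applied directly, while the context experts' outputs would be blended in through their learned gates.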
Paperid:2860
Authors:chang liu · Jiangrong Shen · Xuming Ran · Mingkun Xu · Qi Xu · Yi Xu · Gang Pan
Abstract:
Artificial neural networks (ANNs) have demonstrated outstanding performance in numerous tasks, but deployment in resource-constrained environments remains a challenge due to their high computational and memory requirements. Spiking neural networks (SNNs) operate through discrete spike events and offer superior energy efficiency, providing a bio-inspired alternative. However, current ANN-to-SNN conversion often results in significant accuracy loss and increased inference time due to conversion errors such as clipping, quantization, and uneven activation. This paper proposes a novel ANN-to-SNN conversion framework based on error compensation learning. We introduce a learnable threshold clipping function, dual-threshold neurons, and an optimized membrane potential initialization strategy to mitigate the conversion error. Together, these techniques address the clipping error through adaptive thresholds, dynamically reduce the quantization error through dual-threshold neurons, and minimize the non-uniformity error by effectively managing the membrane potential. Experimental results on the CIFAR-10, CIFAR-100, and ImageNet datasets show that our method achieves high precision and ultra-low latency among existing conversion methods. Using only two time steps, our method significantly reduces the inference time while maintaining a competitive accuracy of 94.75% on the CIFAR-10 dataset with a ResNet-18 structure. This research promotes the practical application of SNNs on low-power hardware, making efficient real-time processing possible.
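The relation between a spiking neuron's firing rate and a clipped ANN activation, which underlies the clipping error discussed above, can be checked with a toy integrate-and-fire simulation. This is a generic sketch of rate-based conversion, not the paper's dual-threshold neuron.

```python
def if_neuron_rate(inp, threshold, T):
    """Integrate-and-fire neuron simulated for T time steps with a constant
    input current; returns the firing rate rescaled to activation units.
    The result approximates the ANN activation clipped to [0, threshold]."""
    v, spikes = 0.0, 0
    for _ in range(T):
        v += inp
        if v >= threshold:
            spikes += 1
            v -= threshold          # soft reset preserves residual charge
    return spikes / T * threshold   # rescale rate back to activation units

rate = if_neuron_rate(inp=0.25, threshold=1.0, T=100)       # below threshold
rate_clip = if_neuron_rate(inp=1.5, threshold=1.0, T=100)   # above threshold
```

An input below the threshold is recovered exactly by the rate code, while an input above it saturates at the threshold: this saturation is exactly the clipping error that a learnable threshold aims to control.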
Paperid:2861
Authors:Thomas Rupf · Marco Bagatella · Nico Gürtler · Jonas Frey · Georg Martius
Abstract:
Zero-shot imitation learning algorithms hold the promise of reproducing unseen behavior from as little as a single demonstration at test time. Existing practical approaches view the expert demonstration as a sequence of goals, enabling imitation with a high-level goal selector, and a low-level goal-conditioned policy. However, this framework can suffer from myopic behavior: the agent's immediate actions towards achieving individual goals may undermine long-term objectives. We introduce a novel method that mitigates this issue by directly optimizing the occupancy matching objective that is intrinsic to imitation learning. We propose to lift a goal-conditioned value function to a distance between occupancies, which are in turn approximated via a learned world model. The resulting method can learn from offline, suboptimal data, and is capable of non-myopic, zero-shot imitation, as we demonstrate in complex, continuous benchmarks.
Paperid:2862
Authors:Zhicheng Zhang · Wuyou Xia · Chenxi Zhao · Zhou Yan · Xiaoqiang Liu · Yongjie Zhu · Wenyu Qin · Pengfei Wan · Di ZHANG · Jufeng Yang
Abstract:
Multimodal large language models (MLLMs) recently showed strong capacity in integrating data among multiple modalities, empowered by a generalizable attention architecture. Advanced methods predominantly focus on language-centric tuning, with less exploration of how multimodal tokens are mixed through attention, posing challenges in high-level tasks that require fine-grained cognition and emotion understanding. In this work, we identify the attention deficit disorder problem in multimodal learning, caused by inconsistent cross-modal attention and layer-by-layer decayed attention activation. To address this, we propose a novel attention mechanism, termed MOdular Duplex Attention (MODA), simultaneously conducting inner-modal refinement and inter-modal interaction. MODA employs a correct-after-align strategy to effectively decouple modality alignment from cross-layer token mixing. In the alignment phase, tokens are mapped to duplex modality spaces based on the basis vectors, enabling interaction between the visual and language modalities. Further, the correctness of attention scores is ensured through adaptive masked attention, which enhances the model's flexibility by allowing customizable masking patterns for different modalities. Extensive experiments on 21 benchmark datasets verify the effectiveness of MODA in perception, cognition, and emotion tasks.
Paperid:2863
Authors:Yichen Li · Yuying Wang · Haozhao Wang · Yining Qi · Tianzhe Xiao · Ruixuan Li
Abstract:
Continual Federated Learning (CFL) allows distributed devices to collaboratively learn novel concepts from continuously shifting training data while avoiding \textit{knowledge forgetting} of previously seen tasks. To tackle this challenge, most current CFL approaches rely on extensive rehearsal of previous data. Despite effectiveness, rehearsal comes at a cost to memory, and it may also violate data privacy. Considering these, we seek to apply regularization techniques to CFL by considering their cost-efficient properties that do not require sample caching or rehearsal. Specifically, we first apply traditional regularization techniques to CFL and observe that existing regularization techniques, especially synaptic intelligence, can achieve promising results under homogeneous data distribution but fail when the data is heterogeneous. Based on this observation, we propose a simple yet effective regularization algorithm for CFL named \textbf{FedSSI}, which tailors the synaptic intelligence for the CFL with heterogeneous data settings. FedSSI can not only reduce computational overhead without rehearsal but also address the data heterogeneity issue. Extensive experiments show that FedSSI achieves superior performance compared to state-of-the-art methods.
Paperid:2864
Authors:Shengbin Ye · Meng Li
Abstract:
Symbolic regression (SR) is a powerful technique for discovering symbolic expressions that characterize nonlinear relationships in data, gaining increasing attention for its interpretability, compactness, and robustness. However, existing SR methods do not scale to datasets with a large number of input variables (referred to as extreme-scale SR), which is common in modern scientific applications. This "large $p$'' setting, often accompanied by measurement error, leads to slow performance of SR methods and overly complex expressions that are difficult to interpret. To address this scalability challenge, we propose a method called PAN+SR, which combines a key idea of ab initio nonparametric variable selection with SR to efficiently pre-screen large input spaces and reduce search complexity while maintaining accuracy. The use of nonparametric methods eliminates model misspecification, supporting a strategy called parametric-assisted nonparametric (PAN). We also extend SRBench, an open-source benchmarking platform, by incorporating high-dimensional regression problems with various signal-to-noise ratios. Our results demonstrate that PAN+SR consistently enhances the performance of 19 contemporary SR methods, enabling several to achieve state-of-the-art performance on these challenging datasets.
Paperid:2865
Authors:Alexander Kolesov · S. Manukhov · Vladimir Palyulin · Aleksandr Korotin
Abstract:
We propose Electrostatic Field Matching (EFM), a novel method that is suitable for both generative modelling and distribution transfer tasks. Our approach is inspired by the physics of an electrical capacitor. We place source and target distributions on the capacitor plates and assign them positive and negative charges, respectively. We then learn the capacitor's electrostatic field using a neural network approximator. To map the distributions to each other, we start at one plate of the capacitor and move the samples along the learned electrostatic field lines until they reach the other plate. We theoretically justify that this approach provably yields the distribution transfer. In practice, we demonstrate the performance of our EFM in toy and image data experiments.
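The transport step of EFM amounts to integrating samples along field lines until they reach the target plate. The sketch below does this with Euler steps in a uniform parallel-plate field standing in for the learned neural field; all names and the toy geometry are illustrative assumptions.

```python
def transport(x, field, step=0.01, max_iter=10000, target=1.0):
    """Euler integration of a 2-D sample along a given electrostatic field
    until it reaches the target plate at height `target`.  In EFM the field
    would be a learned neural approximator; here it is the uniform field of
    an idealized parallel-plate capacitor."""
    for _ in range(max_iter):
        if x[1] >= target:
            break                          # reached the target plate
        e = field(x)
        x = [x[0] + step * e[0], x[1] + step * e[1]]
    return x

# plates at y = 0 (source, positive) and y = 1 (target, negative)
uniform_field = lambda x: (0.0, 1.0)
moved = transport([0.25, 0.0], uniform_field)
```

With a learned, non-uniform field the same integration loop bends samples along curved field lines, which is what maps one distribution onto the other.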
Paperid:2866
Authors:Cheng Huang · Pan Mu · Cong Bai · Peter Watson
Abstract:
Deep learning methods have made significant progress in regular rainfall forecasting, yet the more hazardous tropical cyclone (TC) rainfall has not received the same attention. While regular rainfall models can offer valuable insights for designing TC rainfall forecasting models, most existing methods suffer from cumulative errors and lack physical consistency. Additionally, these methods overlook the importance of meteorological factors in TC rainfall and their integration with the numerical weather prediction (NWP) model. To address these issues, we propose Tropical Cyclone Precipitation Diffusion (TCP-Diffusion), a multi-modal model for forecasting TC precipitation given an existing TC in any location globally. It forecasts rainfall around the TC center for the next 12 hours at 3-hourly resolution based on past rainfall observations and multi-modal environmental variables. Adjacent residual prediction (ARP) changes the training target from the absolute rainfall value to the rainfall trend and gives our model the capability of rainfall change awareness, reducing cumulative errors and ensuring physical consistency. Considering the influence of TC-related meteorological factors and the useful information from NWP model forecasts, we propose a multi-model framework with specialized encoders to extract richer information from environmental variables and results provided by NWP models. The results of extensive experiments show that our method outperforms other DL methods and the NWP method from the European Centre for Medium-Range Weather Forecasts (ECMWF).
Paperid:2867
Authors:Jingyao Wang · Siyu Zhao · Wenwen Qiang · Jiangmeng Li · Changwen Zheng · Fuchun Sun · Hui Xiong
Abstract:
Multi-Modal Learning (MML) aims to learn effective representations across modalities for accurate predictions. Existing methods typically focus on modality consistency and specificity to learn effective representations. However, from a causal perspective, they may lead to representations that contain insufficient and unnecessary information. To address this, we propose that effective MML representations should be causally sufficient and necessary. Considering practical issues like spurious correlations and modality conflicts, we relax the exogeneity and monotonicity assumptions prevalent in prior works and explore the concepts specific to MML, i.e., Causal Complete Cause ($C^3$). We begin by defining $C^3$, which quantifies the probability of representations being causally sufficient and necessary. We then discuss the identifiability of $C^3$ and introduce an instrumental variable to support identifying $C^3$ with non-exogeneity and non-monotonicity. Building on this, we conduct the $C^3$ measurement, i.e., $C^3$ risk. We propose a twin network to estimate it through (i) the real-world branch: utilizing the instrumental variable for sufficiency, and (ii) the hypothetical-world branch: applying gradient-based counterfactual modeling for necessity. Theoretical analyses confirm its reliability. Based on these results, we propose $C^3$ Regularization, a plug-and-play method that enforces the causal completeness of the learned representations by minimizing $C^3$ risk. Extensive experiments demonstrate its effectiveness.
Paperid:2868
Authors:Zhihua Liu · Amrutha Saseendran · Lei Tong · Xilin He · Fariba Yousefi · Nikolay Burlutskiy · Dino Oglic · Tom Diethe · Philip Teare · Huiyu Zhou · Chen Jin
Abstract:
Open-set image segmentation poses a significant challenge because existing methods often demand extensive training or fine-tuning and generally struggle to segment unified objects consistently across diverse text reference expressions. Motivated by this, we propose Segment Anyword, a novel training-free visual concept prompt learning approach for open-set language grounded segmentation that relies on token-level cross-attention maps from a frozen diffusion model to produce segmentation surrogates or mask prompts, which are then refined into targeted object masks. Initial prompts typically lack coherence and consistency as the complexity of the image-text increases, resulting in suboptimal mask fragments. To tackle this issue, we further introduce a novel linguistic-guided visual prompt regularization that binds and clusters visual prompts based on sentence dependency and syntactic structural information, enabling the extraction of robust, noise-tolerant mask prompts, and significant improvements in segmentation accuracy. The proposed approach is effective, generalizes across different open-set segmentation tasks, and achieves state-of-the-art results of 52.5 (+6.8 relative) mIoU on Pascal Context 59, 67.73 (+25.73 relative) cIoU on gRefCOCO, and 67.4 (+1.1 relative to fine-tuned methods) mIoU on GranDf, which is the most complex open-set grounded segmentation task in the field.
Paperid:2869
Authors:Jiashu HE · Mingyu Ma · Jinxuan Fan · Dan Roth · Wei Wang · Alejandro Ribeiro
Abstract:
Existing approaches based on context prompting or reinforcement learning (RL) to improve the reasoning capacities of large language models (LLMs) depend on the LLMs' internal knowledge to produce reliable Chain-of-Thought (CoT). However, no matter the size of LLMs, certain problems cannot be resolved in a single forward pass. Meanwhile, agent-based reasoning systems require access to a comprehensive non-parametric knowledge base, which is often costly or not feasible for use in scientific and niche domains. We present Graph Inspired Veracity Extrapolation (GIVE), a novel reasoning method that merges parametric and non-parametric memories to improve accurate reasoning with minimal external input. GIVE guides the LLM agent to select the most pertinent expert data ($\textbf{observe}$), engage in query-specific divergent thinking ($\textbf{reflect}$), and then synthesize this information to produce the final output ($\textbf{speak}$). Extensive experiments demonstrate the following benefits of our framework: (1) GIVE boosts the performance of LLMs across various sizes. (2) In some scenarios, GIVE allows smaller LLMs to surpass larger, more sophisticated ones in scientific tasks ($\textbf{GPT3.5T + GIVE > GPT4}$). (3) GIVE is effective on scientific and open-domain assessments. (4) GIVE is a training-free method that enables LLMs to tackle new problems that extend beyond their training data (up to $\textbf{43.5\%} \rightarrow \textbf{88.2\%}$ accuracy improvement). (5) GIVE allows LLM agents to reason using both restricted (very small) and noisy (very large) knowledge sources, accommodating knowledge graphs (KG) ranging from $\textbf{135}$ to more than $\textbf{840k}$ nodes. (6) The reasoning process involved in GIVE is fully interpretable.
Paperid:2870
Authors:Zhangchi Zhao · Jun Shu · Deyu Meng · Zongben Xu
Abstract:
Inspired by the Kolmogorov-Arnold representation theorem, KANs offer a novel framework for function approximation by replacing traditional neural network weights with learnable univariate functions. This design demonstrates significant potential as an efficient and interpretable alternative to traditional MLPs. However, KANs are characterized by a substantially larger number of trainable parameters, leading to challenges in memory efficiency and higher training costs compared to MLPs. To address this limitation, we propose to generate weights for KANs via a smaller meta-learner, called MetaKANs. By training KANs and MetaKANs in an end-to-end differentiable manner, MetaKANs achieve comparable or even superior performance while significantly reducing the number of trainable parameters and maintaining promising interpretability. Extensive experiments on diverse benchmark tasks, including symbolic regression, partial differential equation solving, and image classification, demonstrate the effectiveness of MetaKANs in improving parameter efficiency and memory usage. The proposed method provides an alternative technique for training KANs that allows for greater scalability and extensibility, and narrows the training cost gap with MLPs stated in the original paper of KANs.
Paperid:2871
Authors:Wayne Chi · Valerie Chen · Anastasios Angelopoulos · Wei-Lin Chiang · Aditya Mittal · Naman Jain · Tianjun Zhang · Ion Stoica · Chris Donahue · Ameet Talwalkar
Abstract:
Evaluating in-the-wild coding capabilities of large language models (LLMs) is a challenging endeavor with no existing solution. We introduce EvalX, a platform to collect user preferences through native integration into a developer's working environment. EvalX comprises a novel interface for comparing pairs of model outputs, a sampling strategy to reduce experienced latency, and a prompting scheme to enable code completion functionality. EvalX has served over 4.5 million suggestions from 10 models and collected over 11k pairwise judgements. Our results highlight the importance of model evaluations in integrated settings. We find that model rankings from EvalX differ from those of existing evaluations, which we attribute to the unique distribution of data and tasks contained in EvalX. We also identify novel insights into human preferences on code such as an observed consistency in user preference across programming languages yet significant variation in preference due to task category. We will open-source EvalX and release data to enable human-centric evaluations and improve understanding of coding assistants.
Paperid:2872
Authors:Xuanshu Luo · Martin Werner
Abstract:
Two-dimensional (2D) convolutional kernels have dominated convolutional neural networks (CNNs) in image processing. While linearly scaling 1D convolution provides parameter efficiency, its naive integration into CNNs disrupts image locality, thereby degrading performance. This paper presents path convolution (PathConv), a novel CNN design built exclusively from 1D operations, achieving ResNet-level accuracy with only 1/3 of the parameters. To obtain locality-preserving image traversal paths, we analyze Hilbert/Z-order paths and expose a fundamental trade-off: improved proximity for most pixels comes at the cost of excessive distances from the remaining, sacrificed pixels to their neighbors. We resolve this issue by proposing path shifting, a succinct method to reposition sacrificed pixels. Using the randomized rounding algorithm, we show that three shifted paths are sufficient to offer better locality preservation than trivial raster scanning. To mitigate potential convergence issues caused by multiple paths, we design a lightweight path-aware channel attention mechanism to capture local intra-path and global inter-path dependencies. Experimental results further validate the efficacy of our method, establishing the proposed 1D PathConv as a viable backbone for efficient vision models.
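As an illustrative aside, the Z-order traversal the abstract refers to can be sketched in a few lines. This is not the paper's PathConv implementation; `morton_key` is just the standard bit-interleaving construction of a Z-order (Morton) index, shown here to make the "locality-preserving 1D path" idea concrete:

```python
def morton_key(x: int, y: int) -> int:
    """Interleave the bits of x and y to form a Z-order (Morton) index."""
    key = 0
    for i in range(16):  # supports coordinates up to 2**16
        key |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return key

def z_order_path(height: int, width: int):
    """All pixel coordinates of a height x width image, sorted along the Z-order curve."""
    return sorted(((x, y) for y in range(height) for x in range(width)),
                  key=lambda p: morton_key(p[0], p[1]))

# Flattening an 8x8 image along this path keeps most 2D neighbours close
# together in the 1D sequence -- the property a 1D convolution needs.
path = z_order_path(8, 8)
```

The trade-off the abstract describes is visible here: pixels inside each 2x2 block stay adjacent, while pixels on block boundaries can end up far apart in the 1D ordering.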
Paperid:2873
Authors:Yike Yuan · Ziyu Wang · Zihao Huang · Defa Zhu · Qiyang Min · Jingyi Yu · Xun Zhou
Abstract:
Diffusion models have emerged as a mainstream framework in visual generation. Building upon this success, the integration of Mixture of Experts (MoE) methods has shown promise in enhancing model scalability and performance. In this paper, we introduce RaceDiT, a novel MoE model for diffusion transformers with a flexible routing strategy, Expert Race. By allowing tokens and experts to compete together and select the top candidates, the model learns to dynamically assign experts to critical tokens. Additionally, we propose per-layer regularization to address challenges in shallow layer learning, and router similarity loss to prevent mode collapse, ensuring better expert utilization. Extensive experiments on ImageNet validate the effectiveness of our approach, showcasing significant performance gains while promising scaling properties.
Paperid:2874
Authors:Masanori Ishikura · Masayuki Karasuyama
Abstract:
This study considers multi-objective Bayesian optimization (MOBO) through the information gain of the Pareto-frontier. To calculate the information gain, a predictive distribution conditioned on the Pareto-frontier plays a key role, which is defined as a distribution truncated by the Pareto-frontier. However, it is usually impossible to obtain the entire Pareto-frontier in a continuous domain, and therefore, the complete truncation cannot be known. We consider an approximation of the truncated distribution by using a mixture distribution consisting of two possible approximate truncations obtainable from a subset of the Pareto-frontier, which we call over- and under-truncation. Since the optimal balance of the mixture is unknown beforehand, we propose optimizing the balancing coefficient through the variational lower bound maximization framework, by which the approximation error of the information gain can be minimized. Our empirical evaluation demonstrates the effectiveness of the proposed method, particularly when the number of objective functions is large.
Paperid:2875
Authors:Yi-Fan Zhang · Xiao Zhang · Min-Ling Zhang
Abstract:
Controllability has become a critical issue in trustworthy machine learning, as a controllable learner allows for dynamic model adaptation to task requirements during testing. However, existing research lacks a comprehensive understanding of how to effectively measure and analyze the generalization performance of controllable learning methods. In an attempt to move towards this goal from a generalization perspective, we first establish a unified framework for controllable learning. Then, we develop a novel vector-contraction inequality and derive a tight generalization bound for general controllable learning classes, which is independent of the number of task targets except for logarithmic factors and represents the current best-in-class theoretical result. Furthermore, we derive generalization bounds for two typical controllable learning methods: embedding-based and hypernetwork-based methods. We also upper bound the Rademacher complexities of commonly used control and prediction functions, which serve as modular theoretical components for deriving generalization bounds for specific controllable learning methods in practical applications such as recommender systems. Our theoretical results without strong assumptions provide general theoretical guarantees for controllable learning methods and offer new insights into understanding controllability in machine learning.
Paperid:2876
Authors:Zipeng Ji · Guanghui Zhu · Chunfeng Yuan · Yihua Huang
Abstract:
LLM-to-NAS is a promising field at the intersection of Large Language Models (LLMs) and Neural Architecture Search (NAS), as recent research has explored the potential of architecture generation leveraging LLMs on multiple search spaces. However, the existing LLM-to-NAS methods face the challenges of limited search spaces, time-costly search, and uncompetitive performance across standard NAS benchmarks and multiple downstream tasks. In this work, we propose the Reflective Zero-cost NAS (RZ-NAS) method that can search NAS architectures with humanoid reflections and training-free metrics to elicit the power of LLMs. We rethink LLMs’ roles in NAS in existing work and design a structured, prompt-based approach to comprehensively understand the search tasks and architectures from both text and code levels. By integrating LLM reflection modules, we use LLM-generated feedback to provide linguistic guidance within architecture optimization. RZ-NAS enables effective search within both micro and macro search spaces without extensive time cost, achieving SOTA performance across multiple downstream tasks.
Paperid:2877
Authors:Xinrui Wang · Shao-Yuan Li · Jiaqiang Zhang · Songcan Chen
Abstract:
Multi-Label Online Continual Learning (MOCL) requires models to learn continuously from endless multi-label data streams, facing complex challenges including persistent catastrophic forgetting, potential missing labels, and uncontrollable imbalanced class distributions. While existing MOCL methods attempt to address these challenges through various techniques, \textit{they all overlook label-specific region identification and feature learning} - a fundamental solution rooted in multi-label learning but challenging to achieve in the online setting with incremental and partial supervision. To this end, we first leverage the inherent structural information of input data to evaluate and verify the innate localization capability of different pre-trained models. Then, we propose CUTER (CUT-out-and-Experience-Replay), a simple yet versatile strategy that provides fine-grained supervision signals by further identifying, strengthening and cutting out label-specific regions for efficient experience replay. It not only enables models to simultaneously address catastrophic forgetting, missing labels, and class imbalance challenges, but also serves as an orthogonal solution that seamlessly integrates with existing approaches. Extensive experiments on multiple multi-label image benchmarks demonstrate the superiority of our proposed method.
Paperid:2878
Authors:Subham Sekhar Sahoo · Justin Deschenaux · Aaron Gokaslan · Guanghan Wang · Justin Chiu · Volodymyr Kuleshov
Abstract:
Discrete diffusion models have been demonstrated to be surprisingly strong language models. In this work, we show that discrete diffusion language models can be further improved by adapting methods from continuous-state diffusion models. We establish a core property of uniform state diffusion: it stems from an underlying Gaussian diffusion process. This property allows us to improve both training, by utilizing a curriculum learning strategy that reduces training variance and leads to $\mathbf{2\times}$ faster convergence, and sampling, by adapting efficient distillation methods from continuous-state diffusion models. As a result, our models surpass an autoregressive model’s zero-shot perplexity on 3 out of 7 benchmarks, and we manage to reduce the sampling steps by **two orders of magnitude** while preserving sample quality.
Paperid:2879
Authors:Hongliang Lu · Zhonglin Xie · Yaoyu Wu · Can Ren · Yuxuan Chen · Zaiwen Wen
Abstract:
Despite the rapid development of large language models (LLMs), a fundamental challenge persists: the lack of high-quality optimization modeling datasets hampers LLMs' robust modeling of practical optimization problems from natural language descriptions (NL). This data scarcity also contributes to the generalization difficulties experienced by learning-based methods. To address these challenges, we propose a scalable framework for synthesizing a high-quality dataset, named OptMATH. Starting from curated seed data with mathematical formulations (MF), this framework automatically generates problem data (PD) with controllable complexity. Then, a back-translation step is employed to obtain NL. To verify the correspondence between the NL and the PD, a forward modeling step followed by rejection sampling is used. The accepted pairs constitute the training part of OptMATH. Then a collection of rejected pairs is identified and further filtered. This collection serves as a new benchmark for optimization modeling, containing difficult instances whose lengths are much longer than those of NL4OPT and MAMO. Through extensive experiments, we demonstrate that models of various sizes (0.5B-32B parameters) trained on OptMATH achieve superior results on multiple modeling benchmarks, thereby validating the effectiveness and scalability of our approach.
Paperid:2880
Authors:Wenlong Wan · Weiying Zheng · Tianyi Xiang · Guiqing Li · Shengfeng He
Abstract:
We introduce the task of Audible Action Temporal Localization, aimed at pinpointing the spatiotemporal coordinates of audible movements. This task diverges from conventional ones such as action recognition and temporal action localization, which broadly analyze relationships in video content but overlook the distinct kinematic dynamics associated with actions. Our work leverages the premise that key actions are rooted in inflectional movements; for instance, collisions leading to sound are characterized by sudden changes in motion. To capture this, we propose $TA^{2}Net$, a novel architecture that employs an inflectional flow estimation technique based on motion's second derivative to independently ascertain collision timings without relying on audio input. Additionally, $TA^{2}Net$ integrates a self-supervised spatial localization strategy during training, marrying contrastive learning with spatial analysis insights. This dual approach not only boosts temporal localization accuracy but also precisely locates the sound source within video frames as a secondary capability. To bridge the dataset gap in this area, we compiled a new benchmark dataset, $Audible623$, derived from Kinetics and UCF101 datasets by excluding non-essential vocalization subsets. Extensive experimental validation affirms the efficacy of our methods within our new benchmark dataset and demonstrates their strong generalizability across diverse, unseen datasets, such as those in repetitive counting and sound source localization.
Paperid:2881
Authors:Eric Frankel · Sitan Chen · Jerry Li · Pang Wei Koh · Lillian Ratliff · Sewoong Oh
Abstract:
Diffusion models (DMs) create samples from a data distribution by starting from random noise and iteratively solving a reverse-time ordinary differential equation (ODE). Because each step in the iterative solution requires an expensive neural function evaluation (NFE), there has been significant interest in approximately solving these diffusion ODEs with only a few NFEs without modifying the underlying model. However, in the few NFE regime, we observe that tracking the true ODE evolution is fundamentally impossible using traditional ODE solvers. In this work, we propose a new method that learns a good solver for the DM, which we call **S**olving **for** the **S**olver (**S4S**). S4S directly optimizes a solver to obtain good generation quality by learning to match the output of a strong teacher solver. We evaluate S4S on six different pre-trained DMs, including pixel-space and latent-space DMs for both conditional and unconditional sampling. In all settings, S4S uniformly improves the sample quality relative to traditional ODE solvers. Moreover, our method is lightweight, data-free, and can be plugged in black-box on top of any discretization schedule or architecture to improve performance. Building on top of this, we also propose **S4S-Alt**, which optimizes both the solver and the discretization schedule. By exploiting the full design space of DM solvers, with 5 NFEs, we achieve an FID of 3.73 on CIFAR10 and 13.26 on MS-COCO, representing a $1.5\times$ improvement over previous training-free ODE methods.
Paperid:2882
Authors:Yuchen Wang · Hongjue Zhao · Haohong Lin · Enze Xu · Lifang He · Huajie Shao
Abstract:
This work aims to address the problem of long-term dynamic forecasting in complex environments where data are noisy and irregularly sampled. While recent studies have introduced some methods to improve prediction performance, these approaches still face a significant challenge in handling long-term extrapolation tasks under such complex scenarios. To overcome this challenge, we propose Phy-SSM, a general-purpose framework that integrates partial physics knowledge into state space models (SSMs) for long-term dynamics forecasting in complex environments. Our motivation is that SSMs can effectively capture long-range dependencies in sequential data and model continuous dynamical systems, while the incorporation of physics knowledge improves generalization ability. The key challenge lies in how to seamlessly incorporate partially known physics into SSMs. To achieve this, we decompose partially known system dynamics into known and unknown state matrices, which are integrated into a Phy-SSM unit. To further enhance long-term prediction performance, we introduce a physics state regularization term to make the estimated latent states align with system dynamics. Besides, we theoretically analyze the uniqueness of the solutions for our method. Extensive experiments on three real-world applications, including vehicle motion prediction, drone state prediction, and COVID-19 epidemiology forecasting, demonstrate the superior performance of Phy-SSM over the baselines in both long-term interpolation and extrapolation tasks. The source code will be publicly available upon publication.
Paperid:2883
Authors:Yue Meng · Chuchu Fan
Abstract:
Learning to solve complex tasks with signal temporal logic (STL) specifications is crucial to many real-world applications. However, most previous works only consider fixed or parametrized STL specifications due to the lack of a diverse STL dataset and encoders to effectively extract temporal logic information for downstream tasks. In this paper, we propose TeLoGraF, Temporal Logic Graph-encoded Flow, which utilizes Graph Neural Networks (GNN) encoder and flow-matching to learn solutions for general STL specifications. We identify four commonly used STL templates and collect a total of 200K specifications with paired demonstrations. We conduct extensive experiments in five simulation environments ranging from simple dynamical models in the 2D space to high-dimensional 7DoF Franka Panda robot arm and Ant quadruped navigation. Results show that our method outperforms other baselines in the STL satisfaction rate. Compared to classical STL planning algorithms, our approach is 10-100X faster in inference and can work on any system dynamics. Besides, we show our graph-encoding method's capability to solve complex STLs and robustness to out-of-distribution STL specifications.
Paperid:2884
Authors:Lukas Thede · Karsten Roth · Matthias Bethge · Zeynep Akata · Thomas Hartvigsen
Abstract:
Keeping large language models factually up-to-date is crucial for deployment, yet costly retraining remains a challenge. Knowledge editing offers a promising alternative, but methods have only been tested on small-scale or synthetic edit benchmarks. In this work, we aim to bridge research into lifelong knowledge editing to real-world edits at practically relevant scale. We first introduce \texttt{WikiBigEdit}; a large-scale benchmark of real-world Wikidata edits, built to extend automatically over time for future-proof benchmarking. In its first instance, it includes over 500K question-answer pairs for knowledge editing alongside a comprehensive evaluation pipeline. Finally, we use \texttt{WikiBigEdit} to study existing knowledge editing techniques' ability to incorporate large volumes of real-world facts and contrast their capabilities to generic modification techniques such as retrieval augmentation and continual finetuning to acquire a complete picture of the practical extent of current lifelong knowledge editing.
Paperid:2885
Authors:Aleksandr Kolomeitsev · ANH-HUY PHAN
Abstract:
Topology optimization plays a crucial role in designing efficient and manufacturable structures. Traditional methods often yield freeform voids that, although providing design flexibility, introduce significant manufacturing challenges and require extensive post-processing. Conversely, feature-mapping topology optimization reduces post-processing efforts by constructing topologies using predefined geometric features. Nevertheless, existing approaches are significantly constrained by the limited set of geometric features available, the variety of parameters that each type of geometric feature can possess, and the necessity of employing differentiable signed distance functions. In this paper, we present a novel method that combines Neural Heaviside Signed Distance Functions (Heaviside SDFs) with structured latent shape representations to generate manufacturable voids directly within the optimization framework. Our architecture incorporates encoder and decoder networks to effectively approximate the Heaviside function and facilitate optimization within a unified latent space, thus addressing the feature diversity limitations of current feature-mapping techniques. Experimental results validate the effectiveness of our approach in balancing structural compliance, offering a new pathway to CAD-integrated design with minimal human intervention.
Paperid:2886
Authors:Masha Naslidnyk · Siu Lun Chau · Francois-Xavier Briol · Krikamol Muandet
Abstract:
Embedding probability distributions into reproducing kernel Hilbert spaces (RKHS) has enabled powerful nonparametric methods such as the maximum mean discrepancy (MMD), a widely used statistical distance with strong theoretical and computational properties. At its core, MMD relies on kernel mean embeddings (KMEs) to represent distributions as a mean function in the RKHS. However, it remains unclear if the mean function is the only meaningful RKHS representation. Inspired by generalised quantiles, we introduce the notion of kernel quantile embeddings (KQEs) along with a consistent estimator. We then use KQEs to construct a novel family of distances that: (i) are probability metrics under weaker kernel conditions than MMD; (ii) recover a kernelised form of the sliced Wasserstein distance, bridging the gap between the two frameworks; and (iii) can be efficiently estimated with near-linear cost. Through hypothesis testing, we show that these distances offer a competitive alternative to MMD and its fast approximations. Our findings demonstrate the value of representing distributions in Hilbert space beyond simple mean functions, paving the way for new avenues of research.
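For context on the baseline this abstract builds on: the MMD between two samples has a standard unbiased plug-in estimator. The sketch below is a minimal pure-Python version with a Gaussian kernel on 1-D samples (illustrative only; it shows the classical MMD, not the kernel quantile embeddings proposed in the paper):

```python
import math

def gauss_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel on scalars."""
    return math.exp(-((x - y) ** 2) / (2 * bandwidth ** 2))

def mmd2_unbiased(xs, ys, bandwidth=1.0):
    """Unbiased estimate of squared MMD between two 1-D samples xs, ys."""
    n, m = len(xs), len(ys)
    # within-sample terms exclude the diagonal (i == j) for unbiasedness
    k_xx = sum(gauss_kernel(a, b, bandwidth)
               for i, a in enumerate(xs) for j, b in enumerate(xs) if i != j)
    k_yy = sum(gauss_kernel(a, b, bandwidth)
               for i, a in enumerate(ys) for j, b in enumerate(ys) if i != j)
    k_xy = sum(gauss_kernel(a, b, bandwidth) for a in xs for b in ys)
    return k_xx / (n * (n - 1)) + k_yy / (m * (m - 1)) - 2 * k_xy / (n * m)
```

Samples from well-separated distributions yield a large value, while near-identical samples yield a value near zero; KQE-based distances replace the single mean embedding underlying this estimator with a family of quantile embeddings.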
Paperid:2887
Authors:Jijia Liu · Feng Gao · Qingmin Liao · Chao Yu · Yu Wang
Abstract:
Reinforcement learning (RL) for continuous control often requires large amounts of online interaction data. Value-based RL methods can mitigate this burden by offering relatively high sample efficiency. Some studies further enhance sample efficiency by incorporating offline demonstration data to “kick-start” training, achieving promising results in continuous control. However, they typically compute the Q-function independently for each action dimension, neglecting interdependencies and making it harder to identify optimal actions when learning from suboptimal data, such as non-expert demonstration and online-collected data during the training process. To address these issues, we propose Auto-Regressive Soft Q-learning (ARSQ), a value-based RL algorithm that models Q-values in a coarse-to-fine, auto-regressive manner. First, ARSQ decomposes the continuous action space into discrete spaces in a coarse-to-fine hierarchy, enhancing sample efficiency for fine-grained continuous control tasks. Next, it auto-regressively predicts dimensional action advantages within each decision step, enabling more effective decision-making in continuous control tasks. We evaluate ARSQ on two continuous control benchmarks, RLBench and D4RL, integrating demonstration data into online training. On D4RL, which includes non-expert demonstrations, ARSQ achieves an average 1.62× performance improvement over the SOTA value-based baseline. On RLBench, which incorporates expert demonstrations, ARSQ surpasses various baselines, demonstrating its effectiveness in learning from suboptimal online-collected data.
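The coarse-to-fine decomposition of a continuous action space can be made concrete with a small sketch. This is not the ARSQ algorithm itself, only an illustration of the binning idea, with hypothetical bin counts (`coarse=4`, `fine=4`) chosen for the example:

```python
def encode(action, low=-1.0, high=1.0, coarse=4, fine=4):
    """Map a continuous action to a (coarse_bin, fine_bin) index pair."""
    action = min(max(action, low), high)          # clip to the action range
    span = (high - low) / coarse                  # width of one coarse bin
    c = min(int((action - low) / span), coarse - 1)
    within = action - (low + c * span)            # offset inside the coarse bin
    f = min(int(within / (span / fine)), fine - 1)
    return c, f

def decode(c, f, low=-1.0, high=1.0, coarse=4, fine=4):
    """Return the centre of the selected (coarse, fine) cell."""
    span = (high - low) / coarse
    return low + c * span + (f + 0.5) * (span / fine)
```

With 4 coarse and 4 fine bins, an action in [-1, 1] is recovered to within half a cell width (here 0.0625); a value head over such discrete bins, predicted auto-regressively per dimension, is the kind of structure the abstract describes.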
Paperid:2888
Authors:Xuetong Li · Danyang Huang · Hansheng Wang
Abstract:
We study in this work the problem of multi-class logistic regression with one major class and multiple rare classes, which is motivated by a real application in TikTok live stream data. The model is inspired by the two-class logistic regression model of Wang (2020) but with surprising theoretical findings, which in turn motivate new estimation methods with excellent statistical and computational efficiency. Specifically, since rigorous theoretical analysis suggests that the resulting maximum likelihood estimators of different rare classes should be asymptotically independent, we consider solving multiple pairwise two-class logistic regression problems instead of optimizing the joint log-likelihood function, which poses computational challenges in the multi-class problem; the pairwise problems are computationally much easier and can be solved in a fully parallel way. To further reduce the computation cost, a subsample-based pairwise likelihood estimator is developed by down-sampling the major class. We show rigorously that the resulting estimators can be as asymptotically efficient as the global maximum likelihood estimator under appropriate regularity conditions. Extensive simulation studies are presented to support our theoretical findings and a TikTok live stream dataset is analyzed for illustration purposes.
Paperid:2889
Authors:Laurens Devos · Timo Martens · Deniz Oruc · Hendrik Blockeel · Wannes Meert · Jesse Davis
Abstract:
Tree ensembles (e.g., gradient boosting decision trees) are often used in practice because they offer excellent predictive performance while still being easy and efficient to learn. In some contexts, it is important to additionally optimize their size: this is specifically the case when models need to have verifiable properties (verification of fairness, robustness, etc. is often exponential in the ensemble's size), or when models run on battery-powered devices (smaller ensembles consume less energy, increasing battery autonomy). For this reason, compression of tree ensembles is worth studying. This paper presents LOP, a method for compressing a given tree ensemble by pruning or entirely removing trees in it, while updating leaf predictions in such a way that predictive accuracy is mostly unaffected. Empirically, LOP achieves compression factors that are often 10 to 100 times better than that of competing methods.
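A much simpler cousin of this idea can be sketched to show what "remove trees, then update predictions" means in practice. This is not LOP (which updates individual leaf values); it greedily drops whole trees and refits only a single scalar on the remaining ensemble output, with trees represented abstractly by their per-example predictions:

```python
def greedy_prune(tree_preds, y, keep):
    """Greedily remove trees until `keep` remain, then refit a scalar multiplier.

    tree_preds[t][i] is the prediction of tree t on example i (an additive
    ensemble predicts their sum); y is the regression target.
    """
    active = list(range(len(tree_preds)))
    total = [sum(tree_preds[t][i] for t in active) for i in range(len(y))]
    while len(active) > keep:
        best_t, best_err = None, None
        for t in active:  # try removing each tree; keep the least-damaging removal
            err = sum((y[i] - (total[i] - tree_preds[t][i])) ** 2
                      for i in range(len(y)))
            if best_err is None or err < best_err:
                best_t, best_err = t, err
        active.remove(best_t)
        total = [total[i] - tree_preds[best_t][i] for i in range(len(y))]
    # least-squares refit of one scalar on the pruned ensemble's output
    num = sum(y[i] * total[i] for i in range(len(y)))
    den = sum(total[i] ** 2 for i in range(len(y))) or 1.0
    return active, num / den
```

Even this crude version illustrates why post-pruning adjustment matters: without the refit step, removing trees systematically biases the ensemble's output downward.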
Paperid:2890
Authors:Andi Han · Pierre-Louis Poirion · Akiko Takeda
Abstract:
Optimization with orthogonality constraints frequently arises in various fields such as machine learning. Riemannian optimization offers a powerful framework for solving these problems by equipping the constraint set with a Riemannian manifold structure and performing optimization intrinsically on the manifold. This approach typically involves computing a search direction in the tangent space and updating variables via a retraction operation. However, as the size of the variables increases, the computational cost of the retraction can become prohibitively high, limiting the applicability of Riemannian optimization to large-scale problems. To address this challenge and enhance scalability, we propose a novel approach that restricts each update on a random submanifold, thereby significantly reducing the per-iteration complexity. We introduce two sampling strategies for selecting the random submanifolds and theoretically analyze the convergence of the proposed methods. We provide convergence results for general nonconvex functions and functions that satisfy Riemannian Polyak–Łojasiewicz condition as well as for stochastic optimization settings. Additionally, we demonstrate how our approach can be generalized to quotient manifolds derived from the orthogonal manifold. Extensive experiments verify the benefits of the proposed method, across a wide variety of problems.
Paperid:2891
Authors:Aidan Curtis · Eric Li · Michael S Noseworthy · Nishad Gothoskar · Sachin Chitta · Hui Li · Leslie Kaelbling · Nicole Carey
Abstract:
Domain randomization in reinforcement learning is an established technique for increasing the robustness of control policies learned in simulation. By randomizing properties of the environment during training, the learned policy can be robust to uncertainty along the randomized dimensions. While the environment distribution is typically specified by hand, in this paper we investigate the problem of automatically discovering this sampling distribution via entropy-regularized reward maximization of a neural sampling distribution in the form of a normalizing flow. We show that this architecture is more flexible and results in better robustness than existing approaches to learning simple parameterized sampling distributions. We demonstrate that these policies can be used to learn robust policies for contact-rich assembly tasks. Additionally, we explore how these sampling distributions, in combination with a privileged value function, can be used for out-of-distribution detection in the context of an uncertainty-aware multi-step manipulation planner.
Paperid:2892
Authors:Arik Reuter · Tim G. J. Rudner · Vincent Fortuin · David Rügamer
Abstract:
Transformers have emerged as the dominant architecture in the field of deep learning, with a broad range of applications and remarkable in-context learning (ICL) capabilities. While not yet fully understood, ICL has already proved to be an intriguing phenomenon, allowing transformers to learn in context—without requiring further training. In this paper, we further advance the understanding of ICL by demonstrating that transformers can perform full Bayesian inference for commonly used statistical models in context. More specifically, we introduce a general framework that builds on ideas from prior fitted networks and continuous normalizing flows and enables us to infer complex posterior distributions for models such as generalized linear models and latent factor models. Extensive experiments on real-world datasets demonstrate that our ICL approach yields posterior samples that are similar in quality to state-of-the-art MCMC or variational inference methods that do not operate in context. The source code for this paper is available at https://anonymous.4open.science/r/ICLforFullBayesianInference-A8D1
Paperid:2893
Authors:Uri Sherman · Tomer Koren · Yishay Mansour
Abstract:
Modern policy optimization methods roughly follow the policy mirror descent (PMD) algorithmic template, for which there are by now numerous theoretical convergence results. However, most of these either target tabular environments, or can be applied effectively only when the class of policies being optimized over satisfies strong closure conditions, which is typically not the case when working with parametric policy classes in large-scale environments. In this work, we develop a theoretical framework for PMD for general policy classes where we replace the closure conditions with a generally weaker variational gradient dominance assumption, and obtain upper bounds on the rate of convergence to the best-in-class policy. Our main result leverages a novel notion of smoothness with respect to a local norm induced by the occupancy measure of the current policy, and casts PMD as a particular instance of smooth non-convex optimization in non-Euclidean space.
Paperid:2894
Authors:Rustem Islamov · Yarden As · Ilyas Fatkhullin
Abstract:
Federated learning faces severe communication bottlenecks due to the high dimensionality of model updates. Communication compression with contractive compressors (e.g., Top-$K$) is often preferable in practice but can degrade performance without proper handling. Error feedback (EF) mitigates such issues but has been largely restricted to smooth, unconstrained problems, limiting its real-world applicability where non-smooth objectives and safety constraints are critical. We advance our understanding of EF in the canonical non-smooth convex setting by establishing new lower complexity bounds for first-order algorithms with contractive compression. Next, we propose Safe-EF, a novel algorithm that matches our lower bound (up to a constant) while enforcing safety constraints essential for practical applications. Extending our approach to the stochastic setting, we bridge the gap between theory and practical implementation. Extensive experiments in a reinforcement learning setup, simulating distributed humanoid robot training, validate the effectiveness of Safe-EF in ensuring safety and reducing communication complexity.
Paperid:2895
Authors:Jiawen Wang · Yinda Chen · Xiaoyu Liu · che liu · Dong Liu · Jianqing Gao · Zhiwei Xiong
Abstract:
Recent works have correlated Masked Image Modeling (MIM) with consistency regularization in Unsupervised Domain Adaptation (UDA). However, they merely treat masking as a special form of deformation on the input images and neglect the theoretical analysis, which leads to a superficial understanding of masked reconstruction and insufficient exploitation of its potential in enhancing feature extraction and representation learning. In this paper, we reframe masked reconstruction as a sparse signal reconstruction problem and theoretically prove that the dual form of complementary masks possesses superior capabilities in extracting domain-agnostic image features. Based on this compelling insight, we propose MaskTwins, a simple yet effective UDA framework that integrates masked reconstruction directly into the main training pipeline. MaskTwins uncovers intrinsic structural patterns that persist across disparate domains by enforcing consistency between predictions of images masked in complementary ways, enabling domain generalization in an end-to-end manner. Extensive experiments verify the superiority of MaskTwins over baseline methods in natural and biological image segmentation. These results demonstrate the significant advantages of MaskTwins in extracting domain-invariant features without the need for separate pre-training, offering a new paradigm for domain-adaptive segmentation.
Paperid:2896
Authors:Yan Liu · Mingjie Chen · Chaojie Ji · Hao Zhang · Ruxin Wang
Abstract:
Recently, the expansion of Variance Reduction (VR) to Riemannian stochastic non-convex optimization has attracted increasing interest. Inspired by recursive momentum, we first introduce the Stochastic Recursive Variance Reduced Gradient (SRVRG) algorithm and further present the Stochastic Recursive Gradient Estimator (SRGE) in Euclidean spaces, which unifies the prevailing variance reduction estimators. We then extend SRGE to Riemannian spaces, resulting in a unified Stochastic rEcursive vaRiance reducEd gradieNt frAmework (SERENA) for Riemannian non-convex optimization. This framework includes the proposed R-SRVRG, R-SVRRM, and R-Hybrid-SGD methods, as well as other existing Riemannian VR methods. Furthermore, we establish a unified theoretical analysis for Riemannian non-convex optimization under retraction and vector transport. The IFO complexity of our proposed R-SRVRG and R-SVRRM to converge to an $\varepsilon$-accurate solution is $\mathcal{O}\left(\min\{n^{1/2}\varepsilon^{-2}, \varepsilon^{-3}\}\right)$ in the finite-sum setting and $\mathcal{O}\left(\varepsilon^{-3}\right)$ for the online case, both of which align with the lower IFO complexity bound. Experimental results indicate that the proposed algorithms surpass other existing Riemannian optimization methods.
Paperid:2897
Authors:Thalaiyasingam Ajanthan · Sameera Ramasinghe · Yan Zuo · Gil Avraham · Alexander Long
Abstract:
Pipeline Parallelism (PP) enables large neural network training on small, interconnected devices by splitting the model into multiple stages. To maximize pipeline utilization, asynchronous optimization is appealing as it offers 100% pipeline utilization by construction. However, it is inherently challenging as the weights and gradients are no longer synchronized, leading to stale (or delayed) gradients. To alleviate this, we introduce a variant of Nesterov Accelerated Gradient (NAG) for asynchronous optimization in PP. Specifically, we modify the lookahead step in NAG to effectively address the staleness in gradients. We theoretically prove that our approach converges at a sublinear rate in the presence of fixed delay in gradients. Our experiments on large-scale language modelling tasks using decoder-only architectures with up to 1B parameters demonstrate that our approach significantly outperforms existing asynchronous methods, even surpassing the synchronous baseline.
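The staleness problem the abstract describes is easy to reproduce in miniature. The sketch below (a generic illustration, not the paper's modified lookahead, whose exact form is not given in the abstract) runs standard NAG on the quadratic $f(x) = \tfrac{1}{2}x^2$ while evaluating gradients at a lookahead point from a fixed number of steps ago:

```python
# NAG with a fixed gradient delay on f(x) = 0.5 * x^2.
# A toy illustration of stale gradients in asynchronous pipeline training;
# the paper's specific lookahead modification is not reproduced here.
def delayed_nag(x0, lr=0.05, momentum=0.5, delay=1, steps=300):
    x, v = x0, 0.0
    history = []                                   # lookahead points, oldest first
    for _ in range(steps):
        history.append(x + momentum * v)           # standard NAG lookahead point
        stale = history[max(0, len(history) - 1 - delay)]
        g = stale                                  # gradient of 0.5*x^2 is x itself
        v = momentum * v - lr * g
        x = x + v
    return x
```

With a small learning rate, a fixed delay leaves the iteration stable and convergent, which matches the abstract's sublinear-convergence claim in spirit; large delays or learning rates would destabilize it.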
Paperid:2898
Authors:Jing Wang · Yu-Jie Zhang · Peng Zhao · Zhi-Hua Zhou
Abstract:
We study stochastic linear bandits with heavy-tailed noise. Two principled strategies for handling heavy-tailed noise, truncation and median-of-means, have been introduced to heavy-tailed bandits. Nonetheless, these methods rely on specific noise assumptions or bandit structures, limiting their applicability to general settings. The recent work [Huang et al., 2024] develops a soft truncation method via adaptive Huber regression to address these limitations. However, their method suffers from an undesired computational cost: it requires storing all historical data and performing a full pass over these data at each round. In this paper, we propose a \emph{one-pass} algorithm based on the online mirror descent framework. Our method updates using only current data at each round, reducing the per-round computational cost from $\mathcal{O}(t \log T)$ to $\mathcal{O}(1)$ with respect to the current round $t$ and the time horizon $T$, and achieves a near-optimal and variance-aware regret of order $\widetilde{\mathcal{O}}\big(d T^{\frac{1-\varepsilon}{2(1+\varepsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\varepsilon}{2(1+\varepsilon)}}\big)$, where $d$ is the dimension and $\nu_t^{1+\varepsilon}$ is the $(1+\varepsilon)$-th central moment of the reward at round $t$.
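The key computational point, a per-round update that touches only the current observation, can be illustrated with a generic Huber-style stochastic gradient step for robust linear regression (a simplification, not the paper's bandit algorithm; all names are hypothetical). The Huber gradient caps the influence of any single heavy-tailed observation while keeping memory and per-round cost constant:

```python
def huber_grad(residual, tau):
    """Gradient of the Huber loss w.r.t. the residual: linear inside
    [-tau, tau], clipped outside, capping the influence of heavy-tailed noise."""
    if abs(residual) <= tau:
        return residual
    return tau if residual > 0 else -tau

def one_pass_step(theta, x, y, tau=1.0, lr=0.2):
    """One O(1)-memory update using only the current observation (x, y),
    in contrast to methods that re-fit on all stored history each round."""
    pred = sum(t * xi for t, xi in zip(theta, x))
    g = huber_grad(pred - y, tau)
    return [t - lr * g * xi for t, xi in zip(theta, x)]
```

Streaming a sequence of (x, y) pairs through `one_pass_step` recovers the underlying linear parameter even when a fraction of rewards are wild outliers, since each outlier's gradient is clipped at `tau`.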
Paperid:2899
Authors:Wangkai Li · Rui Sun · Bohao Liao · Zhaoyang Li · Tianzhu Zhang
Abstract:
Unsupervised domain adaptation (UDA) for semantic segmentation aims to transfer knowledge from a labeled source domain to an unlabeled target domain. Despite the effectiveness of self-training techniques in UDA, they struggle to learn each class in a balanced manner due to inherent class imbalance and distribution shift in both data and label space between domains. To address this issue, we propose Balanced Learning for Domain Adaptation (BLDA), a novel approach to directly assess and alleviate class bias without requiring prior knowledge about the distribution shift. First, we identify over-predicted and under-predicted classes by analyzing the distribution of predicted logits. Subsequently, we introduce a post-hoc approach to align the logits distributions across different classes using shared anchor distributions. To further consider the network's need to generate unbiased pseudo-labels during self-training, we estimate logits distributions online and incorporate logits correction terms into the loss function. Moreover, we leverage the resulting cumulative density as domain-shared structural knowledge to connect the source and target domains. Extensive experiments on two standard UDA semantic segmentation benchmarks demonstrate that BLDA consistently improves performance, especially for under-predicted classes, when integrated into various existing methods.
Paperid:2900
Authors:Maximilian Graf · Victor Thuot · Nicolas Verzelen
Abstract:
We study the problem of clustering a set of items based on bandit feedback. Each of the $n$ items is characterized by a feature vector, with a possibly large dimension $d$. The items are partitioned into two unknown groups, such that items within the same group share the same feature vector. We consider a sequential and adaptive setting in which, at each round, the learner selects one item and one feature, then observes a noisy evaluation of the item's feature. The learner's objective is to recover the correct partition of the items, while keeping the number of observations as small as possible. We provide an algorithm which relies on finding a relevant feature for the clustering task, leveraging the Sequential Halving algorithm. With probability at least $1-\delta$, we obtain an accurate recovery of the partition and derive an upper bound on the required budget. Furthermore, we derive an instance-dependent lower bound, which is tight in some relevant cases.
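Sequential Halving, the subroutine the algorithm leverages, is a standard best-arm identification procedure: the sampling budget is split across rounds, every surviving candidate is sampled equally, and the worse half is eliminated each round. A minimal sketch (generic best-arm version, not the paper's feature-selection variant):

```python
import math

def sequential_halving(n_arms, budget, pull):
    """Best-arm identification with a fixed pull budget: split the budget over
    ceil(log2 n) rounds, sample each surviving arm equally, keep the better half."""
    arms = list(range(n_arms))
    rounds = max(1, math.ceil(math.log2(n_arms)))
    for _ in range(rounds):
        t = max(1, budget // (len(arms) * rounds))    # pulls per arm this round
        means = {a: sum(pull(a) for _ in range(t)) / t for a in arms}
        arms = sorted(arms, key=means.__getitem__, reverse=True)[: max(1, len(arms) // 2)]
    return arms[0]
```

Because later rounds have fewer surviving arms, each survivor receives more pulls per round, concentrating the budget on the hardest remaining comparisons.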
Paperid:2901
Authors:Susan Liang · Dejan Markovic · Israel D. Gebru · Steven Krenn · Todd Keebler · Jacob Sandakly · Frank Yu · Samuel Hassel · Chenliang Xu · Alexander Richard
Abstract:
Binaural rendering aims to synthesize binaural audio that mimics natural hearing based on a mono audio and the locations of the speaker and listener. Although many methods have been proposed to solve this problem, they struggle with rendering quality and streamable inference. Synthesizing high-quality binaural audio that is indistinguishable from real-world recordings requires precise modeling of binaural cues, room reverb, and ambient sounds. Additionally, real-world applications demand streaming inference. To address these challenges, we propose a flow matching based streaming binaural speech synthesis framework called BinauralFlow. We consider binaural rendering to be a generation problem rather than a regression problem and design a conditional flow matching model to render high-quality audio. Moreover, we design a causal U-Net architecture that estimates the current audio frame solely based on past information to tailor generative models for streaming inference. Finally, we introduce a continuous inference pipeline incorporating streaming STFT/ISTFT operations, a buffer bank, a midpoint solver, and an early skip schedule to improve rendering continuity and speed. Quantitative and qualitative evaluations demonstrate the superiority of our method over SOTA approaches. A perceptual study further reveals that our model is nearly indistinguishable from real-world recordings, with a 42% confusion rate.
Paperid:2902
Authors:Zifan LIU · Xinran Li · Jun Zhang
Abstract:
Safe offline reinforcement learning (SORL) aims to develop policies that maximize cumulative rewards while satisfying safety constraints without the need for risky online interaction. However, existing SORL methods often struggle with the out-of-distribution (OOD) problem, leading to potentially unsafe and suboptimal policies. To address this issue, we first propose Constrained Implicit Q-learning (CIQL), a novel algorithm designed to avoid the OOD problem in SORL. In particular, CIQL expands the implicit update of reward value functions to constrained settings and then estimates cost value functions under the same implicit policy. Despite its advantages, the further performance improvement of CIQL is still hindered by the inaccurate discounted approximations of constraints. Thus, we further propose Constraint-Conditioned Implicit Q-learning (C2IQL). Building upon CIQL, C2IQL employs a cost reconstruction model to derive non-discounted cumulative costs from discounted values and incorporates a flexible, constraint-conditioned mechanism to accommodate dynamic safety constraints. Experiment results on DSRL benchmarks demonstrate the superiority of C2IQL compared to baseline methods in achieving higher rewards while guaranteeing safety constraints under different threshold conditions.
Paperid:2903
Authors:Venkatraman · Mohsin Hasan · Minsu Kim · Luca Scimeca · Marcin Sendera · Yoshua Bengio · Glen Berseth · Nikolay Malkin
Abstract:
Any well-behaved generative model over a variable $\mathbf{x}$ can be expressed as a deterministic transformation of an exogenous (*`outsourced'*) Gaussian noise variable $\mathbf{z}$: $\mathbf{x}=f_\theta(\mathbf{z})$. In such a model (e.g., a VAE, GAN, or continuous-time flow-based model), sampling of the target variable $\mathbf{x} \sim p_\theta(\mathbf{x})$ is straightforward, but sampling from a posterior distribution of the form $p(\mathbf{x}\mid\mathbf{y}) \propto p_\theta(\mathbf{x})r(\mathbf{x},\mathbf{y})$, where $r$ is a constraint function depending on an auxiliary variable $\mathbf{y}$, is generally intractable. We propose to amortize the cost of sampling from such posterior distributions with diffusion models that sample a distribution in the noise space ($\mathbf{z}$). These diffusion samplers are trained by reinforcement learning algorithms to enforce that the transformed samples $f_\theta(\mathbf{z})$ are distributed according to the posterior in the data space ($\mathbf{x}$). For many models and constraints of interest, the posterior in the noise space is smoother than the posterior in the data space, making it more amenable to such amortized inference. Our method enables conditional sampling under unconditional GAN, (H)VAE, and flow-based priors, comparing favorably both with current amortized and non-amortized inference methods. We demonstrate the proposed *outsourced diffusion sampling* in several experiments with large pretrained prior models: conditional image generation, reinforcement learning with human feedback, and protein structure generation.
Paperid:2904
Authors:Pablo Barcelo · Alexander Kozachinskiy · Tomasz Steifer
Abstract:
The notion of rank of a Boolean function has been a cornerstone in PAC learning, enabling quasi-polynomial-time learning algorithms for polynomial-size decision trees. We present a novel characterization of rank, grounded in the well-known Transformer architecture. We show that the rank of a function $f$ corresponds to the minimum number of Chain of Thought (CoT) steps required by a single-layer Transformer with hard attention to compute $f$. Based on this characterization we establish tight bounds on the number of CoT steps required for specific problems, showing that $\ell$-fold function composition necessitates exactly $\ell$ CoT steps. Furthermore, we analyze the problem of identifying the position of the $k$-th occurrence of 1 in a Boolean sequence, proving that it requires $k$ CoT steps.
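The $k$-th-occurrence problem has a natural sequential structure that mirrors the CoT bound: each step advances from the current position to the next occurrence of 1, so $k$ occurrences take exactly $k$ steps. A small sketch of that decomposition (purely illustrative, not the paper's Transformer construction):

```python
def kth_one_position(bits, k):
    """0-indexed position of the k-th occurrence of 1 in a Boolean sequence,
    computed in exactly k steps, each advancing to the next occurrence
    (mirroring one CoT step per occurrence)."""
    pos = -1
    for _ in range(k):
        pos = bits.index(1, pos + 1)   # raises ValueError if fewer than k ones
    return pos
```

Each loop iteration corresponds to one CoT step in the paper's characterization: the intermediate state (the current position) is exactly what a single hard-attention layer can carry forward between steps.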
Paperid:2905
Authors:Archiki Prasad · Weizhe Yuan · Richard Yuanzhe Pang · Jing Xu · Maryam Fazel-Zarandi · Mohit Bansal · Sainbayar Sukhbaatar · JASON WESTON · Jane Dwivedi-Yu
Abstract:
Self-alignment, whereby models learn to improve themselves without human annotation, is a rapidly growing research area. However, existing techniques often fail to improve complex reasoning tasks due to the difficulty of assigning correct rewards. An orthogonal approach that is known to improve correctness is self-consistency, a method applied at inference time based on multiple sampling in order to find the most consistent answer. In this work, we extend the self-consistency concept to help train models. We thus introduce self-consistency preference optimization (ScPO), which iteratively trains consistent answers to be preferred over inconsistent ones on unsupervised new problems. We show ScPO leads to large improvements over conventional reward model training on reasoning tasks such as GSM8K and MATH, closing the gap with supervised training with gold answers or preferences, and that combining ScPO with standard supervised learning improves results even further. On ZebraLogic, ScPO finetunes Llama-3 8B to be superior to Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.
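The self-consistency signal ScPO builds on can be sketched as majority voting over sampled answers, with the majority answer preferred over each minority answer to form training pairs. A minimal sketch (function name hypothetical; the paper's actual pipeline samples full reasoning traces from the model):

```python
from collections import Counter

def self_consistency_pairs(sampled_answers):
    """Majority (most consistent) answer plus (chosen, rejected) preference
    pairs pairing the majority answer against each minority answer.
    Ties are broken arbitrarily by Counter.most_common."""
    counts = Counter(sampled_answers)
    chosen, _ = counts.most_common(1)[0]
    pairs = [(chosen, a) for a in counts if a != chosen]
    return chosen, pairs
```

For example, five samples `["12", "12", "7", "12", "9"]` yield the consistent answer `"12"` and the pairs `("12", "7")` and `("12", "9")`, which can then feed a standard preference-optimization objective without any gold labels.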
Paperid:2906
Authors:Phillip Guo · Aaquib Syed · Abhay Sheshadri · Aidan Ewart · Gintare Karolina Dziugaite
Abstract:
Methods for knowledge editing and unlearning in large language models seek to edit or remove undesirable knowledge or capabilities without compromising general language modeling performance. This work investigates how mechanistic interpretability---which, in part, aims to identify model components (circuits) associated with specific interpretable mechanisms that make up a model capability---can improve the precision and effectiveness of editing and unlearning. We find a stark difference in unlearning and edit robustness when training components localized by different methods. We highlight an important distinction between methods that localize components based primarily on preserving outputs, and those finding high-level mechanisms with predictable intermediate states. In particular, localizing edits/unlearning to components associated with the lookup-table mechanism for factual recall 1) leads to more robust edits/unlearning across different input/output formats, and 2) resists attempts to relearn the unwanted information, while also reducing unintended side effects compared to baselines, on both a sports facts dataset and the CounterFact dataset across multiple models. We also find that certain localized edits disrupt the latent knowledge in the model more than any of the baselines, making unlearning more robust to various attacks.
Paperid:2907
Authors:Hanxun Huang · Sarah Erfani · Yige Li · Xingjun Ma · James Bailey
Abstract:
As Contrastive Language-Image Pre-training (CLIP) models are increasingly adopted for diverse downstream tasks and integrated into large vision-language models (VLMs), their susceptibility to adversarial perturbations has emerged as a critical concern. In this work, we introduce X-Transfer, a novel attack method that exposes a universal adversarial vulnerability in CLIP. X-Transfer generates a Universal Adversarial Perturbation (UAP) capable of deceiving various CLIP encoders and downstream VLMs across different samples, tasks, and domains. We refer to this property as super transferability—a single perturbation achieving cross-data, cross-domain, cross-model, and cross-task adversarial transferability simultaneously. This is achieved through surrogate scaling, a key innovation of our approach. Unlike existing methods that rely on fixed surrogate models, which are computationally intensive to scale, X-Transfer employs an efficient surrogate scaling strategy that dynamically selects a small subset of suitable surrogates from a large search space. Extensive evaluations demonstrate that X-Transfer significantly outperforms previous state-of-the-art UAP methods, establishing a new benchmark for adversarial transferability across CLIP models.
Paperid:2908
Authors:Tuomas Oikarinen · Ge Yan · Lily Weng
Abstract:
Understanding the function of individual units in a neural network is an important building block for mechanistic interpretability. This is often done by generating a simple text explanation of the behavior of individual neurons or units. For these explanations to be useful, we must understand how reliable and truthful they are. In this work we unify many existing explanation evaluation methods under one mathematical framework. This allows us to compare and contrast existing evaluation metrics, understand the evaluation pipeline with increased clarity, and apply existing statistical concepts to the evaluation. In addition, we propose two simple sanity checks on the evaluation metrics and show that many commonly used metrics fail these tests and do not change their score after massive changes to the concept labels. Based on our experimental and theoretical results, we propose guidelines that future evaluations should follow and identify a set of reliable evaluation metrics.
Paperid:2909
Authors:Jiawei Huang · Bingcong Li · Christoph Dann · Niao He
Abstract:
Sample efficiency is critical for online Reinforcement Learning from Human Feedback (RLHF). While existing works investigate sample-efficient online exploration strategies, the potential of utilizing misspecified yet relevant reward models to accelerate learning remains underexplored. This paper studies how to transfer knowledge from those imperfect reward models in online RLHF. We start by identifying a novel property of the KL-regularized RLHF objective: \emph{a policy's ability to cover the optimal policy is captured by its sub-optimality}. Building on this insight, we propose a theoretical transfer learning algorithm with provable benefits compared to standard online learning. Our approach achieves low regret in the early stage by quickly adapting to the best available source reward models without prior knowledge of their quality, and over time, it attains an $\tilde{O}(\sqrt{T})$ regret bound \emph{independent} of structural complexity measures. Inspired by our theoretical findings, we develop an empirical algorithm with improved computational efficiency, and demonstrate its effectiveness empirically in summarization tasks.
Paperid:2910
Authors:Dongya Jia · Zhuo Chen · Jiawei Chen · Chenpeng Du · Jian Wu · Jian Cong · Xiaobin Zhuang · Chumin Li · Zhen Wei · Yuping Wang · Yuxuan Wang
Abstract:
Several recent studies have attempted to autoregressively generate continuous speech representations without discrete speech tokens by combining diffusion and autoregressive models, yet they often face challenges with excessive computational loads or suboptimal outcomes. In this work, we propose Diffusion Transformer Autoregressive Modeling (DiTAR), a patch-based autoregressive framework combining a language model with a diffusion transformer. This approach significantly enhances the efficacy of autoregressive models for continuous tokens and reduces computational demands. DiTAR utilizes a divide-and-conquer strategy for patch generation, where the language model processes aggregated patch embeddings and the diffusion transformer subsequently generates the next patch based on the output of the language model. For inference, we propose defining temperature as the time point of introducing noise during the reverse diffusion ODE to balance diversity and determinism. We also show in an extensive scaling analysis that DiTAR has superb scalability. In zero-shot speech generation, DiTAR achieves state-of-the-art performance in robustness, speaker similarity, and naturalness.
Paperid:2911
Authors:Kai Yi · Kiarash Jamali · Sjors Scheres
Abstract:
The recent breakthrough of AlphaFold3 in modeling complex biomolecular interactions, including those between proteins and ligands, nucleotides, or metal ions, creates new opportunities for protein design. In so-called inverse protein folding, the objective is to find a sequence of amino acids that adopts a target protein structure. Many inverse folding methods struggle to predict sequences for complexes that contain non-protein components, and perform poorly with complexes that adopt multiple structural states. To address these challenges, we present ADFLIP (All-atom Discrete FLow matching Inverse Protein folding), a generative model based on discrete flow-matching for designing protein sequences conditioned on all-atom structural contexts. ADFLIP progressively incorporates predicted amino acid side chains as structural context during sequence generation and enables the design of dynamic protein complexes through ensemble sampling across multiple structural states. Furthermore, ADFLIP implements training-free classifier guidance sampling, which allows the incorporation of arbitrary pre-trained models to optimize the designed sequence for desired protein properties. We evaluated the performance of ADFLIP on protein complexes with small-molecule ligands, nucleotides, or metal ions, including dynamic complexes for which structure ensembles were determined by nuclear magnetic resonance (NMR). Our model achieves state-of-the-art performance in single-structure and multi-structure inverse folding tasks, demonstrating excellent potential for all-atom protein design.
Paperid:2912
Authors:Shengju Yu · Dong Zhibin · Siwei Wang · Suyuan Liu · KE LIANG · Xinwang Liu · Yue Liu · Yi Zhang
Abstract:
Current multi-view clustering (MVC) techniques generally focus only on the relationship between anchors and samples, while overlooking that between anchors. Moreover, due to the lack of data labels, the cluster order is inconsistent across views, so anchors encounter misalignment, which confuses the graph structure and disorganizes the cluster representation. Even worse, it typically introduces variance when forming the spectral embedding, degrading the stability of clustering results. In response to these concerns, in this paper we propose an MVC approach named DTP-SF-BVF. Concretely, we explicitly exploit the geometric properties between anchors via self-expression learning, and utilize a topology learning strategy to feed the captured anchor-anchor features into the anchor-sample graph so as to explore the manifold structure hidden within samples more adequately. To reduce the misalignment risk, we introduce a permutation mechanism for each view to jointly rearrange anchors according to the respective view characteristics. Besides avoiding the selection of a baseline view, this mechanism can also coordinate with the anchors in the unified framework and thereby facilitate anchor learning. Further, rather than forming the spectral embedding and then partitioning it, based on the criterion that sample-to-cluster assignments should be hard, we construct the cluster labels directly from the original samples using a binary strategy, not only preserving the data diversity but also avoiding variance. Experiments on multiple publicly available datasets confirm the effectiveness of the proposed DTP-SF-BVF method.
Paperid:2913
Authors:Biao Liu · Ning Xu · Xin Geng
Abstract:
Large Language Model (LLM) alignment aims to prevent models from producing content that misaligns with human expectations, which can lead to ethical and legal concerns. In the last few years, Reinforcement Learning from Human Feedback (RLHF) has been the most prominent method for achieving alignment. Due to challenges in stability and scalability with the RLHF stages, which arise from the complex interactions between multiple models, researchers are exploring alternative methods to achieve effects comparable to those of RLHF. However, these methods often rely on large, high-quality datasets. Although some methods consider generating additional data to expand datasets, they often treat model training and data generation as separate and static processes, overlooking the fact that these processes are highly interdependent, leading to inefficient utilization of the generated data. To deal with this problem, we propose {\ours}, i.e., Progressive Label Enhancement for LLM Alignment, a framework that dynamically adjusts the model’s training process based on the evolving quality of the generated data. Specifically, we prompt the model to generate responses for both the original query and a set of carefully designed principle-guided queries, and then utilize a dynamic threshold to determine the appropriate training approach for both responses based on their corresponding reward scores. Experimental results demonstrate the effectiveness of {\ours} compared to existing LLM alignment methods.
Paperid:2914
Authors:Yilei Chen · Aldo Pacchiano · Ioannis Paschalidis
Abstract:
We study the multiple-policy evaluation problem where we are given a set of $K$ policies and the goal is to evaluate their performance (expected total reward over a fixed horizon) to an accuracy $\epsilon$ with probability at least $1-\delta$. We propose an algorithm named CAESAR for this problem. Our approach is based on computing an approximately optimal sampling distribution and using the data sampled from it to perform the simultaneous estimation of the policy values. CAESAR has two phases. In the first phase, we produce coarse estimates of the visitation distributions of the target policies at a low order sample complexity rate that scales with $\tilde{O}(\frac{1}{\epsilon})$. In the second phase, we approximate the optimal sampling distribution and compute the importance weighting ratios for all target policies by minimizing a step-wise quadratic loss function inspired by the DualDICE \cite{nachum2019dualdice} objective. Up to low order and logarithmic terms CAESAR achieves a sample complexity $\tilde{O}\left(\frac{H^4}{\epsilon^2}\sum_{h=1}^H\max_{k\in[K]}\sum_{s,a}\frac{(d_h^{\pi^k}(s,a))^2}{\mu^*_h(s,a)}\right)$, where $d^{\pi}$ is the visitation distribution of policy $\pi$, $\mu^*$ is the optimal sampling distribution, and $H$ is the horizon.
Paperid:2915
Authors:Tong Wu · Liangxiao Jiang · Wenjun Zhang · Chaoqun Li
Abstract:
In real-world crowdsourcing scenarios, most workers annotate only a few instances, which results in a significantly sparse crowdsourced label matrix and subsequently harms the performance of label integration algorithms. A recent method called worker similarity-based label completion (WSLC) has been proven to be effective for addressing this issue. However, WSLC considers solely the correlation of the labels annotated by different workers on each individual instance, while totally ignoring the correlation of the labels annotated by different workers among similar instances. To fill this gap, we propose a novel label distribution propagation-based label completion (LDPLC) algorithm. First, we use worker similarity weighted majority voting to initialize a label distribution for each missing label. Then, we design a label distribution propagation algorithm to enable each missing label of each instance to iteratively absorb its neighbors’ label distributions. Finally, we complete each missing label based on its converged label distribution. Experimental results on both real-world and simulated crowdsourced datasets show that LDPLC significantly outperforms WSLC in enhancing the performance of label integration algorithms.
Paperid:2916
Authors:Zhun Mou · Bin Xia · Zhengchao Huang · Wenming Yang · Jiaya Jia
Abstract:
Recent great advances in video generation models have demonstrated their potential to produce high-quality videos, bringing challenges to effective evaluation. Unlike human evaluation, existing automated evaluation metrics lack high-level semantic understanding and reasoning capabilities for video, thus making them infeasible and unexplainable. To fill this gap, we curate GRADEO-Instruct, a multi-dimensional T2V evaluation instruction tuning dataset, including 3.3k videos from over 10 existing video generation models and multi-step reasoning assessments converted from 16k human annotations. We then introduce GRADEO, one of the first specifically designed video evaluation models, which grades AI-generated videos for explainable scores and assessments through multi-step reasoning. Experiments show that our method aligns better with human evaluations than existing methods. Furthermore, our benchmarking reveals that current video generation models struggle to produce content that aligns with human reasoning and complex real-world scenarios. The models, datasets, and codes will be released soon.
Paperid:2917
Authors:Bang You · Puze Liu · Huaping Liu · Jan Peters · Oleg Arenz
Abstract:
Simplicity is a powerful inductive bias. In reinforcement learning, regularization is used for simpler policies, data augmentation for simpler representations, and sparse reward functions for simpler objectives, all with the underlying motivation of increasing generalizability and robustness by focusing on the essentials. Complementary to these techniques, we investigate how to promote simple behavior throughout the duration of the episode. To that end, we introduce a modification of the reinforcement learning problem that additionally maximizes the total correlation within the induced trajectories. We propose a practical algorithm that optimizes all models, including policy and state representation, based on a lower bound approximation. In simulated robot control environments, our method naturally generates policies that induce periodic and compressible trajectories, and that exhibit superior robustness to noise and changes in dynamics compared to baseline methods, while also improving performance in the original tasks.
Paperid:2918
Authors:Jeremy McMahan
Abstract:
We extend anytime constraints to the Markov game setting and the corresponding solution concept of anytime-constrained equilibrium (ACE). Then, we present a comprehensive theory of anytime-constrained equilibria that includes (1) a computational characterization of feasible policies, (2) a fixed-parameter tractable algorithm for computing ACE, and (3) a polynomial-time algorithm for approximately computing ACE. Since computing a feasible policy is NP-hard even for two-player zero-sum games, our approximation guarantees are the best possible unless $P = NP$. We also develop the first theory of efficient computation for action-constrained Markov games, which may be of independent interest.
Paperid:2919
Authors:Konstantin Kirchheim · Frank Ortmeier
Abstract:
Out-of-distribution (OOD) detection is essential for ensuring the reliability of deep learning models operating in open-world scenarios. Current OOD detectors mainly rely on statistical models to identify unusual patterns in the latent representations of a deep neural network. This work proposes to augment existing OOD detectors with probabilistic reasoning, utilizing Markov Logic Networks (MLNs). MLNs connect first-order logic with probabilistic reasoning to assign probabilities to inputs based on weighted logical constraints defined over human-understandable concepts, which offers improved explainability. Through extensive experiments on multiple datasets, we demonstrate that MLNs can significantly enhance the performance of a wide range of existing OOD detectors while maintaining computational efficiency. Furthermore, we introduce a simple algorithm for learning logical constraints for OOD detection from a dataset and showcase its effectiveness.
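The scoring idea, assigning probabilities from weighted logical constraints over human-understandable concepts, can be sketched with invented concepts and weights (a toy, not the paper's MLN implementation or its learned constraints):

```python
import math

# Toy with invented concepts and weights (not the paper's learned
# constraints): weighted first-order rules over human-understandable
# concepts score an input; satisfying every rule yields probability 1.
constraints = [
    (2.0, lambda c: c["has_wheels"] or not c["is_car"]),    # car => wheels
    (1.5, lambda c: not (c["is_car"] and c["is_animal"])),  # exclusive labels
]

def prob_in_distribution(concepts):
    score = sum(w for w, rule in constraints if rule(concepts))
    max_score = sum(w for w, _ in constraints)
    return math.exp(score - max_score)  # 1.0 iff all rules are satisfied

ok = {"is_car": True, "has_wheels": True, "is_animal": False}
odd = {"is_car": True, "has_wheels": False, "is_animal": True}
assert prob_in_distribution(ok) == 1.0
assert prob_in_distribution(odd) < 0.1  # violated rules flag a likely OOD input
```

An input whose detected concepts violate heavily weighted rules receives a low probability, which is the signal an OOD detector can combine with its statistical score.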
Paperid:2920
Authors:Umit Basaran · Sk Miraj Ahmed · Amit Roy-Chowdhury · Basak Guler
Abstract:
With the growing adoption of machine learning and evolving data privacy regulations, the ability to erase private or copyrighted information from trained models has become a crucial requirement. Traditional unlearning methods often assume access to the complete training dataset, which is unrealistic in scenarios where the original data is no longer available. To address this challenge, we propose a certified unlearning framework that enables effective data removal without relying on the original training data. Our approach utilizes a surrogate dataset that approximates the statistical properties of the original data, allowing for controlled noise scaling based on the statistical distance between the surrogate and original data distributions. This ensures strong guarantees on the model's behavior post-unlearning while maintaining its overall utility. We establish theoretical bounds, introduce practical noise calibration techniques, and validate our method through extensive experiments on both synthetic and real-world datasets. The results demonstrate the effectiveness and reliability of our approach in privacy-sensitive settings.
Paperid:2921
Authors:Myunsoo Kim · Hayeong Lee · Seong-Woong Shim · JunHo Seo · Byung-Jun Lee
Abstract:
Intelligent agents are able to make decisions based on different levels of granularity and duration. Recent advances in skill learning have enabled the agent to solve complex, long-horizon tasks by effectively guiding the agent in choosing appropriate skills. However, the practice of using fixed-length skills can easily result in skipping valuable decision points, which ultimately limits the potential for further exploration and faster policy learning. In this work, we propose to learn a simple and effective termination condition that identifies decision points through a state-action novelty module that leverages agent experience data. Our approach, Novelty-based Decision Point Identification (NBDI), outperforms previous baselines in complex, long-horizon tasks, and remains effective even in the presence of significant variations in the environment configurations of downstream tasks, highlighting the importance of decision point identification in skill learning.
Paperid:2922
Authors:Charles O'Neill · Alim Gumran · David Klindt
Abstract:
A recent line of work has shown promise in using sparse autoencoders (SAEs) to uncover interpretable features in neural network representations. However, the simple linear-nonlinear encoding mechanism in SAEs limits their ability to perform accurate sparse inference. Using compressed sensing theory, we prove that an SAE encoder is inherently insufficient for accurate sparse inference, even in solvable cases. We then decouple encoding and decoding processes to empirically explore conditions where more sophisticated sparse inference methods outperform traditional SAE encoders. Our results reveal substantial performance gains with minimal compute increases in correct inference of sparse codes. We demonstrate this generalises to SAEs applied to large language models, where more expressive encoders achieve greater interpretability. This work opens new avenues for understanding neural network representations and analysing large language model activations.
Paperid:2923
Authors:Kyle Heuton · Frederick Muench · Shikhar Shrestha · Thomas J Stopka · Michael Hughes
Abstract:
Optimal allocation of scarce resources is a common problem for decision makers faced with choosing a limited number of locations for intervention. Spatiotemporal prediction models could make such decisions data-driven. A recent performance metric called fraction of best possible reach (BPR) measures the impact of using a model's recommended size-K subset of sites compared to the best possible top-K in hindsight. We tackle two open problems related to BPR. First, we explore how to rank all sites numerically given a probabilistic model that predicts event counts jointly across sites. Ranking via the per-site mean is suboptimal for BPR. Instead, we offer a better ranking for BPR backed by decision theory. Second, we explore how to train a probabilistic model's parameters to maximize BPR. Discrete selection of K sites implies all-zero parameter gradients which prevent standard gradient training. We overcome this barrier via advances in perturbed optimizers. We further suggest a training objective that combines likelihood with a decision-aware BPR constraint to deliver high-quality top-K rankings as well as good forecasts for all sites. We demonstrate our approach on two where-to-intervene applications: mitigating opioid-related fatal overdoses for public health and monitoring endangered wildlife.
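Why per-site means can misrank sites for BPR can be seen in a tiny Monte Carlo sketch (a hypothetical two-site toy, not the paper's model or data): with K=1, the per-sample fraction of best possible reach is the chosen site's count divided by the best site's count in hindsight, and its expectation can favor a site with a far lower mean.

```python
import numpy as np

# Toy Monte Carlo (hypothetical two-site example, not the paper's model):
# per-sample BPR with K=1 is (count at chosen site) / (count at best site
# in hindsight); its expectation can prefer the site with the lower mean.
rng = np.random.default_rng(0)
n = 1_000_000
y_a = rng.choice([0, 100], size=n).astype(float)  # mean ~50, feast-or-famine
y_b = np.ones(n)                                  # mean 1, always modest
best = np.maximum(y_a, y_b)                       # top-1 counts in hindsight

bpr_pick_a = (y_a / best).mean()                  # ~0.50
bpr_pick_b = (y_b / best).mean()                  # ~0.505
assert y_a.mean() > y_b.mean()                    # mean ranking picks site A
assert bpr_pick_b > bpr_pick_a                    # BPR ranking picks site B
```

The expectation of a ratio is not the ratio of expectations, so the hindsight-best denominator couples the ranking to the joint distribution, which is where a decision-theoretic ranking helps.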
Paperid:2924
Authors:Dong HUANG · Guangtao Zeng · Jianbo Dai · Meng Luo · Han Weng · Yuhao QING · Heming Cui · Zhijiang Guo · Jie Zhang
Abstract:
As large language models (LLMs) play an increasingly important role in code generation, enhancing both correctness and efficiency has become crucial. Current methods primarily focus on correctness, often overlooking efficiency. To address this gap, we introduce SWIFTCODE to improve both aspects by fine-tuning LLMs on a high-quality dataset comprising correct and efficient code samples. Our methodology involves leveraging multiple LLMs to generate diverse candidate code solutions for various tasks across different programming languages. We then evaluate these solutions by directly measuring their execution time and memory usage through local execution. The code solution with the lowest execution time and memory consumption is selected as the final output for each task. Experimental results demonstrate significant improvements when fine-tuning with SWIFTCODE. For instance, Qwen2.5-Coder-7B-Instruct's pass@1 score increases from 44.8\% to 57.7\%, while the average execution time for correct tasks decreases by 48.4\%. SWIFTCODE offers a scalable and effective solution for advancing AI-driven code generation, benefiting both software development and computational problem-solving.
Paperid:2925
Authors:Arun Ganesh · Ryan McKenna · Hugh B McMahan · Adam Smith · Fan Wu
Abstract:
We initiate a study of algorithms for model training with user-level differential privacy (DP), where each example may be attributed to multiple users, which we call the multi-attribution model. We first provide a carefully chosen definition of user-level DP under the multi-attribution model. Training in the multi-attribution model is facilitated by solving the contribution bounding problem, i.e. the problem of selecting a subset of the dataset for which each user is associated with a limited number of examples. We propose a greedy baseline algorithm for the contribution bounding problem. We then empirically study this algorithm for a synthetic logistic regression task and a transformer training task, including studying variants of this baseline algorithm that optimize the subset chosen using different techniques and criteria. We find that the baseline algorithm remains competitive with its variants in most settings, and build a better understanding of the practical importance of a bias-variance tradeoff inherent in solutions to the contribution bounding problem.
Paperid:2926
Authors:Anian Ruoss · Fabio Pardo · Harris Chan · Bonnie Li · Vlad Mnih · Tim Genewein
Abstract:
In this paper, we present a benchmark to pressure-test today's frontier models' multimodal decision-making capabilities in the very long-context regime (up to one million tokens) and investigate whether these models can learn from large numbers of expert demonstrations in their context. We evaluate the performance of Claude 3.5 Sonnet, Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini 2.0 Flash Experimental, GPT-4o, o1-mini, o1-preview, and o1 as policies across a battery of simple interactive decision-making tasks: playing tic-tac-toe, chess, and Atari, navigating grid worlds, solving crosswords, and controlling a simulated cheetah. We study increasing amounts of expert demonstrations in the context — from no demonstrations to 512 full episodes. Across our tasks, models rarely manage to fully reach expert performance, and often, presenting more demonstrations has little effect. Some models steadily improve with more demonstrations on a few tasks. We investigate the effect of encoding observations as text or images and the impact of chain-of-thought prompting. To help quantify the impact of other approaches and future innovations, we open source our benchmark that covers the zero-, few-, and many-shot regimes in a unified evaluation.
Paperid:2927
Authors:Shiwei Li · Xiandi Luo · Xing Tang · Haozhao Wang · Hao Chen · weihongluo · Yuhua Li · xiuqiang He · Ruixuan Li
Abstract:
Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method. In standard LoRA layers, one of the matrices, $A$ or $B$, is initialized to zero, ensuring that fine-tuning starts from the pretrained model. However, there is no theoretical support for this practice. In this paper, we investigate the impact of non-zero initialization on LoRA's fine-tuning dynamics from an infinite-width perspective. Our analysis reveals that, compared to zero initialization, simultaneously initializing $A$ and $B$ to non-zero values improves LoRA's robustness to suboptimal learning rates, particularly smaller ones. Further analysis indicates that although the non-zero initialization of $AB$ introduces noise into the pretrained model, it generally does not affect fine-tuning performance. In other words, fine-tuning does not need to strictly start from the pretrained model. The validity of our findings is confirmed through extensive experiments across various models and datasets. The code is available in an anonymous repository at https://anonymous.4open.science/r/non_zero_lora.
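The initialization at issue can be illustrated in a minimal numpy sketch (invented shapes and scales, not the paper's code): with $B = 0$ the LoRA layer reproduces the pretrained output exactly at step zero, while non-zero $A$ and $B$ start fine-tuning from a slightly perturbed model.

```python
import numpy as np

# Minimal LoRA forward sketch, h = W x + B (A x), with invented shapes.
# Conventional init sets B = 0 so fine-tuning starts exactly at the
# pretrained model; the paper studies non-zero init of both A and B.
rng = np.random.default_rng(0)
d, r = 8, 2
W = rng.standard_normal((d, d))                 # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.1           # down-projection
B_zero = np.zeros((d, r))                       # conventional zero init
B_nonzero = rng.standard_normal((d, r)) * 0.1   # the variant analyzed

x = rng.standard_normal(d)
h_pretrained = W @ x
h_zero = W @ x + B_zero @ (A @ x)        # identical to the pretrained output
h_nonzero = W @ x + B_nonzero @ (A @ x)  # pretrained output plus small "noise"

assert np.allclose(h_zero, h_pretrained)
assert np.abs(h_nonzero - h_pretrained).max() > 1e-6
```

The small offset introduced by non-zero $AB$ is the "noise" the abstract refers to, which the analysis argues is generally harmless to fine-tuning.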
Paperid:2928
Authors:Haoyang Jiang · Jindong Wang · Xingquan Zhu · Yi He
Abstract:
Graph Neural Networks (GNNs) often struggle to preserve high-frequency components of nodal signals when dealing with directed graphs. Such components are crucial for modeling flow dynamics, without which a traditional GNN tends to treat a graph with forward and reverse topologies equally. To make GNNs sensitive to those high-frequency components, and thus capable of capturing detailed topological differences, this paper proposes a novel framework that combines 1) explicit difference matrices that model directional gradients and 2) implicit physical constraints that enforce message passing within GNNs to be consistent with natural laws. Evaluations on two real-world directed graph datasets, namely a water flux network and an urban traffic flow network, demonstrate the effectiveness of our proposal.
Paperid:2929
Authors:Tianze Wang · Dongnan Gui · Yifan Hu · Shuhang Lin · Linjun Zhang
Abstract:
Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning large language models (LLMs). Yet its reliance on a singular reward model often overlooks the diversity of human preferences. Recent approaches address this limitation by leveraging multi-dimensional feedback to fine-tune corresponding reward models and train LLMs using reinforcement learning. However, the process is costly and unstable, especially given the competing and heterogeneous nature of human preferences. In this paper, we propose Mixing Preference Optimization (MPO), a post-processing framework for aggregating single-objective policies as an alternative to both multi-objective RLHF (MORLHF) and MaxMin-RLHF. MPO avoids alignment from scratch. Instead, it log-linearly combines existing policies into a unified one with the weight of each policy computed via a batch stochastic mirror descent. Empirical results demonstrate that MPO achieves balanced performance across diverse preferences, outperforming or matching existing models with significantly reduced computational costs.
Paperid:2930
Authors:Lifan Yuan · Wendi Li · Huayu Chen · Ganqu Cui · Ning Ding · Kaiyan Zhang · Bowen Zhou · Zhiyuan Liu · Hao Peng
Abstract:
Different from its counterpart outcome reward models (ORMs), which evaluate the entire responses, a process reward model (PRM) scores a reasoning trajectory step by step, providing denser and more fine-grained rewards. However, training a PRM requires labels annotated at every intermediate step, presenting significant challenges for both manual and automatic data collection. This paper aims to address this challenge. Both theoretically and empirically, we show that an implicit PRM can be obtained at no additional cost, by simply training an ORM on the cheaper response-level labels. The only assumption is to parameterize the outcome reward as the log-likelihood ratio of the policy and reference models, $r_\phi(y) = \beta \log \frac{\pi_\phi(y)}{\pi_{\mathrm{ref}}(y)}$, which can be optimized regardless of the specific choice of loss objectives. In experiments, we instantiate our implicit PRMs with various objectives and evaluate their performance on MATH. We show that our implicit PRM outperforms a strong MCTS-based baseline à la Math-Shepherd (Wang et al., 2023) using less than 1/38 of the training data. Its performance can be further improved with majority voting. We further find that scaling up instructions and responses benefits our implicit PRM, and the latter brings a larger gain. Particularly, we find that our implicit PRM, when instantiated with the cross-entropy (CE) loss, is more data-efficient and can keep improving generation models even when trained with only one response per instruction, the setup that suffers from extreme data scarcity and imbalance. Further, instructions should be relevant to downstream tasks while the diversity of responses does not bring gains. Surprisingly, training on extra Math-Shepherd step labels brings no further improvements to our implicit PRM trained on only outcome data. We hope that our work will encourage a rethinking of PRM training approaches and contribute to making training PRMs more accessible.
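Because the sequence log-likelihood factorizes over tokens, this reward parameterization yields per-step rewards at no extra cost; a minimal sketch with illustrative numbers (not the authors' code):

```python
# Sketch (illustrative, not the authors' code): with the outcome reward
# parameterized as r(y) = beta * log(pi_phi(y) / pi_ref(y)), the sequence
# log-likelihood factorizes over tokens, so a process reward for each prefix
# comes for free as the partial sum of per-token log-ratios.
def implicit_process_rewards(logp_policy, logp_ref, beta=1.0):
    """Per-token log-probs of one response under policy and reference."""
    rewards, cum = [], 0.0
    for lp, lr in zip(logp_policy, logp_ref):
        cum += beta * (lp - lr)
        rewards.append(cum)   # reward of the prefix ending at this token
    return rewards

r = implicit_process_rewards([-1.0, -0.5], [-1.2, -0.9])
# r[0] ≈ 0.2 and r[1] ≈ 0.6: at each step the policy gains over the reference
```

Training an ORM on response-level labels thus implicitly trains this per-step scorer, which is the paper's central observation.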
Paperid:2931
Authors:Belinda Li · Carl Guo · Jacob Andreas
Abstract:
Transformer language models (LMs) exhibit behaviors—from storytelling to code generation—that appear to require tracking the unobserved state of a changing world. How do they do so? We study state tracking in LMs trained or fine-tuned to track permutation word problems (i.e., to compute the order of a set of objects after a sequence of swaps). Despite the simple algebraic structure of this problem, many other tasks (e.g., simulation of finite automata and evaluation of boolean expressions) can be reduced to permutation word problems, making it a natural model for state tracking in general. We show that LMs consistently learn one of two state tracking mechanisms for this task. The first closely resembles the "associative scan" construction used in recent theoretical work by Liu et al. (2023) and Merrill et al. (2024). The second uses an easy-to-compute feature (permutation parity) to partially prune the space of outputs, then refines this with an associative scan. The two mechanisms exhibit markedly different robustness properties, and we show how to steer LMs toward one or the other with intermediate training tasks that encourage or suppress the heuristics. Our results demonstrate that transformer LMs, whether pretrained or fine-tuned, can learn to implement efficient, interpretable, and generalizable state tracking procedures.
Paperid:2932
Authors:Ruth Wan Theng Chew · Quoc Phong Nguyen · Bryan Kian Hsiang Low
Abstract:
Bilevel optimization is characterized by a two-level optimization structure, where the upper-level problem is constrained by optimal lower-level solutions, and such structures are prevalent in real-world problems. The constraint by optimal lower-level solutions poses significant challenges, especially in noisy, constrained, and derivative-free settings, as repeating lower-level optimizations is sample inefficient and predicted lower-level solutions may be suboptimal. We present BILevel Bayesian Optimization (BILBO), a novel Bayesian optimization algorithm for general bilevel problems with black-box functions, which optimizes both upper- and lower-level problems simultaneously, without the repeated lower-level optimization required by existing methods. BILBO samples from confidence-bound-based trusted sets, which bounds the suboptimality on the lower level. Moreover, BILBO selects only one function query per iteration, where the function query selection strategy incorporates the uncertainty of estimated lower-level solutions and includes a conditional reassignment of the query to encourage exploration of the lower-level objective. The performance of BILBO is theoretically guaranteed with a sublinear regret bound for commonly used kernels and is empirically evaluated on several synthetic and real-world problems.
Paperid:2933
Authors:Jack Min Ong · Matthew Di Ferrante · Aaron Pazdera · Ryan Garner · Sami Jaghouar · Manveer Basra · Max Ryabinin · Johannes Hagemann
Abstract:
Large language models (LLMs) have proven to be very capable, but access to the best models currently relies on inference providers, which introduces trust challenges: how can we be sure that the provider is using the model configuration they claim? We propose TOPLOC, a novel method for verifiable inference that addresses this problem. TOPLOC leverages a compact locality-sensitive hashing mechanism for intermediate activations, which can detect unauthorized modifications to models, prompts, or precision with 100\% accuracy, achieving no false positives or negatives in our empirical evaluations. Our approach is robust across diverse hardware configurations, GPU types, and algebraic reorderings, which allows for validation speeds significantly faster than the original inference. By introducing a polynomial encoding scheme, TOPLOC minimizes the memory overhead of the generated commits by $1000\times$, requiring only 258 bytes of storage per 32 new tokens compared to the 262KB requirement of storing the token embeddings directly for Llama-3.1-8B-Instruct. Our method empowers users to verify LLM inference computations efficiently, fostering greater trust and transparency in open ecosystems, and lays a foundation for verifiable, trustless AI services.
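The flavor of activation commitments can be sketched in a few lines (a crude rounding-and-hashing stand-in; TOPLOC's actual locality-sensitive hashing and polynomial encoding are more sophisticated and far more compact):

```python
import hashlib
import numpy as np

# Crude stand-in for activation commitments (TOPLOC's locality-sensitive
# hashing and polynomial encoding are more sophisticated): round the
# intermediate activations to tolerate benign numerical noise, then hash.
def commit(activations, decimals=2):
    rounded = np.round(np.asarray(activations, dtype=np.float64), decimals)
    return hashlib.sha256(rounded.tobytes()).hexdigest()

a = np.array([0.12345, -1.98765])
assert commit(a) == commit(a + 1e-6)   # tiny numerical noise: same commit
assert commit(a) != commit(a + 0.5)    # a tampered activation: detected
```

The provider publishes commits alongside outputs; a verifier recomputes activations and checks the commits, accepting small hardware-induced differences while rejecting a swapped model, prompt, or precision.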
Paperid:2934
Authors:Lisa Jin · Jianhao Ma · Zechun Liu · Andrey Gromov · Aaron Defazio · Lin Xiao
Abstract:
We develop a principled method for quantization-aware training (QAT). Specifically, we show that convex, piecewise-affine regularization (PAR) can effectively induce neural network weights to cluster towards discrete values. We minimize PAR-regularized loss functions using an aggregate proximal stochastic gradient method (AProx) and prove that it enjoys last-iterate convergence. Our approach provides an interpretation of the straight-through estimator (STE), a widely used heuristic for QAT, as the asymptotic form of PARQ. We conduct experiments to demonstrate that PARQ obtains competitive performance on convolution- and transformer-based vision tasks.
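A toy grid-attracting penalty conveys the mechanism (illustrative only; the paper's PAR is convex, unlike this simple nearest-level distance, and the levels here are invented):

```python
import numpy as np

# Toy grid-attracting penalty (illustrative; the paper's PAR is convex,
# unlike this simple nearest-level distance): each weight is charged its
# distance to the nearest quantization level, pulling weights to the grid.
levels = np.array([-1.0, 0.0, 1.0])

def penalty(w):
    # distance from each weight to its nearest level, summed over weights
    return np.abs(w[:, None] - levels[None, :]).min(axis=1).sum()

w = np.array([-0.9, 0.1, 0.45])
# per-weight distances 0.1, 0.1, 0.45, so penalty(w) ≈ 0.65
```

Adding such a term to the training loss makes weights cluster toward the discrete grid, which is the behavior PAR induces in a convex, provably convergent way via the AProx method.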
Paperid:2935
Authors:Chenlong Wang · Zhaoyang Chu · Zhengxiang Cheng · Xuyi Yang · Kaiyue Qiu · Yao Wan · Zhou Zhao · Xuanhua Shi · Hai Jin · Dongping Chen
Abstract:
Large Language Models (LLMs) have exhibited exceptional performance in software engineering yet face challenges in adapting to continually evolving code knowledge, particularly regarding the frequent updates of third-party library APIs. This limitation, stemming from static training datasets, often results in non-executable code or implementations with suboptimal safety and efficiency. This paper introduces CODESYNC, a data engine for identifying outdated code patterns and collecting real-time API knowledge updates from Python third-party libraries. Building upon CODESYNC, we develop CODESYNCBENCH, a comprehensive benchmark for assessing LLMs' abilities to stay synchronized with API evolution, which includes updates for 220 APIs from six Python libraries. Our benchmark provides 3,300 test cases across three evaluation tasks and an update-aware instruction tuning dataset comprising 2,200 training samples. Extensive experiments on 14 state-of-the-art LLMs reveal that they struggle with dynamic API evolution, even with the support of knowledge updating approaches (e.g., SFT, DPO, and ORPO). We believe our benchmark offers a strong foundation for the development of more effective real-time API knowledge updating methods in the future. The source code and dataset are publicly available at: https://anonymous.4open.science/r/CodeSync-5A9D/.
Paperid:2936
Authors:Greyson Brothers
Abstract:
We investigate the design of pooling methods used to aggregate transformer embeddings for non-sequential tasks, primarily motivated by reinforcement learning applications. This work considers problems where a subset of the input vectors contain requisite information (signal) for a downstream task while the rest are distractors (noise). By framing pooling methods as vector quantizers with the goal of minimizing signal loss, we demonstrate that the standard pooling approaches used to aggregate transformer outputs, MaxPool and AvgPool, are vulnerable to performance collapse as the signal-to-noise ratio (SNR) of inputs fluctuates. We then show that an attention-based pooling method can approximate the signal-optimal vector quantizer within derived error bounds for any SNR. Our theoretical results are validated by supervised experiments on a synthetic dataset designed to evaluate the SNR problem in isolation. These results are then generalized to standard relational reasoning and multi-agent reinforcement learning benchmark environments with noisy observations, where transformers with attention-based pooling display superior robustness across tasks.
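The SNR failure mode of average pooling, and how a query-based attention pool avoids it, can be seen in a small numpy sketch (invented vectors and query, not the paper's exact formulation):

```python
import numpy as np

# Invented vectors, not the paper's formulation: one signal token among
# distractors. A query-based attention pool keeps the signal; AvgPool
# dilutes it as the signal-to-noise ratio drops.
def attention_pool(X, q):
    scores = X @ q                     # similarity of each row to the query
    w = np.exp(scores - scores.max())  # stable softmax weights
    w /= w.sum()
    return w @ X                       # attention-weighted sum of rows

signal = np.array([1.0, 0.0])
X = np.stack([signal] + [np.array([0.0, 0.1])] * 8)  # 1 signal, 8 distractors
q = np.array([10.0, 0.0])              # stand-in for a learned query

pooled_attn = attention_pool(X, q)
pooled_avg = X.mean(axis=0)
assert pooled_attn[0] > 0.9   # signal preserved
assert pooled_avg[0] < 0.2    # signal diluted by averaging
```

As distractors are added, the averaged output shrinks the signal component toward zero, while the attention weights keep concentrating on the signal token.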
Paperid:2937
Authors:Mengting Ma · Yizhen Jiang · Mengjiao Zhao · Jiaxin Li · Wei Zhang
Abstract:
Remote sensing pansharpening aims to reconstruct spatial-spectral properties during the fusion of panchromatic (PAN) images and low-resolution multi-spectral (LR-MS) images, finally generating the high-resolution multi-spectral (HR-MS) images. In the mainstream modeling strategies, i.e., CNN and Transformer, the input images are treated as the equal-sized grid of pixels in the Euclidean space. They have limitations in facing remote sensing images with irregular ground objects. The graph is a more flexible structure; however, there are two major challenges when modeling spatial-spectral properties with a graph: 1) constructing the customized graph structure for spatial-spectral relationship priors; 2) learning the unified spatial-spectral representation through the graph. To address these challenges, we propose the spatial-spectral heterogeneous graph learning network, named HetSSNet. Specifically, HetSSNet initially constructs the heterogeneous graph structure for pansharpening, which explicitly describes pansharpening-specific relationships. Subsequently, the basic relationship pattern generation module is designed to extract the multiple relationship patterns from the heterogeneous graph. Finally, the relationship pattern aggregation module is exploited to collaboratively learn a unified spatial-spectral representation across different relationships among nodes with adaptive importance learning from local and global perspectives. Extensive experiments demonstrate the significant superiority and generalization of HetSSNet.
Paperid:2938
Authors:Ang Lv · Ruobing Xie · Yining Qian · Songhao Wu · Xingwu Sun · Zhanhui Kang · Di Wang · Rui Yan
Abstract:
Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only partial parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet overlooked issue, leading to suboptimal expert selection and learning. To address this, we propose Autonomy-of-Expert (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked based on their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization. This self-evaluating-then-partner-comparing approach ensures improved expert selection and effective learning. We pre-train language models with 700M to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.
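The router-free selection described above can be sketched as follows (invented shapes; the real AoE additionally factorizes the first projection to make pre-computation cheap):

```python
import numpy as np

# Router-free selection as described in the abstract (invented shapes):
# every expert pre-computes an internal activation for the token, experts
# are ranked by activation norm, and only the top-k finish the forward pass.
rng = np.random.default_rng(1)
d, n_experts, k = 16, 4, 2
W_in = [rng.standard_normal((d, d)) for _ in range(n_experts)]

def select_experts(x, k):
    acts = [W @ x for W in W_in]            # cheap first-stage activations
    norms = [np.linalg.norm(a) for a in acts]
    top = sorted(range(len(acts)), key=lambda i: -norms[i])[:k]
    return top, [acts[i] for i in top]      # only these experts continue

x = rng.standard_normal(d)
chosen, _ = select_experts(x, k)
assert len(chosen) == k and all(0 <= i < n_experts for i in chosen)
```

Because the selecting signal is the expert's own activation rather than a separate router's score, selection and execution are no longer decoupled.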
Paperid:2939
Authors:Md Yousuf Harun · Jhair Gallardo · Christopher Kanan
Abstract:
Out-of-distribution (OOD) detection and OOD generalization are widely studied in Deep Neural Networks (DNNs), yet their relationship remains poorly understood. We empirically show that the degree of Neural Collapse (NC) in a network layer is inversely related to these objectives: stronger NC improves OOD detection but degrades generalization, while weaker NC enhances generalization at the cost of detection. This trade-off suggests that a single feature space cannot simultaneously achieve both tasks. To address this, we develop a theoretical framework linking NC to OOD detection and generalization. We show that entropy regularization mitigates NC to improve generalization, while a fixed Simplex ETF projector enforces NC for better detection. Based on these insights, we propose a method to control NC at different DNN layers. In experiments, our method excels at both tasks across OOD datasets and DNN architectures.
Paperid:2940
Authors:Chaitanya Joshi · Xiang Fu · Yi-Lun Liao · Vahe Gharakhanyan · Benjamin Kurt Miller · Anuroop Sriram · Zachary Ulissi
Abstract:
Diffusion models are the standard toolkit for generative modelling of 3D atomic systems. However, for different types of atomic systems – such as molecules and materials – the generative processes are usually highly specific to the target system despite the underlying physics being the same. We introduce the All-atom Diffusion Transformer (ADiT), a unified latent diffusion framework for jointly generating both periodic materials and non-periodic molecular systems using the same model: (1) An autoencoder maps a unified, all-atom representation of molecules and materials to a shared latent embedding space; and (2) A diffusion model is trained to generate new latent embeddings that the autoencoder can decode to sample new molecules or materials. Experiments on QM9 and MP20 datasets demonstrate that jointly trained ADiT generates realistic and valid molecules as well as materials, exceeding state-of-the-art results from molecule- and crystal-specific models. ADiT uses standard Transformers, resulting in significant speedups during training and inference compared to equivariant diffusion models. Scaling ADiT up to half a billion parameters predictably improves performance, representing a step towards broadly generalizable foundation models for generative chemistry.
Paperid:2941
Authors:Vsevolod Viliuga · Leif Seute · Nicolas Wolf · Simon Wagner · Arne Elofsson · Jan Stuehmer · Frauke Gräter
Abstract:
Recent advances in geometric deep learning and generative modeling have enabled the design of novel proteins with a wide range of desired properties. However, current state-of-the-art approaches are typically restricted to generating proteins with only static target properties, such as motifs and symmetries. In this work, we take a step towards overcoming this limitation by proposing a framework to condition structure generation on flexibility, which is crucial for key functionalities such as catalysis or molecular recognition. We first introduce BackFlip, an equivariant neural network for predicting per-residue flexibility from an input backbone structure. Relying on BackFlip, we propose an SE(3)-equivariant conditional flow matching model that solves the inverse problem, that is, generating backbones that display a target flexibility profile. In our experiments, we show that GAFL-Flex is able to generate novel protein backbones with the desired local flexibility, verified by Molecular Dynamics (MD) simulations.
Paperid:2942
Authors:Spencer Young · Porter Jenkins · Longchao Da · Jeffrey Dotson · Hua Wei
Abstract:
Neural networks capable of accurate, input-conditional uncertainty representation are essential for real-world AI systems. Deep ensembles of Gaussian networks have proven highly effective for continuous regression due to their ability to flexibly represent aleatoric uncertainty via unrestricted heteroscedastic variance, which in turn enables accurate epistemic uncertainty estimation. However, no analogous approach exists for $\textit{count}$ regression, despite many important applications. To address this gap, we propose the Deep Double Poisson Network (DDPN), a novel neural discrete count regression model that outputs the parameters of the Double Poisson distribution, enabling arbitrarily high or low predictive aleatoric uncertainty for count data and improving epistemic uncertainty estimation when ensembled. We formalize and prove that DDPN exhibits robust regression properties similar to heteroscedastic Gaussian models via learnable loss attenuation, and introduce a simple loss modification to control this behavior. Experiments on diverse datasets demonstrate that DDPN outperforms current baselines in accuracy, calibration, and out-of-distribution detection, establishing a new state-of-the-art in deep count regression.
Paperid:2943
Authors:Yaoxuan Feng · Wenchao Chen · yuxin li · Bo Chen · Yubiao Wang · Zixuan Zhao · Hongwei Liu · Mingyuan Zhou
Abstract:
Diffusion models have demonstrated outstanding performance in industrial anomaly detection. However, their iterative denoising nature results in slow inference speed, limiting their practicality for real-time industrial deployment. To address this challenge, we propose OmiAD, a one-step masked diffusion model for multi-class anomaly detection, derived from a well-designed multi-step Adaptive Masked Diffusion Model (AMDM) and compressed using Adversarial Score Distillation (ASD). OmiAD first introduces AMDM, equipped with an adaptive masking strategy that dynamically adjusts masking patterns based on noise levels and encourages the model to reconstruct anomalies as normal counterparts by leveraging broader context, to reduce the pixel-level shortcut reliance. Then, ASD is developed to compress the multi-step diffusion process into a single-step generator by score distillation, incorporating a shared-weight discriminator that effectively reuses parameters while significantly improving both inference efficiency and detection performance. The effectiveness of OmiAD is validated on four diverse datasets, achieving state-of-the-art performance across seven metrics while delivering a remarkable inference speedup.
Paperid:2944
Authors:Yuwei Liu · Xin Chen · Jiaming Zhuo · Di Jin · Chuan Wang · Yuanfang Guo · Xiaochun Cao · Zhen Wang
Abstract:
The distribution shifts and the scarcity of labels prevent graph learning methods, especially graph neural networks (GNNs), from generalizing across domains. Compared to Unsupervised Domain Adaptation (UDA) with embedding alignment, Unsupervised Graph Domain Adaptation (UGDA) becomes more challenging in light of the attribute and topology entanglement in the representation. Beyond embedding alignment, UGDA turns to topology alignment but is limited by the ability of the employed topology model and the estimation of pseudo labels. To alleviate this issue, this paper proposes a Disentangled Graph Spectral Domain adaptation (DGSDA) method by disentangling attribute and topology alignments and directly aligning flexible graph spectral filters beyond topology. Specifically, Bernstein polynomial approximation, which mimics the behavior of the function to be approximated to a remarkable degree, is employed to capture complicated topology characteristics and avoid expensive eigenvalue decomposition. Theoretical analysis reveals the tight GDA bound of DGSDA and the rationality of polynomial coefficient regularization. Quantitative and qualitative experiments justify the superiority of the proposed DGSDA.
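For intuition, a degree-K Bernstein-basis spectral filter expresses g(λ) over the normalized-Laplacian spectrum [0, 2] through learnable coefficients. The sketch below evaluates the filter on eigenvalues for illustration only (in practice such filters are applied via matrix polynomials precisely to avoid eigendecomposition; this helper is our assumption, not DGSDA's implementation):

```python
import numpy as np
from math import comb

def bernstein_filter(eigvals, theta):
    """Evaluate g(lambda) = sum_k theta_k * B_{k,K}(lambda/2), where
    B_{k,K} are Bernstein basis polynomials on [0, 1] and K+1 = len(theta).
    Nonnegative coefficients give a nonnegative (stable) filter."""
    K = len(theta) - 1
    x = np.asarray(eigvals, dtype=float) / 2.0   # map [0, 2] -> [0, 1]
    out = np.zeros_like(x)
    for k, t in enumerate(theta):
        out += t * comb(K, k) * (1 - x) ** (K - k) * x ** k
    return out
```

Because the Bernstein basis forms a partition of unity, setting all coefficients to 1 yields the all-pass filter, a handy sanity check.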
Paperid:2945
Authors:Xingwu Sun · Shuaipeng Li · Ruobing Xie · Weidong Han · Kan Wu · Zhen Yang · Yixing Li · An Wang · SHUAI LI · Jinbao Xue · Yu Cheng · Yangyu Tao · Zhanhui Kang · ChengZhong Xu · Di Wang · Jie Jiang
Abstract:
Low-precision training is considered an effective strategy for reducing both training and downstream inference costs. Previous scaling laws for precision mainly focus on integer quantization, paying less attention to the constituents of floating-point quantization, and thus cannot fit the LLM losses in this scenario well. In contrast, while floating-point quantization training is more commonly implemented in production, the research on it has been relatively superficial. In this paper, we thoroughly explore the effects of the floating-point quantization targets, exponent bits, mantissa bits, and the calculation granularity of the scaling factor on the training performance of LLM models. While presenting an accurate unified scaling law for floating-point quantization, we also provide valuable suggestions for the community: (1) Exponent bits contribute slightly more to the model performance than mantissa bits. We provide the optimal exponent-mantissa bit ratio for different bit numbers, which is available for future reference by hardware manufacturers; (2) We discover the formation of the critical data size in low-precision LLM training. Too much training data exceeding the critical data size will inversely degrade LLM performance; (3) The optimal floating-point quantization precision is directly proportional to the computational power, but within a wide computational power range, we estimate that the best cost-performance precision lies between 4-8 bits.
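To make the exponent/mantissa trade-off concrete, here is a toy rounding model for a single value under a given bit budget. It ignores subnormals, infinities, and NaNs, so it is a sketch for intuition only, not the quantizers studied in the paper:

```python
import math

def fp_quantize(x, e_bits, m_bits):
    """Round x to the nearest value representable with e_bits of
    exponent and m_bits of mantissa (sign handled separately).
    More mantissa bits -> finer spacing within a binade; more
    exponent bits -> wider dynamic range."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    x = abs(x)
    bias = 2 ** (e_bits - 1) - 1
    e = math.floor(math.log2(x))
    e = max(min(e, bias), -bias)       # clamp to representable exponents
    scale = 2.0 ** (e - m_bits)        # spacing of representable values
    return sign * round(x / scale) * scale
```

For example, with 4 exponent and 3 mantissa bits, values near 3.14 snap to a grid of spacing 0.25.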
Paperid:2946
Authors:Yuchun Miao · Sen Zhang · Liang Ding · Yuqi Zhang · Lefei Zhang · Dacheng Tao
Abstract:
This work identifies the Energy Loss Phenomenon in Reinforcement Learning from Human Feedback (RLHF) and its connection to reward hacking. Specifically, energy loss in the final layer of a Large Language Model (LLM) gradually increases during the RL process, with an excessive increase in energy loss characterizing reward hacking. Beyond empirical analysis, we further provide a theoretical foundation by proving that, under mild conditions, the increased energy loss reduces the upper bound of contextual relevance in LLMs, which is a critical aspect of reward hacking as the reduced contextual relevance typically indicates overfitting to reward-model-favored patterns in RL. To address this issue, we propose an Energy loss-aware PPO algorithm (EPPO), which penalizes the increase in energy loss in the LLM's final layer during reward calculation to prevent excessive energy loss, thereby mitigating reward hacking. We theoretically show that EPPO can be conceptually interpreted as an entropy-regularized RL algorithm, which provides deeper insights into its effectiveness. Extensive experiments across various LLMs and tasks demonstrate the commonality of the energy loss phenomenon, as well as the effectiveness of EPPO in mitigating reward hacking and improving RLHF performance.
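The reward-shaping idea, penalizing energy-loss growth during reward calculation, can be illustrated in one line. The clamp at zero, the reference value, and the coefficient are our assumptions about a reasonable form, not the paper's exact penalty:

```python
def eppo_shaped_reward(reward, energy_loss, ref_energy, coef=0.1):
    """Subtract a penalty proportional to the increase of final-layer
    energy loss over a reference value, discouraging the excessive
    growth associated with reward hacking (hypothetical constants)."""
    return reward - coef * max(0.0, energy_loss - ref_energy)
```

When the energy loss stays at or below the reference, the reward passes through unchanged.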
Paperid:2947
Authors:Xin Zheng · Wei Huang · Chuan Zhou · Ming Li · Shirui Pan
Abstract:
In this work, we address the test-time adaptation challenge in graph neural networks (GNNs), focusing on overcoming the limitations in flexibility and generalization inherent in existing data-centric approaches. To this end, we propose a novel research problem, test-time graph neural dataset search, which seeks to learn a parameterized test-time graph distribution to enhance the inference performance of unseen test graphs on well-trained GNNs. Specifically, we propose a generative Projection based test-time Graph Neural Dataset Search method, named PGNDS, which maps the unseen test graph distribution back to the known training distribution through a generation process guided by well-trained GNNs. The proposed PGNDS framework consists of three key modules: (1) dual conditional diffusion for GNN-guided generative projection through test-back-to-training distribution mapping; (2) dynamic search from the generative sampling space to select the most expressive test graphs; (3) ensemble inference to aggregate information from original and adapted test graphs. Extensive experiments on real-world graphs demonstrate the superior ability of our proposed PGNDS for improved test-time GNN inference.
Paperid:2948
Authors:Yuhang Li · Tong Liu · Yangguang Cui · Ming Hu · Xiaoqiang Li
Abstract:
Federated learning (FL) presents a promising strategy for distributed and privacy-preserving learning, yet struggles with performance issues in the presence of heterogeneous data distributions. Recently, a series of works based on sharpness-aware minimization (SAM) have emerged to improve local learning generality, proving to be effective in mitigating data heterogeneity effects. However, most SAM-based methods do not directly consider the global objective and require two backward passes per iteration, resulting in diminished effectiveness. To overcome these two bottlenecks, we leverage the global model trajectory to directly measure sharpness for the global objective, requiring only a single backward pass. We further propose a novel and general algorithm FedGMT to overcome data heterogeneity and the pitfalls of previous SAM-based methods. We analyze the convergence of FedGMT and conduct extensive experiments on visual and text datasets in a variety of scenarios, demonstrating that FedGMT achieves competitive accuracy with state-of-the-art FL methods while minimizing computation and communication overhead.
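For context, a vanilla SAM update needs two gradient evaluations per step, which is exactly the overhead the abstract refers to (FedGMT's single-pass, trajectory-based sharpness measure is not reproduced here). A minimal numpy sketch of the standard two-pass update:

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One sharpness-aware minimization step: perturb weights toward
    the local ascent direction, then descend using the gradient at
    the perturbed point.  grad_fn(w) stands in for a backward pass."""
    g = grad_fn(w)                               # first backward pass
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent perturbation
    g_sharp = grad_fn(w + eps)                   # second backward pass
    return w - lr * g_sharp
```

On the quadratic `f(w) = w**2` starting from `w = 1`, one step lands at roughly 0.79 instead of plain gradient descent's 0.8, reflecting the sharpness-penalizing perturbation.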
Paperid:2949
Authors:Yuan Zhang · Chun-Kai Fan · Junpeng Ma · Wenzhao Zheng · Tao Huang · Kuan Cheng · Denis Gudovskiy · Tomoyuki Okuno · Yohei Nakata · Kurt Keutzer · Shanghang Zhang
Abstract:
In vision-language models (VLMs), visual tokens usually consume a significant amount of computational overhead, despite their sparser information density compared to text tokens. To address this, most existing methods learn a network to prune redundant visual tokens and require additional training data. Differently, we propose an efficient training-free token optimization mechanism dubbed SparseVLM without extra parameters or fine-tuning costs. Concretely, given that visual tokens complement text tokens in VLMs for linguistic reasoning, we select visual-relevant text tokens to rate the significance of vision tokens within the self-attention matrix extracted from the VLMs. Then we progressively prune irrelevant tokens. To maximize sparsity while retaining essential information, we introduce a rank-based strategy to adaptively determine the sparsification ratio for each layer, alongside a token recycling method that compresses pruned tokens into more compact representations. Experimental results show that our SparseVLM improves the efficiency of various VLMs across a range of image and video understanding tasks. In particular, when LLaVA is equipped with SparseVLM, it achieves a 54\% reduction in FLOPs, lowers CUDA time by 37\%, and maintains an accuracy rate of 97\%.
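The core rating step, scoring visual tokens by the attention mass they receive from relevant text tokens and keeping the top fraction, can be sketched as follows. The rank-based per-layer ratio and token recycling are omitted; this is our simplification, not the released implementation:

```python
import numpy as np

def prune_visual_tokens(attn, keep_ratio=0.5):
    """Keep the visual tokens with the highest average attention from
    text tokens.  attn has shape (num_text_tokens, num_visual_tokens);
    returns the sorted indices of the retained visual tokens."""
    scores = attn.mean(axis=0)                     # per-visual-token rating
    k = max(1, int(keep_ratio * attn.shape[1]))    # how many to keep
    keep = np.sort(np.argsort(scores)[::-1][:k])   # top-k, original order
    return keep
```

Returning indices in their original order preserves the positional layout of the surviving tokens for the following layers.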
Paperid:2950
Authors:Tom Wollschläger · Jannes Elstner · Simon Markus Geisler · Vincent Cohen-Addad · Stephan Günnemann · Johannes Gasteiger
Abstract:
The safety alignment of state-of-the-art large language models (LLMs) can be circumvented through adversarially crafted inputs, yet how these attacks bypass safety barriers remains poorly understood. Prior work suggests that there exists one refusal direction that, if "activated," determines whether an LLM refuses a request. In this work, we use a novel gradient-based approach to search for such refusal directions. Contrary to prior work, we find multiple independent directions and even multi-dimensional concept cones mediating refusal. Furthermore, we show that orthogonality of directions does not imply independence under intervention, motivating the notion of representational independence to capture both linear and non-linear effects, and use it to find genuinely independent directions. We demonstrate that exploiting multiple refusal directions can yield higher attack success rates, suggesting that each direction captures complementary aspects of the refusal process. The existence of multi-dimensional refusal cones indicates that refusal mechanisms in LLMs are governed by complex spatial structures, while the existence of multiple representationally independent directions emphasizes that different mechanisms exist. Our gradient-based representation engineering uncovers these mechanisms and can serve as a foundation for future work on understanding refusal behavior.
Paperid:2951
Authors:Yash Shah · Camila Gonzalez · MohammadHassan Abbasi · Qingyu Zhao · Kilian M Pohl · Ehsan Adeli
Abstract:
Confounders are extraneous variables that affect both the input and the target, resulting in spurious correlations and biased predictions. There are recent advances in dealing with or removing confounders in traditional models, such as metadata normalization (MDN), where the distribution of the learned features is adjusted based on the study confounders. However, in the context of continual learning, where a model learns continuously from new data over time without forgetting, learning feature representations that are invariant to confounders remains a significant challenge. To remove the influence of confounding variables from intermediate feature representations, we introduce the Recursive MDN (R-MDN) layer, which can be integrated into any deep learning architecture, including vision transformers, and at any model stage. R-MDN performs statistical regression via the recursive least squares algorithm to maintain and continually update an internal model state with respect to changing distributions of data and confounding variables. Our experiments demonstrate that R-MDN promotes equitable predictions across population groups, both within static learning and across different stages of continual learning, by reducing catastrophic forgetting caused by confounder effects changing over time.
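The recursive least squares machinery that lets such a layer track drifting confounder distributions fits in a few lines. The standalone form below (outside any network, no forgetting factor) is our illustrative assumption, not the R-MDN layer itself:

```python
import numpy as np

class RecursiveResidualizer:
    """Online removal of a linear confounder effect from features:
    maintain a recursive least squares fit of features on confounders
    and return the residual, updating the fit with each sample."""

    def __init__(self, dim_c, dim_f, lam=1e3):
        self.P = lam * np.eye(dim_c)        # inverse covariance estimate
        self.B = np.zeros((dim_c, dim_f))   # confounder -> feature weights

    def update(self, c, f):
        c = c.reshape(-1, 1)
        Pc = self.P @ c
        gain = Pc / (1.0 + c.T @ Pc)        # RLS gain vector
        self.B += gain @ (f.reshape(1, -1) - c.T @ self.B)
        self.P -= gain @ Pc.T
        return f - (c.T @ self.B).ravel()   # confounder-free residual
```

Feeding a feature that is purely a linear function of the confounder drives the residual toward zero within a few updates.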
Paperid:2952
Authors:Yuanzhe Hu · Kinshuk Goel · Vlad Killiakov · Yaoqing Yang
Abstract:
Researchers are increasingly interested in studying the eigenspectrum of weight matrices of deep neural networks (DNNs) and its intricate relationship with model quality. At a high level, eigenspectrum analysis of weight matrices involves measuring the \emph{heavy-tailness} of the empirical spectral density (ESD) of a weight matrix, and it provides insight into how well a model is trained and can guide decisions related to assigning better layerwise training hyperparameters. In this paper, we address a specific challenge associated with such eigenspectrum methods: the impact of the aspect ratio of weight matrices on estimated heavy-tailness metrics. Through empirical analysis, we demonstrate that matrices of varying sizes (and aspect ratios) introduce a non-negligible bias in the estimation of heavy-tailness metrics, leading to inaccurate model diagnosis and hyperparameter scheduling. To overcome this limitation, we propose normalizing the weight matrices by subsampling submatrices with a fixed aspect ratio. Instead of measuring the heavy-tailness of the original ESD, we measure the average ESD of these subsampled submatrices. We validate our approach across various optimization techniques and application domains that involve eigenspectrum analysis of weights, including image classification in computer vision (CV) models, scientific machine learning (SciML) model training, and large language model (LLM) pruning. Our results show that despite its simplicity, matrix subsampling uniformly improves the accuracy of eigenspectrum analysis while enabling more balanced and effective model optimization in these application domains. In one of the LLM pruning experiments, our method reduces the perplexity of the LLaMA-7B model by 19.74\% when compared with the state-of-the-art method.
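The normalization idea, pooling spectra of submatrices that all share one aspect ratio instead of analyzing full matrices of varying shapes, can be sketched with square subsampling (square is our choice of fixed ratio; the paper's exact sampling scheme may differ):

```python
import numpy as np

def subsampled_esd(W, n_sub=10, seed=0):
    """Pool the squared singular values (eigenvalues of sub.T @ sub) of
    randomly subsampled square submatrices of W, so heavy-tailness
    metrics are estimated at a fixed 1:1 aspect ratio regardless of
    W's original shape."""
    rng = np.random.default_rng(seed)
    s = min(W.shape)                        # side of each square submatrix
    evs = []
    for _ in range(n_sub):
        rows = rng.choice(W.shape[0], size=s, replace=False)
        cols = rng.choice(W.shape[1], size=s, replace=False)
        sub = W[np.ix_(rows, cols)]
        evs.append(np.linalg.svd(sub, compute_uv=False) ** 2)
    return np.concatenate(evs)
```

Any heavy-tailness metric (e.g. a power-law exponent fit) is then computed on the pooled ESD instead of the original one.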
Paperid:2953
Authors:Adeel Pervez · Efstratios Gavves · Francesco Locatello
Abstract:
We present Mechanistic PDE Networks, a model for the discovery of governing partial differential equations from data. Mechanistic PDE Networks represent spatiotemporal data as space-time dependent linear partial differential equations in neural network hidden representations. The represented PDEs are then solved and decoded for specific tasks. The learned PDE representations naturally express the spatiotemporal dynamics in data in neural network hidden space, enabling increased modeling power. Solving the PDE representations in a compute- and memory-efficient way, however, is a significant challenge. We develop a native, GPU-capable, parallel, sparse and differentiable multigrid solver specialized for linear partial differential equations that acts as a module in Mechanistic PDE Networks. Leveraging the PDE solver, we propose a discovery architecture that can discover nonlinear PDEs in complex settings, while being robust to noise. We validate PDE discovery on a number of PDEs including reaction-diffusion and Navier-Stokes equations.
Paperid:2954
Authors:Yi Feng · Kaito Fujii · EFSTRATIOS PANTELEIMON SKOULAKIS · Xiao Wang · Volkan Cevher
Abstract:
Since Polyak's pioneering work, heavy ball (HB) momentum has been widely studied in minimization. However, its role in min-max games remains largely unexplored. As a key component of practical min-max algorithms like Adam, this gap limits their effectiveness. In this paper, we present a continuous-time analysis for HB with simultaneous and alternating update schemes in min-max games. Locally, we prove smaller momentum enhances algorithmic stability by enabling local convergence across a wider range of step sizes, with alternating updates generally converging faster. Globally, we study the implicit regularization of HB, and find smaller momentum guides algorithm trajectories towards shallower slope regions of the loss landscapes, with alternating updates amplifying this effect. Surprisingly, all these phenomena differ from those observed in minimization, where larger momentum yields similar effects. Our results reveal fundamental differences between HB in min-max games and minimization, and numerical experiments further validate our theoretical results.
Paperid:2955
Authors:Ruo-Jing Dong · Yu Yao · Bo Han · Tongliang Liu
Abstract:
Semantic Dependency refers to the relationship between words in a sentence where the meaning of one word depends on another, which is important for natural language understanding. In this paper, we investigate the role of semantic dependencies in answering questions for transformer models, which is achieved by analyzing how token values shift in response to changes in semantics. Through extensive experiments on models including the BERT series, GPT, and LLaMA, we uncover the following key findings: 1) Most tokens primarily retain their original semantic information even as they propagate through multiple layers. 2) Models can encode truthful semantic dependencies in tokens in the final layer. 3) Mistakes in model answers often stem from specific tokens encoded with incorrect semantic dependencies. Furthermore, we found that addressing the incorrectness by directly adjusting parameters is challenging because the same parameters can encode both correct and incorrect semantic dependencies depending on the context. Our findings provide insights into the causes of incorrect information generation in transformers and help the future development of robust and reliable models.
Paperid:2956
Authors:Kai Müller · Martin Wenzel · Tobias Windisch
Abstract:
Many production lines require active control mechanisms, such as adaptive routing, worker reallocation, and rescheduling, to maintain optimal performance. However, designing these control systems is challenging for various reasons. While reinforcement learning (RL) has shown promise in addressing these challenges, a standardized and general framework is still lacking. In this work, we introduce LineFlow, an extensible, open-source Python framework for simulating production lines of arbitrary complexity and training RL agents to control them. To demonstrate the capabilities and to validate the theoretical foundations of LineFlow, we formulate core subproblems of active line control in ways that facilitate mathematical analysis. For each problem, we provide optimal solutions for comparison. We benchmark state-of-the-art RL algorithms and show that the learned policies approach optimal performance in well-understood scenarios. However, for more complex, industrial-scale production lines, RL still faces significant challenges, highlighting the need for further research in areas such as reward shaping, curriculum learning, and hierarchical control.
Paperid:2957
Authors:Guoqing Chao · Zhenghao Zhang · Lei Meng · Jie Wen · Dianhui Chu
Abstract:
Federated multi-view clustering has been proposed to mine the valuable information within multi-view data distributed across different devices, and has achieved impressive results while preserving privacy. Despite great progress, most federated multi-view clustering methods only use global pseudo-labels to guide the downstream clustering process and fail to exploit the global information when extracting features. In addition, the missing data problem in the federated multi-view clustering task is less explored. To address these problems, we propose a novel Federated Incomplete Multi-view Clustering method with globally Fused Graph guidance (FIMCFG). Specifically, we design a dual-head graph convolutional encoder at each client to extract two kinds of underlying features containing global and view-specific information. Subsequently, under the guidance of the fused graph, the two underlying features are fused into high-level features, based on which clustering is conducted under the supervision of pseudo-labeling. Finally, the high-level features are uploaded to the server to refine the graph fusion and pseudo-labeling computation. Extensive experimental results demonstrate the effectiveness and superiority of FIMCFG.
Paperid:2958
Authors:Zeyuan Ma · Zhiguang Cao · Zhou Jiang · Hongshu Guo · Yue-Jiao Gong
Abstract:
Recent progress in Meta-Black-Box-Optimization (MetaBBO) has demonstrated that using RL to learn a meta-level policy for dynamic algorithm configuration (DAC) over an optimization task distribution could significantly enhance the performance of the low-level BBO algorithm. However, the online learning paradigms in existing works make the efficiency of MetaBBO problematic. To address this, we propose an offline learning-based MetaBBO framework in this paper, termed Q-Mamba, to attain both effectiveness and efficiency in MetaBBO. Specifically, we first transform the DAC task into a long-sequence decision process. This allows us to further introduce an effective Q-function decomposition mechanism to reduce the learning difficulty within the intricate algorithm configuration space. Under this setting, we propose three novel designs to meta-learn the DAC policy from offline data: we first propose a novel collection strategy for constructing an offline DAC experience dataset with balanced exploration and exploitation. We then establish a decomposition-based Q-loss that incorporates conservative Q-learning to promote stable offline learning from the offline dataset. To further improve offline learning efficiency, we equip our work with a Mamba architecture, which aids long-sequence learning effectiveness and efficiency through its selective state model and hardware-aware parallel scan, respectively. Through extensive benchmarking, we observe that Q-Mamba achieves competitive or even superior performance to prior online/offline baselines, while significantly improving the training efficiency of existing online baselines. We provide the source code of Q-Mamba at https://anonymous.4open.science/r/Q-Mamba-860A.
Paperid:2959
Authors:Mateo Espinosa Zarlenga · Gabriele Dominici · Pietro Barbiero · Zohreh Shams · Mateja Jamnik
Abstract:
In this paper, we investigate how concept-based models (CMs) respond to out-of-distribution (OOD) inputs. CMs are interpretable neural architectures that first predict a set of high-level concepts (e.g., "stripes", "black") and then predict a task label from those concepts. In particular, we study the impact of concept interventions (i.e., operations where a human expert corrects a CM’s mispredicted concepts at test time) on CMs' task predictions when inputs are OOD. Our analysis reveals a weakness in current state-of-the-art CMs, which we term leakage poisoning, that prevents them from properly improving their accuracy when intervened on for OOD inputs. To address this, we introduce MixCEM, a new CM that learns to dynamically exploit leaked information missing from its concepts only when this information is in-distribution. Our results across tasks with and without complete sets of concept annotations demonstrate that MixCEMs outperform strong baselines by significantly improving their accuracy for both in-distribution and OOD samples in the presence and absence of concept interventions.
Paperid:2960
Authors:Haoxuan Li · Zeyu Tang · Zhichao Jiang · Zhuangyan Fang · Yue Liu · zhi geng · Kun Zhang
Abstract:
Fairness in human and algorithmic decision-making is crucial in areas such as criminal justice, education, and social welfare. Recently, counterfactual fairness has drawn increasing research interest, suggesting that decision-making for individuals should remain the same when intervening with different values on protected attributes. Nevertheless, the question of "which attributes and individuals should be protected" is rarely discussed in the existing counterfactual fairness literature. For example, when considering leg disability as a protected attribute, the algorithms should not treat individuals with leg disabilities differently in college admissions, but one may naturally consider this factor when selecting runner athletes. In other words, when and how to enforce fairness is expected to depend on the causal relation between the protected attribute and the outcome of interest. Formally, this paper proposes principal counterfactual fairness using the concept of principal stratification from the causal inference literature, focusing on whether an algorithm is counterfactually fair for individuals whose protected attribute has no individual causal effect on the outcome of interest. To examine whether an algorithm satisfies principal counterfactual fairness, we derive the statistical bounds and propose a post-processing approach to achieving principal counterfactual fairness with minimal individual decision changes. Experiments are conducted using synthetic and real-world datasets to verify the effectiveness of our methods.
Paperid:2961
Authors:Zhiqiang Wang · Xiaoyi Wang · Jianqing Liang
Abstract:
Graph Neural Ordinary Differential Equations (GODEs) integrate the Variational Autoencoder (VAE) framework with differential equations, effectively modeling latent space uncertainty and continuous dynamics, and excelling at handling evolving and incomplete graph data. However, existing GODEs face challenges in capturing time-varying relationships and nonlinear node state evolution, which limits their ability to model complex dynamic graphs. To address these issues, we propose the ControlSynth Graph ODE (CSG-ODE). In the VAE encoding phase, CSG-ODE introduces an information transmission-based inter-node importance weighting mechanism, integrating it with latent correlations to guide adaptive graph convolutional recurrent networks for temporal node embedding. During decoding, CSG-ODE employs ODEs to model node dynamics, capturing nonlinear evolution through sub-networks with nonlinear activations. For high-stability scenarios, we extend CSG-ODE to stable CSG-ODE (SCSG-ODE) by constraining weight matrices to learnable anti-symmetric forms, theoretically ensuring enhanced stability. Experiments on traffic, motion capture, and simulated physical systems datasets demonstrate that CSG-ODE outperforms state-of-the-art GODEs, while SCSG-ODE achieves both superior performance and optimal stability.
Paperid:2962
Authors:Yusong Wu · Christos Tsirigotis · Ke Chen · Cheng-Zhi Anna Huang · Aaron Courville · Oriol Nieto · Prem Seetharaman · Justin Salamon
Abstract:
Recent multimodal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works use temporal-aware labels or unsupervised training to improve frame-wise capabilities, but they still lack fine-grained labeling capability to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient and calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalances during training. To enable frame-wise supervision, we leverage a large-scale dataset with diverse audio events, LLM-generated captions and simulation. Experimental results and case studies demonstrate that FLAM significantly improves the open-vocabulary localization capability while maintaining strong performance in global retrieval and downstream tasks.
Paperid:2963
Authors:Cheng · Hadi Amiri
Abstract:
Tool-augmented large language models (LLMs) may need to forget learned tools due to security concerns, privacy restrictions, or deprecated tools. However, ``tool unlearning'' has not been investigated in the machine unlearning literature. We introduce this novel task, which requires addressing distinct challenges compared to traditional unlearning: knowledge removal rather than forgetting individual samples, the high cost of optimizing LLMs, and the need for principled evaluation metrics. To bridge these gaps, we propose ToolDelete, the first approach for unlearning tools from tool-augmented LLMs, which implements three properties for effective tool unlearning, and a new membership inference attack (MIA) model for evaluation. Experiments on three tool learning datasets and tool-augmented LLMs show that ToolDelete effectively unlearns both randomly selected and category-specific tools, while preserving the LLM's knowledge on non-deleted tools and maintaining performance on general tasks.
Paperid:2964
Authors:Zhen Wang · Yilei JIANG · Dong Zheng · Jun Xiao · Long Chen
Abstract:
Customized Image Generation, generating customized images with user-specified concepts, has attracted significant attention due to its creativity and novelty. With impressive progress achieved in subject customization, some pioneering works further explored the customization of actions and interactions beyond entity (i.e., human, animal, and object) appearance. However, these approaches only focus on basic actions and interactions between two entities, and their effects are limited by insufficient ''exactly the same'' reference images. To extend customized image generation to more complex scenes for general real-world applications, we propose a new task: event-customized image generation. Given a single reference image, we define the ''event'' as all specific actions, poses, relations, or interactions between different entities in the scene. This task aims at accurately capturing the complex event and generating customized images with various target entities. To solve this task, we propose a novel training-free event customization method: FreeEvent. Specifically, FreeEvent introduces two extra paths alongside the general diffusion denoising process: 1) Entity switching path: it applies cross-attention guidance and regulation for target entity generation. 2) Event transferring path: it injects the spatial feature and self-attention maps from the reference image to the target image for event generation. To further facilitate this new task, we collected two evaluation benchmarks: SWiG-Event and Real-Event. Extensive experiments and ablations have demonstrated the effectiveness of FreeEvent.
Paperid:2965
Authors:Mauricio Soroco · Jialin Song · Mengzhou Xia · Kye Emond · Weiran Sun · Wuyang Chen
Abstract:
We present PDE-Controller, a framework that enables large language models (LLMs) to control systems governed by partial differential equations (PDEs). Traditional LLMs have excelled in commonsense reasoning but fall short in rigorous logical reasoning. While recent AI-for-math has made strides in pure mathematics, areas of applied mathematics, particularly PDEs, remain underexplored despite their significant real-world applications. Our approach enables LLMs to transform informal natural language instructions into formal specifications, and then execute reasoning and planning steps to improve the utility of PDE control. We build a holistic solution comprising datasets (both human-written cases and 2 million synthetic samples), math-reasoning models, and novel evaluation metrics, all of which require significant effort. Our PDE-Controller significantly outperforms the latest open-source and GPT models in reasoning, autoformalization, and program synthesis, achieving up to a 62% improvement in utility gain for PDE control. By bridging the gap between language generation and PDE systems, we demonstrate the potential of LLMs in addressing complex scientific and engineering challenges. We promise to release all data, model checkpoints, and code upon acceptance.
Paperid:2966
Authors:Yuanchao Dai · Ximing Li · Changchun Li
Abstract:
Training a precise binary classifier with limited supervision in weakly supervised learning scenarios is of considerable practical and research significance. Leveraging pairwise unlabeled data with confidence differences has been demonstrated to outperform learning from pointwise unlabeled data. We theoretically analyze the various supervisory signals reflected by confidence differences in confidence-difference (ConfDiff) classification and identify challenges arising from noisy signals when confidence differences are small. To address this, we partition the dataset into two subsets with distinct supervisory signals and propose a consistency regularization-based risk estimator that encourages similar outputs for similar instances, mitigating the impact of noisy supervision. We further derive and analyze its estimation error bounds theoretically. Extensive experiments on benchmark and UCI datasets demonstrate the effectiveness of our method. Additionally, to effectively capture the influence of real-world noise on the confidence difference, we artificially perturb the confidence difference distribution and demonstrate the robustness of our method under noisy conditions through comprehensive experiments.
Paperid:2967
Authors:Xutong Liu · Xiangxiang Dai · Jinhang Zuo · Siwei Wang · Carlee Joe-Wong · John C. S. Lui · Wei Chen
Abstract:
The combinatorial multi-armed bandit (CMAB) is a fundamental sequential decision-making framework, extensively studied over the past decade. However, existing work primarily focuses on the online setting, overlooking the substantial costs of online interactions and the readily available offline datasets. To overcome these limitations, we introduce Off-CMAB, the first offline learning framework for CMAB. Central to our framework is the combinatorial lower confidence bound (CLCB) algorithm, which combines pessimistic reward estimations with combinatorial solvers. To characterize the quality of offline datasets, we propose two novel data coverage conditions and prove that, under these conditions, CLCB achieves a near-optimal suboptimality gap, matching the theoretical lower bound up to a logarithmic factor. We validate Off-CMAB through practical applications, including learning to rank, large language model (LLM) caching, and social influence maximization, showing its ability to handle nonlinear reward functions, general feedback models, and out-of-distribution action samples that exclude optimal or even feasible actions. Extensive experiments on synthetic and real-world datasets further highlight the superior performance of CLCB.
Paperid:2968
Authors:Dahun Shin · Dongyeop Lee · Jinseok Chung · Namhoon Lee
Abstract:
Approximate second-order optimization methods often exhibit poorer generalization compared to first-order approaches. In this work, we look into this issue through the lens of the loss landscape and find that existing second-order methods tend to converge to sharper minima compared to SGD. In response, we propose Sassha, a novel second-order method designed to enhance generalization by explicitly reducing the sharpness of the solution, while stabilizing the computation of approximate Hessians along the optimization trajectory. In fact, this sharpness minimization scheme is also crafted to accommodate lazy Hessian updates, so as to secure efficiency besides flatness. To validate its effectiveness, we conduct a wide range of standard deep learning experiments where Sassha demonstrates outstanding generalization performance that is comparable to, and mostly better than, other methods. We provide a comprehensive set of analyses including convergence, robustness, stability, efficiency, and cost.
Paperid:2969
Authors:Stefano Cortinovis · Francois Caron
Abstract:
Prediction-powered inference (PPI) enables valid statistical inference by combining experimental data with machine learning predictions. When a sufficient number of high-quality predictions is available, PPI results in more accurate estimates and tighter confidence intervals than traditional methods. In this paper, we propose to inform the PPI framework with prior knowledge on the quality of the predictions. The resulting method, which we call frequentist, assisted by Bayes, PPI (FAB-PPI), improves over PPI when the observed prediction quality is likely under the prior, while maintaining its frequentist guarantees. Furthermore, when using heavy-tailed priors, FAB-PPI adaptively reverts to standard PPI in low prior probability regions. We demonstrate the benefits of FAB-PPI in real and synthetic examples.
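For orientation, the classical PPI point estimate of a population mean (which FAB-PPI builds on) combines the model's predictions on a large unlabeled set with a bias correction ("rectifier") measured on the small labeled set. The sketch below illustrates that standard estimator only, not the FAB-PPI prior-informed variant; the function name and toy data are illustrative assumptions.

```python
import numpy as np

def ppi_mean(y_labeled, preds_labeled, preds_unlabeled):
    """Classical PPI point estimate of a mean (sketch, not FAB-PPI).

    The unlabeled-set prediction average is debiased by the average
    residual (the "rectifier") observed on the labeled set.
    """
    rectifier = np.mean(y_labeled - preds_labeled)  # measured prediction bias
    return np.mean(preds_unlabeled) + rectifier

# Toy data: the model systematically undershoots by 1 on the labeled set.
y_lab = np.array([2.0, 3.0, 4.0])
f_lab = np.array([1.0, 2.0, 3.0])
f_unlab = np.array([2.0, 2.0, 2.0, 2.0])
print(ppi_mean(y_lab, f_lab, f_unlab))  # 3.0: unlabeled mean 2.0 + bias 1.0
```

FAB-PPI then replaces this fixed debiasing with a Bayes-assisted weighting that depends on a prior over prediction quality.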
Paperid:2970
Authors:Jie Wang · March Boedihardjo · Yao Xie
Abstract:
Optimal transport has been very successful for various machine learning tasks; however, it is known to suffer from the curse of dimensionality. Hence, dimensionality reduction is desirable when applied to high-dimensional data with low-dimensional structures. The kernel max-sliced (KMS) Wasserstein distance is developed for this purpose by finding an optimal nonlinear mapping that reduces data into $1$ dimension before computing the Wasserstein distance. However, its theoretical properties have not yet been fully developed. In this paper, we provide sharp finite-sample guarantees under milder technical assumptions compared with state-of-the-art for the KMS $p$-Wasserstein distance between two empirical distributions with $n$ samples for general $p\in[1,\infty)$. Algorithm-wise, we show that computing the KMS $2$-Wasserstein distance is NP-hard, and then we further propose a semidefinite relaxation (SDR) formulation (which can be solved efficiently in polynomial time) and provide a relaxation gap for the obtained solution. We provide numerical examples to demonstrate the good performance of our scheme for high-dimensional two-sample testing.
Paperid:2971
Authors:Evan Frick · Connor Chen · Joseph Tennyson · Tianle Li · Wei-Lin Chiang · Anastasios Angelopoulos · Ion Stoica
Abstract:
Language model evaluations typically rely on aggregated metrics like accuracy or human preference, averaging across users and prompts. This averaging obscures user- and prompt-specific variations in model performance. To address this, we propose Prompt-to-Leaderboard (P2L), a method that predicts prompt-specific leaderboards by training large language models on human preference data. The core idea is to train LLMs to output the coefficients of parametric regressions that represent per-prompt leaderboards. The resulting prompt-dependent leaderboards facilitate optimal routing of queries to models, personalized leaderboards, unsupervised task-specific performance analysis, and automated evaluation of model strengths and weaknesses. Live results from Chatbot Arena suggest that P2L better captures the nuanced landscape of language model performance than the averaged leaderboard. Furthermore, our findings suggest that P2L's ability to produce prompt-specific evaluations follows a power law scaling similar to that observed in LLMs themselves. In January 2025, the router we trained based on this methodology achieved the #1 spot in the Chatbot Arena leaderboard.
Paperid:2972
Authors:Shuai Yuan · Hongwei Li · Rui Zhang · Hangcheng Cao · Wenbo Jiang · Tao Ni · Wenshu Fan · Qingchuan Zhao · Guowen Xu
Abstract:
Deep learning models employed in face recognition (FR) systems have been shown to be vulnerable to physical adversarial attacks through various modalities, including patches, projections, and infrared radiation. However, existing adversarial examples targeting FR systems often suffer from issues such as conspicuousness, limited effectiveness, and insufficient robustness. To address these challenges, we propose a novel approach for adversarial face generation, UVHat, which utilizes ultraviolet (UV) emitters mounted on a hat to enable invisible and potent attacks in black-box settings. Specifically, UVHat simulates UV light sources via video interpolation and models the positions of these light sources on a curved surface, specifically the human head in our study. To optimize attack performance, UVHat integrates a reinforcement learning-based optimization strategy, which explores a vast parameter search space, encompassing factors such as shooting distance, power, and wavelength. Extensive experimental evaluations validate that UVHat substantially improves the attack success rate in black-box settings, enabling adversarial attacks from multiple angles with enhanced robustness.
Paperid:2973
Authors:Yishan Shen · Yuyang Ye · Hui Xiong · Yong Chen
Abstract:
Dynamic treatment regimes (DTRs) are critical to precision medicine, optimizing long-term outcomes through personalized, real-time decision-making in evolving clinical contexts, but they require careful supervision against unsafe treatment risks. Existing efforts rely primarily on clinician-prescribed gold standards despite the absence of a known optimal strategy, and predominantly use structured EHR data without extracting valuable insights from clinical notes, limiting their reliability for treatment recommendations. In this work, we introduce SAFER, a calibrated, risk-aware tabular-language recommendation framework for DTRs that integrates both structured EHRs and clinical notes, enabling them to learn from each other, and addresses inherent label uncertainty by treating the optimal treatment as ambiguous for deceased patients. Moreover, SAFER employs conformal prediction to provide statistical guarantees, ensuring safe treatment recommendations while filtering out uncertain predictions. Experiments on two publicly available sepsis datasets demonstrate that SAFER outperforms state-of-the-art baselines across multiple recommendation metrics and counterfactual mortality rate, while offering robust formal assurances. These findings underscore SAFER’s potential as a trustworthy and theoretically grounded solution for high-stakes DTR applications.
Paperid:2974
Authors:Ada Tur · Nicholas Meade · Xing Han Lù · Alejandra Zambrano · Arkil Patel · Esin Durmus · Spandana Gella · Karolina Stanczak · Siva Reddy
Abstract:
LLM-based agents are becoming increasingly proficient at solving web-based tasks. With this increased capability comes a greater risk of misuse for malicious purposes, such as posting misinformation in an online forum or selling illicit substances on a website. To evaluate these risks, we propose SafeArena, the first benchmark to focus on the deliberate misuse of web agents. SafeArena comprises 250 safe and 250 harmful tasks across four websites. We classify the harmful tasks into five harm categories (misinformation, illegal activity, harassment, cybercrime, and social bias) designed to assess realistic misuses of web agents. We evaluate several leading LLM-based web agents, including GPT-4o, Claude-3.5 Sonnet, Qwen-2 72B, and Llama-3.2 90B, on our benchmark. We find that agents are surprisingly compliant with malicious requests, with GPT-4o and Qwen-2 completing 22.8% and 26.0% of the harmful tasks, respectively. Furthermore, we show that the susceptibility of web agents to malicious requests can be increased by conditioning the agent on a partially completed malicious task; this increases attack success rates across all agents studied by as much as 18% (>72% relative increase) in some cases. Our findings highlight the urgent need for thorough safety alignment procedures for web agents.
Paperid:2975
Authors:Konrad Mundinger · Max Zimmer · Aldo Kiem · Christoph Spiegel · Sebastian Pokutta
Abstract:
We demonstrate how neural networks can drive mathematical discovery through a case study of the Hadwiger-Nelson problem, a long-standing open problem from discrete geometry and combinatorics about coloring the plane while avoiding monochromatic unit-distance pairs. Using neural networks as approximators, we reformulate this mixed discrete-continuous geometric coloring problem as an optimization task with a probabilistic, differentiable loss function. This enables gradient-based exploration of admissible configurations, which most notably led to the discovery of two novel six-colorings, providing the first improvements in thirty years to the off-diagonal variant of the original problem (Anonymous, 2024). Here, we establish the underlying machine learning approach used to obtain these results and demonstrate its broader applicability through additional results and numerical insights.
Paperid:2976
Authors:Hanyu Wang · Bochuan Cao · Yuanpu Cao · Jinghui Chen
Abstract:
Large language models (LLMs) are known to struggle with consistently generating truthful responses. While various representation intervention techniques have been proposed, these methods typically apply a universal representation correction vector to all input queries, limiting their effectiveness against diverse queries in practice. In this study, we introduce TruthFlow, a novel method that leverages the Flow Matching technique for query-specific truthful representation correction. Specifically, TruthFlow first uses a flow model to learn query-specific correction vectors that transition representations from hallucinated to truthful states. Then, during inference, the trained flow model generates these correction vectors to enhance the truthfulness of LLM outputs. Experimental results demonstrate that TruthFlow significantly improves performance on open-ended generation tasks across various advanced LLMs evaluated on TruthfulQA. Moreover, the trained TruthFlow model exhibits strong transferability, performing effectively on other unseen hallucination benchmarks.
Paperid:2977
Authors:Antoni Kowalczuk · Jan Dubiński · Franziska Boenisch · Adam Dziedzic
Abstract:
Image AutoRegressive generation has emerged as a powerful new paradigm, with image autoregressive models (IARs) surpassing state-of-the-art diffusion models (DMs) in both image quality (FID: 1.48 vs. 1.58) and generation speed. However, the privacy risks associated with IARs remain unexplored, raising concerns regarding their responsible deployment. To address this gap, we conduct a comprehensive privacy analysis of IARs, comparing their privacy risks to those of DMs as reference points. Concretely, we develop a novel membership inference attack (MIA) that achieves a remarkably high success rate in detecting training images (with a TPR@FPR=1\% of 86.38\% vs. 4.91\% for DMs with comparable attacks). We leverage our novel MIA to provide dataset inference (DI) for IARs, and show that it requires as few as 6 samples to detect dataset membership (compared to 200 for DI in DMs), confirming a higher information leakage in IARs. Finally, we are able to extract hundreds of training data points from an IAR (e.g., 698 from VAR-d30). Our results demonstrate a fundamental privacy-utility trade-off: while IARs excel in image generation quality and speed, they are significantly more vulnerable to privacy attacks compared to DMs. This trend suggests that utilizing techniques from DMs within IARs, such as modeling the per-token probability distribution using a diffusion procedure, holds potential to help mitigate IARs' vulnerability to privacy attacks.
Paperid:2978
Authors:Rishabh Tiwari · Haocheng Xi · Aditya Tomar · Coleman Hooper · Sehoon Kim · Maxwell Horton · Mahyar Najibi · Michael Mahoney · Kurt Keutzer · Amir Gholaminejad
Abstract:
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings, creating a growing need for fast and efficient long-context inference. In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency, as the full KV cache must be loaded for each decoding step. While speculative decoding is a widely accepted technique to accelerate autoregressive decoding, existing methods often struggle to achieve significant speedups due to inefficient KV cache optimization strategies and result in low acceptance rates. To address these challenges, we propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration. QuantSpec maintains high acceptance rates ($>$90\%) and reliably provides consistent end-to-end speedups of up to $\sim2.5\times$, outperforming other self-speculative decoding methods that use a sparse KV cache for long-context LLM inference. QuantSpec also reduces memory requirements by $\sim 1.3\times$ compared to these alternatives.
Paperid:2979
Authors:Hui Nie · Zhao Zhang · Yutao Cheng · Maoke Yang · Gonglei Shi · Qingsong Xie · Xinglong Wu · Jie Shao
Abstract:
We introduce a novel task called Layer Decomposition (LD), which focuses on decomposing a composite image (such as a poster) into individual, complete RGBA layers, along with additional metadata such as coordinates and the order of layers. This is particularly difficult because layers can hide parts of one another, and unraveling these complexities is essential for accurate restoration. LD transforms an image into structured data, offering significant benefits for the creation and collection of digital materials, as well as enabling precise image modification. We develop the Decompose Layer Model (DeaM), a Large Multimodal Model. DeaM integrates a conjoined visual encoder, a language model, and a condition-aware RGBA decoder. When processing an image, DeaM generates the metadata for each layer in the form of a JSON object. This metadata includes quantization encodings that represent each layer. These encodings are then used by the RGBA decoder to reconstruct the layer with its transparent channel intact. In addition to automatically decomposing all layers, DeaM can also receive text or point prompts to decompose specific layers. We prompt GPT-4 to collect instruction-tuning data for open queries.
Paperid:2980
Authors:Yong Liu · Di Fu · Shenggan Cheng · Zirui Zhu · Yang Luo · Minhao Cheng · Cho-Jui Hsieh · Yang You
Abstract:
Despite Low-Rank Adaptation (LoRA)'s popularity for fine-tuning large models, it often exhibits a noticeable performance gap compared to full fine-tuning, particularly in complex tasks such as mathematical reasoning and code generation. Motivated by this discrepancy, we propose a novel fusion approach for LoRA fine-tuned models. Our key insight is that LoRA models trained with different random seeds on the same task often exhibit complementary strengths. In contrast to existing research that typically focuses on fusing models trained on diverse tasks, we explore the potential of combining multiple LoRA models fine-tuned on the same task with different random seeds. This intra-task fusion method aims to leverage the strengths of various fine-tuned models to create a more robust and effective adaptation. To validate our approach, we conducted comprehensive experiments across three key areas: mathematical reasoning, code generation, and general instruction-tuning tasks. The results demonstrate that our fusion method significantly enhances LoRA's performance, outperforming both standalone LoRA models and current fusion methods. Notably, this advancement substantially narrows the gap between LoRA and full fine-tuning, thus offering a more effective approach to model adaptation without the GPU memory burden of full parameter fine-tuning.
Paperid:2981
Authors:Nannan Wu · Yuming Huang · Yiming Zhao · Jie Chen · Wenjun Wang
Abstract:
Subgraph representation learning has attracted growing interest due to its wide applications in various domains. However, existing methods primarily focus on local neighborhood structures while overlooking the significant impact of global structural information, in particular the influence of multi-hop neighbors beyond immediate neighborhoods. This presents two key challenges: how to effectively capture the structural relationships between distant nodes, and how to prevent excessive aggregation of global structural information from weakening the discriminative ability of subgraph representations. To address these challenges, we propose GPEN (Global Position Encoding Network). GPEN leverages a hierarchical tree structure to encode each node's global position based on its path distance to the root node, enabling a systematic way to capture relationships between distant nodes. Furthermore, we introduce a boundary-aware convolution module that selectively integrates global structural information while maintaining the unique structural patterns of each subgraph. Extensive experiments on eight public datasets show that GPEN significantly outperforms state-of-the-art methods in subgraph representation learning.
Paperid:2982
Authors:Anders Aamand · Justin Chen · Mina Dalirrooyfard · Slobodan Mitrovic · Yuriy Nevmyvaka · Sandeep Silwal · Yinzhan Xu
Abstract:
We study differentially private algorithms for graph cut sparsification, a fundamental problem in algorithms, privacy, and machine learning. While significant progress has been made, the best-known private and efficient cut sparsifiers on $n$-node graphs approximate each cut within $\widetilde{O}(n^{1.5})$ additive error and $1+\gamma$ multiplicative error for any $\gamma > 0$ [Gupta, Roth, Ullman TCC'12]. In contrast, \emph{inefficient} algorithms, i.e., those requiring exponential time, can achieve an $\widetilde{O}(n)$ additive error and $1+\gamma$ multiplicative error [Eli{\'a}{\v{s}}, Kapralov, Kulkarni, Lee SODA'20]. In this work, we break the $n^{1.5}$ additive error barrier for private and efficient cut sparsification. We present an $(\varepsilon,\delta)$-DP polynomial time algorithm that, given a non-negative weighted graph, outputs a private synthetic graph approximating all cuts with multiplicative error $1+\gamma$ and additive error $n^{1.25 + o(1)}$ (ignoring dependencies on $\varepsilon, \delta, \gamma$). At the heart of our approach lies a private algorithm for expander decomposition, a popular and powerful technique in (non-private) graph algorithms.
Paperid:2983
Authors:François Cornet · Federico Bergamin · Arghya Bhowmik · Juan Garcia-Lastra · Jes Frellsen · Mikkel Schmidt
Abstract:
Generative modeling of crystalline materials using diffusion models presents a series of challenges: the data distribution is characterized by inherent symmetries and involves multiple modalities, with some defined on specific manifolds. Notably, the treatment of fractional coordinates representing atomic positions in the unit cell requires careful consideration, as they lie on a hypertorus. In this work, we introduce Kinetic Langevin Diffusion for Materials (KLDM), a novel diffusion model for crystalline materials generation, where the key innovation resides in the modeling of the coordinates. Instead of resorting to Riemannian diffusion on the hypertorus directly, we generalize Trivialized Diffusion Models (TDM) to account for the symmetries inherent to crystals. By coupling coordinates with auxiliary Euclidean variables representing velocities, the diffusion process is now offset to a flat space. This allows us to effectively perform diffusion on the hypertorus without approximation, while providing a training objective consistent with the periodic translation symmetry of the true data distribution. We evaluate KLDM on both Crystal Structure Prediction (CSP) and De-novo Generation (DNG) tasks, demonstrating its competitive performance with current state-of-the-art models.
Paperid:2984
Authors:Lorenz Kummer · Samir Moustafa · Anatol Ehrlich · Franka Bause · Nikolaus Suess · Wilfried Gansterer · Nils M. Kriege
Abstract:
The lottery ticket hypothesis (LTH) is well-studied for convolutional neural networks but has been validated only empirically for graph neural networks (GNNs), for which theoretical findings are largely lacking. In this paper, we identify the expressivity of sparse subnetworks, i.e., their ability to distinguish non-isomorphic graphs, as crucial for finding winning tickets that preserve predictive performance. We establish conditions under which the expressivity of a sparsely initialized GNN matches that of the full network, particularly when compared to the Weisfeiler-Leman test, and subsequently show that increased expressivity in the initialization potentially accelerates model convergence and improves generalization. Our findings establish novel theoretical foundations for both LTH and GNN research, highlighting the importance of maintaining expressivity in sparsely initialized GNNs. We illustrate our results using examples from drug discovery.
Paperid:2985
Authors:Ziming Zhu · Yu Zhu · Jiahao Chen · Xiaofeng Ling · Huanlei Chen · Lihua Sun
Abstract:
Recently, image-based 3D semantic occupancy prediction has become a hot topic in 3D scene understanding for autonomous driving. Compared with the bounding box form of 3D object detection, the ability to describe the fine-grained contours of any obstacles in the scene is the key insight of voxel occupancy representation, which facilitates subsequent tasks of autonomous driving. In this work, we propose CSV-Occ to address the following two challenges: (1) Existing methods fuse temporal information based on the attention mechanism, but are limited by high complexity. We extend the state space model to support multi-input sequence interaction and conduct temporal modeling in a cascaded architecture, thereby reducing the computational complexity from quadratic to linear. (2) Existing methods are limited by semantic ambiguity, resulting in the centers of foreground objects often being predicted as empty voxels. We enable the model to explicitly vote for the instance center to which the voxels belong and spontaneously learn to utilize the other voxel features of the same instance to update the semantics of the internal vacancies of the objects from coarse to fine. Experiments on the Occ3D-nuScenes dataset show that our method achieves state-of-the-art in camera-based 3D semantic occupancy prediction and also performs well on lidar point cloud semantic segmentation on the nuScenes dataset. Therefore, we believe that CSV-Occ is beneficial to the community and industry of autonomous vehicles.
Paperid:2986
Authors:Aaron Wang · William Convertino · Xiang Cheng · Ricardo Henao · Lawrence Carin
Abstract:
In-context learning based on attention models is examined for data with categorical outcomes, with inference in such models viewed from the perspective of functional gradient descent (GD). We develop a network composed of attention blocks, with each block employing a self-attention layer followed by a cross-attention layer, with associated skip connections. This model can exactly perform multi-step functional GD inference for in-context inference with categorical observations. We perform a theoretical analysis of this setup, generalizing many prior assumptions in this line of work, including the class of attention mechanisms for which it is appropriate. We demonstrate the framework empirically on synthetic data, image classification and language generation.
Paperid:2987
Authors:Xuandong Zhao · Xianjun Yang · Tianyu Pang · Chao Du · Lei Li · Yu-Xiang Wang · William Wang
Abstract:
Large language models (LLMs) are vulnerable to jailbreak attacks that result in harmful, unethical, or biased text generations. However, existing jailbreaking methods are computationally costly. In this paper, we propose the weak-to-strong jailbreaking attack, an efficient inference-time attack on aligned LLMs to produce harmful text. Our key intuition is based on the observation that jailbroken and aligned models differ only in their initial decoding distributions. The weak-to-strong attack's key technical insight is using two smaller models (a safe and an unsafe one) to adversarially modify a significantly larger safe model's decoding probabilities. We evaluate the weak-to-strong attack on 5 diverse open-source LLMs from 3 organizations. The results show our method can increase the misalignment rate to over 99\% on two datasets with just one forward pass per example. Our study exposes an urgent safety issue that needs to be addressed when aligning LLMs. As an initial attempt, we propose a defense strategy to protect against such attacks, but creating more advanced defenses remains challenging. We provide the code in the supplementary materials.
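The general flavor of such logit-space steering can be sketched as follows: the large model's next-token logits are shifted by the gap between a small unsafe model and a small safe model. This is a hedged illustration of the idea, not the paper's exact formulation; the function name, the additive form, and the `alpha` scaling are assumptions made for the sketch.

```python
import numpy as np

def steer_logits(strong, weak_safe, weak_unsafe, alpha=1.0):
    # Sketch: shift the large aligned model's next-token logits by the
    # (unsafe - safe) gap measured on two small models sharing the vocab.
    return strong + alpha * (weak_unsafe - weak_safe)

strong = np.array([2.0, 0.0, -1.0])      # large aligned model's logits
weak_safe = np.array([1.0, 0.0, 0.0])    # small aligned model's logits
weak_unsafe = np.array([0.0, 1.0, 0.0])  # small unaligned model's logits
print(steer_logits(strong, weak_safe, weak_unsafe))  # [ 1.  1. -1.]
```

In this toy example, the token favored by the unsafe weak model gains probability mass in the steered distribution while the safe-model-preferred token loses it.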
Paperid:2988
Authors:Pouria Fatemi · Ehsan Sharifian · Mohammad Hossein Yassaee
Abstract:
Counterfactual explanations enhance interpretability by identifying alternative inputs that produce different outputs, offering localized insights into model decisions. However, traditional methods often neglect causal relationships, leading to unrealistic examples. While newer approaches integrate causality, they are computationally expensive. To address these challenges, we propose an efficient method based on backtracking counterfactuals that incorporates causal reasoning to generate actionable explanations. We first examine the limitations of existing methods and then introduce our novel approach and its features. We also explore the relationship between our method and previous techniques, demonstrating that it generalizes them in specific scenarios. Finally, experiments show that our method provides deeper insights into model outputs.
Paperid:2989
Authors:Zhiyang Wang · Juan Cervino · Alejandro Ribeiro
Abstract:
Graph Neural Networks (GNNs) extend convolutional neural networks to operate on graphs. Despite their impressive performance in various graph learning tasks, the theoretical understanding of their generalization capability is still lacking. Previous GNN generalization bounds ignore the underlying graph structures, often leading to bounds that increase with the number of nodes – a behavior contrary to the one experienced in practice. In this paper, we take a manifold perspective to establish the statistical generalization theory of GNNs on graphs sampled from a manifold in the spectral domain. Consistent with empirical observations, we prove that the generalization bounds of GNNs decrease linearly with the logarithm of the graph size, and increase linearly with the spectral continuity constants of the filter functions. Notably, our theory covers both node-level and graph-level tasks. Our result has two implications: i) guaranteeing the generalization of GNNs to unseen data over manifolds; ii) providing insights into the practical design of GNNs, i.e., restrictions on the discriminability of GNNs are necessary to obtain better generalization performance. We demonstrate our generalization bounds for GNNs using synthetic and multiple real-world datasets.
Paperid:2990
Authors:Zhuowei Li · Haizhou Shi · Yunhe Gao · Di Liu · Zhenting Wang · Yuxiao Chen · Ting Liu · Long Zhao · Hao Wang · Dimitris Metaxas
Abstract:
Large Vision-Language Models (LVLMs) can reason effectively over both textual and visual inputs, but they tend to hallucinate syntactically coherent yet visually ungrounded content. In this paper, we investigate the internal dynamics of hallucination by examining the rankings of token logits throughout the generation process, revealing three key patterns in how LVLMs process information: (1) gradual visual information loss: visually grounded tokens gradually become less favored throughout generation; (2) early excitation: semantically meaningful tokens achieve peak activation in layers earlier than the final layer; and (3) hidden genuine information: visually grounded tokens, though not ultimately decoded, still retain relatively high rankings at inference. Based on these insights, we propose VISTA (Visual Information Steering with Token-logit Augmentation), a training-free inference-time intervention framework that reduces hallucination while promoting genuine information. VISTA works by combining two complementary approaches: reinforcing visual information in activation space and leveraging early-layer activations to promote semantically meaningful decoding. Compared to existing methods, VISTA requires no external supervision and is applicable to various decoding strategies. Extensive experiments show that VISTA reduces hallucination by about 40% on average on the evaluated open-ended generation task, and it consistently outperforms existing methods on four benchmarks across four architectures under three decoding strategies.
Paperid:2991
Authors:Zian Li · Cai Zhou · Xiyuan Wang · Xingang Peng · Muhan Zhang
Abstract:
Recent advancements in molecular generative models have demonstrated substantial potential in accelerating scientific discovery, particularly in drug design. However, these models often face challenges in generating high-quality molecules, especially in conditional scenarios where specific molecular properties must be satisfied. In this work, we introduce GeoRCG, a general framework to enhance the performance of molecular generative models by integrating geometric representation conditions with provable theoretical guarantees. We decompose the molecule generation process into two stages: first, generating an informative geometric representation; second, generating a molecule conditioned on the representation. Compared to directly generating a molecule, the relatively easy-to-generate representation in the first stage guides the second-stage generation to reach a high-quality molecule in a more goal-oriented and much faster way. Leveraging EDM and SemlaFlow as the base generators, we observe significant quality improvements in unconditional molecule generation tasks on the widely-used QM9 and GEOM-DRUG datasets. More notably, in the challenging conditional molecular generation task, our framework achieves an average 31% performance improvement over state-of-the-art approaches, highlighting the superiority of conditioning on semantically rich geometric representations over conditioning on individual property values as in previous approaches. Furthermore, we show that, with such representation guidance, the number of diffusion steps can be reduced to as small as 100 while largely preserving the generation quality achieved with 1,000 steps, thereby significantly accelerating the generation process.
Paperid:2992
Authors:Grace Luo · Trevor Darrell · Amir Bar
Abstract:
Autoregressive vision-language models (VLMs) can handle many tasks within a single model, yet the representations that enable this capability remain opaque. We find that VLMs align conceptually equivalent inputs into a shared task vector, which is invariant to modality (text, image) and format (examples, instruction), and may simplify VLM processing. We measure this alignment via cross-modal transfer--the ability of a task vector derived in one modality to trigger the correct generation in another--on a range of tasks and model architectures. Although the task vector is highly compressed, we find that this single vector outperforms prompting the model with the full task information, a result unique to this cross-modal case. Furthermore, we show that task vectors can be transferred from a base language model to its fine-tuned vision-language counterpart, and that they can be derived solely from instructions without the need for examples. Taken together, our findings shed light on how VLMs internally process task information, and how they map different modalities into common semantic representations.
Paperid:2993
Authors:Jintao Zhang · Chendong Xiang · Haofeng Huang · Jia wei · Haocheng Xi · Jun Zhu · Jianfei Chen
Abstract:
An efficient attention implementation is essential for large models due to its quadratic time complexity. Fortunately, attention commonly exhibits sparsity, i.e., many values in the attention map are near zero, allowing for the omission of corresponding computations. Many studies have utilized sparse patterns to accelerate attention. However, most existing works focus on optimizing attention within specific models by exploiting certain sparse patterns of the attention map. A universal sparse attention that guarantees both the speedup and end-to-end performance of diverse models remains elusive. In this paper, we propose SpargeAttn, a universal sparse and quantized attention for any model. Our method uses a two-stage online filter: in the first stage, we rapidly and accurately predict the attention map, enabling some matrix multiplications in attention to be skipped. In the second stage, we design an online softmax-aware filter that incurs no extra overhead and further skips some matrix multiplications. Experiments show that our method significantly accelerates diverse models, including language, image, and video generation models, without sacrificing end-to-end metrics.
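As a rough illustration of the stage-1 idea (not the paper's actual predictor), one can score each query-block/key-block pair cheaply via block-mean inner products and skip the attention matmuls for low-scoring pairs. The function name, block layout, and `keep_ratio` below are all hypothetical:

```python
def predict_sparse_blocks(q_blocks, k_blocks, keep_ratio=0.5):
    """Cheap sparsity prediction in the spirit of SpargeAttn's first stage:
    score (query-block, key-block) pairs by the inner product of their mean
    vectors and keep only the top fraction; the full matmuls for the other
    pairs can then be skipped. Purely illustrative -- the real predictor and
    the softmax-aware second-stage filter are more involved."""
    def block_mean(block):
        n = len(block)
        return [sum(row[d] for row in block) / n for d in range(len(block[0]))]

    scores = {}
    for qi, qb in enumerate(q_blocks):
        qm = block_mean(qb)
        for ki, kb in enumerate(k_blocks):
            km = block_mean(kb)
            scores[(qi, ki)] = sum(a * b for a, b in zip(qm, km))
    n_keep = max(1, round(len(scores) * keep_ratio))
    # keep the highest-scoring block pairs; the rest are pruned
    return set(sorted(scores, key=scores.get, reverse=True)[:n_keep])

# one query block clearly aligned with the first key block
q_blocks = [[[1.0, 0.0], [1.0, 0.0]]]
k_blocks = [[[1.0, 0.0], [1.0, 0.0]],   # aligned with the query -> kept
            [[0.0, 1.0], [0.0, 1.0]]]   # orthogonal -> pruned
kept = predict_sparse_blocks(q_blocks, k_blocks, keep_ratio=0.5)
```

With `keep_ratio=0.5`, only the aligned pair `(0, 0)` survives, so half of the block matmuls would be skipped in this toy case.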
Paperid:2994
Authors:Chris Pedersen · Laure Zanna · Joan Bruna
Abstract:
Autoregressive surrogate models (or emulators) of spatiotemporal systems provide an avenue for fast, approximate predictions, with broad applications across science and engineering. At inference time, however, these models are generally unable to provide predictions over long time rollouts due to accumulation of errors leading to diverging trajectories. In essence, emulators operate out of distribution, and controlling the online distribution quickly becomes intractable in large-scale settings. To address this fundamental issue, and focusing on time-stationary systems admitting an invariant measure, we leverage diffusion models to obtain an implicit estimator of the score of this invariant measure. We show that this model of the score function can be used to stabilize autoregressive emulator rollouts by applying on-the-fly denoising during inference, a process we call thermalization. Thermalizing an emulator rollout is shown to extend the time horizon of stable predictions by an order of magnitude in complex systems exhibiting turbulent and chaotic behavior, opening up a novel application of diffusion models in the context of neural emulation.
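The thermalization loop can be sketched with toy scalar dynamics. Here `emulator` and `denoise` are illustrative stand-ins for the neural emulator and the diffusion-based denoiser, and the `every=5` schedule is an assumption of this sketch:

```python
def thermalized_rollout(x0, emulator, denoise, steps, every=5):
    """Interleave on-the-fly denoising with autoregressive emulator steps
    (sketch). The denoiser periodically pulls the rollout back toward the
    invariant measure, preventing error accumulation from diverging."""
    x, xs = x0, [x0]
    for t in range(1, steps + 1):
        x = emulator(x)          # one autoregressive prediction step
        if t % every == 0:
            x = denoise(x)       # "thermalization" step
        xs.append(x)
    return xs

# toy system: the emulator drifts away from the attractor at 0,
# while the denoiser contracts states back toward it
emulator = lambda x: 1.3 * x
denoise = lambda x: 0.1 * x

plain = [1.0]
for _ in range(20):
    plain.append(emulator(plain[-1]))          # diverges geometrically
stabilized = thermalized_rollout(1.0, emulator, denoise, steps=20)
```

In this toy, the plain rollout blows up (1.3^20 ≈ 190) while the thermalized one stays bounded, mirroring the paper's qualitative claim.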
Paperid:2995
Authors:Guanzheng Chen · Qilong Feng · Jinjie Ni · Xin Li · Michael Shieh
Abstract:
The emergence of long-context large language models (LLMs) offers a promising alternative to traditional retrieval-augmented generation (RAG) for processing extensive documents. However, the computational overhead of long-context inference, particularly in managing key-value (KV) caches, presents significant efficiency challenges. While Speculative Decoding (SD) traditionally accelerates inference using smaller draft models, its effectiveness diminishes substantially in long-context scenarios due to memory-bound KV cache operations. We present Retrieval-Augmented Speculative Decoding (RAPID), which leverages RAG for both accelerating and enhancing generation quality in long-context inference. RAPID introduces the RAG drafter—a draft LLM operating on shortened retrieval contexts—to speculate on the generation of long-context target LLMs. Our approach enables a new paradigm where same-scale or even larger LLMs can serve as RAG drafters while maintaining computational efficiency. To fully leverage the potentially superior capabilities from stronger RAG drafters, we develop an inference-time knowledge transfer dynamic that enriches the target distribution by RAG. Extensive experiments on the LLaMA-3.1 and Qwen2.5 backbones demonstrate that RAPID effectively integrates the strengths of both approaches, achieving significant performance improvements (e.g., from 39.33 to 42.83 on InfiniteBench for LLaMA-3.1-8B) with more than 2$\times$ speedups. Our analyses reveal that RAPID achieves robust acceleration beyond 32K context length and demonstrates superior generation quality in real-world applications.
Paperid:2996
Authors:Xialie Zhuang · Zhikai Jia · Jianjin Li · Zhenyu Zhang · Li Shen · Zheng Cao · Shiwei Liu
Abstract:
Large Language Models (LLMs) have been found to struggle with accurately retrieving key information. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to enhance the latter's in-context retrieval capabilities. Specifically, MEAP first randomly masks a small fraction of input tokens and then directly performs standard autoregressive next-token prediction using a decoder-only Transformer. MEAP eliminates the need for bidirectional attention or encoder-decoder architectures for MLM, incurring no additional computational overhead during pre-training or inference. Extensive experiments demonstrate that MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks, while performing on par or better on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. Our analysis indicates that MEAP’s effectiveness arises from its ability to promote more distinguishable attention scores by concentrating on a reduced set of non-masked tokens. This mechanism improves the model’s focus on task-relevant signals while mitigating the influence of peripheral context. These findings position MEAP as a promising training paradigm for large language models. Code has been submitted.
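The input construction described above can be sketched in a few lines: mask a fraction of tokens, but keep the objective as plain next-token prediction against the clean sequence. `MASK_ID`, the function name, and the default ratio are illustrative, not from the paper:

```python
import random

MASK_ID = -1  # hypothetical id of the [MASK] token

def meap_batch(tokens, mask_ratio=0.15, seed=0):
    """MEAP-style training pair (sketch): randomly mask a small fraction of
    the input tokens, then train with standard autoregressive next-token
    prediction -- the decoder predicts the ORIGINAL token t+1 from the
    (possibly masked) prefix; no encoder or bidirectional attention."""
    rng = random.Random(seed)
    masked = [MASK_ID if rng.random() < mask_ratio else t for t in tokens]
    inputs = masked[:-1]   # what the decoder-only model sees
    targets = tokens[1:]   # supervision is always the clean sequence
    return inputs, targets

inputs, targets = meap_batch([5, 6, 7, 8, 9], mask_ratio=0.0)
```

With `mask_ratio=0.0` this reduces exactly to vanilla NTP, which is why the paradigm adds no architectural changes or inference-time cost.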
Paperid:2997
Authors:Seungho Baek · Taegeon Park · Jongchan Park · Seungjun Oh · Yusung Kim
Abstract:
Existing offline hierarchical reinforcement learning (HRL) methods rely on high-level policy learning to generate subgoal sequences, but their efficiency degrades as task horizons increase, and they lack effective strategies for stitching useful state transitions across different trajectories. We propose Graph-Assisted Stitching (GAS), a novel framework that formulates subgoal selection as a graph search problem rather than learning an explicit high-level policy. By embedding states into a Temporal Distance representation space, GAS clusters semantically similar states from different trajectories into unified graph nodes, enabling efficient transition stitching. A shortest-path algorithm is then applied to select subgoal sequences within the graph, while a low-level policy learns to reach the subgoals. To improve graph quality, we introduce the Temporal Efficiency metric, which filters out noisy or inefficient transition states, significantly enhancing task performance. GAS outperforms prior hierarchical RL methods across locomotion, navigation, and manipulation tasks, achieving up to an 83.6 percentage-point improvement in the most stitching-critical task. Our source code is included in the supplementary material.
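The stitching-via-graph-search idea can be sketched as follows. Here `cluster_of` stands in for the learned temporal-distance clustering, and the state/cluster names are invented for illustration:

```python
from collections import defaultdict, deque

def stitch_subgoals(transitions, cluster_of, start, goal):
    """Graph-Assisted Stitching sketch: states from different trajectories
    that fall into the same cluster share a graph node, so transitions are
    stitched across trajectories, and subgoal selection becomes a
    shortest-path query instead of a learned high-level policy."""
    graph = defaultdict(set)
    for s, s_next in transitions:
        a, b = cluster_of[s], cluster_of[s_next]
        if a != b:
            graph[a].add(b)
    prev = {start: None}
    queue = deque([start])
    while queue:                      # BFS = shortest path with unit costs
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]         # subgoal sequence for the low-level policy
        for nxt in sorted(graph[node]):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None                       # goal unreachable in the graph

# two trajectories that only reach the goal once stitched through cluster C
transitions = [("s0", "s1"), ("s1", "s2"),      # trajectory 1 ends in C
               ("s3", "s4"), ("s4", "s5")]      # trajectory 2 starts in C
cluster_of = {"s0": "A", "s1": "B", "s2": "C",
              "s3": "C", "s4": "D", "s5": "E"}
subgoals = stitch_subgoals(transitions, cluster_of, "A", "E")
```

Neither trajectory alone connects A to E; merging `s2` and `s3` into cluster C is what makes the stitched path exist.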
Paperid:2998
Authors:Xingjian Wu · Xiangfei Qiu · Hongfan Gao · Jilin Hu · Chenjuan Guo · Bin Yang
Abstract:
Probabilistic Time Series Forecasting (PTSF) plays a crucial role in decision-making across various fields, including economics, energy, and transportation. Most existing methods excel at short-term forecasting, while overlooking the hurdles of Long-term Probabilistic Time Series Forecasting (LPTSF). As the forecast horizon extends, the inherent nonlinear dynamics have a significant adverse effect on prediction accuracy, and make generative models inefficient by increasing the cost of each iteration. To overcome these limitations, we introduce $K^2$VAE, an efficient VAE-based generative model that leverages a KoopmanNet to transform nonlinear time series into a linear dynamical system, and devises a KalmanNet to refine predictions and model uncertainty in such a linear system, which reduces error accumulation in long-term forecasting. Extensive experiments demonstrate that $K^2$VAE outperforms state-of-the-art methods in both short- and long-term PTSF, providing a more efficient and accurate solution.
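The Koopman idea underlying the KoopmanNet can be illustrated in its simplest scalar form: fit a linear operator that best explains one-step dynamics. The paper learns a lifted, multivariate operator with a neural network; the least-squares fit below is only the degenerate 1-D special case:

```python
def fit_koopman_1d(series):
    """Fit the scalar linear operator a minimizing sum (x_{t+1} - a*x_t)^2
    over a trajectory, i.e. represent the one-step dynamics as a linear
    system x_{t+1} = a*x_t. This is the simplest instance of the Koopman
    linearization that K^2VAE performs in a learned latent space."""
    num = sum(x * y for x, y in zip(series[:-1], series[1:]))
    den = sum(x * x for x in series[:-1])
    return num / den

# a trajectory that is exactly linear in this sense is recovered perfectly
xs = [1.0 * 0.9 ** t for t in range(10)]
a = fit_koopman_1d(xs)
```

Once the dynamics are (approximately) linear, long-horizon prediction reduces to repeated application of the operator, which is where Kalman-style refinement of predictions and uncertainty becomes natural.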
Paperid:2999
Authors:Tian-Zuo Wang · Wen-Bo Du · Zhi-Hua Zhou
Abstract:
In causality studies, the maximal ancestral graph (MAG) is widely used to characterize the causal relations among observable variables in the presence of latent variables. However, given observational data, only a partial ancestral graph representing a Markov equivalence class (MEC) of MAGs is identifiable, which generally contains uncertain causal relations. Due to the uncertainties, \emph{MAG listing}, \emph{i.e.}, listing all the MAGs in the MEC, is critical for many downstream tasks. In this paper, we present the first \emph{polynomial-delay} MAG listing method, where delay refers to the time for outputting each MAG, by introducing enumerated structural knowledge in the form of \emph{singleton background knowledge (BK)}. To incorporate such knowledge, we propose \emph{sound} and \emph{locally complete} orientation rules. By recursively introducing singleton BK and applying the rules, our method can output all and only the MAGs in the MEC with polynomial delay. Additionally, while the proposed novel rules enable a more efficient MAG listing process, for the goal of incorporating BK, we reveal that existing rules, including ours, are not yet \emph{complete}, which motivates further research. Experimental results validate the efficiency of the proposed MAG listing method.
Paperid:3000
Authors:Vladimir Braverman · Prathamesh Dharangutte · Shaofeng Jiang · Hoai-An Nguyen · Chen Wang · Yubo Zhang · Samson Zhou
Abstract:
We study fair clustering problems in a setting where distance information is obtained from two sources: a strong oracle providing exact distances, but at a high cost, and a weak oracle providing potentially inaccurate distance estimates at a low cost. The goal is to produce a near-optimal *fair* clustering on $n$ input points with a minimum number of strong oracle queries. This models the increasingly common trade-off between accurate but expensive similarity measures (e.g., large-scale embeddings) and cheaper but inaccurate alternatives. The study of fair clustering in this model is motivated by the important quest of achieving fairness in the presence of inaccurate information. We achieve the first $(1+\varepsilon)$-coresets for fair $k$-median clustering using $\text{poly}\left(\frac{k}{\varepsilon}\cdot\log n\right)$ queries to the strong oracle. Furthermore, our results imply coresets for the standard setting (without fairness constraints), and we can in fact obtain $(1+\varepsilon)$-coresets for $(k,z)$-clustering for general $z=O(1)$ with a similar number of strong oracle queries. By contrast, previous results achieved a constant-factor $(>10)$ approximation for the standard $k$-clustering problems, and no previous work considered the fair $k$-median clustering problem.
Paperid:3001
Authors:Jan Ludziejewski · Maciej Pióro · Jakub Krajewski · Michał Krutul · Jan Małaśnicki · Maciej Stefaniak · Piotr Sankowski · Marek Cygan · Kamil Adamczewski · Piotr Milos · Sebastian Jaszczur
Abstract:
Mixture of Experts (MoE) architectures have significantly increased computational efficiency in both research and real-world applications of large-scale machine learning models. However, their scalability and efficiency under memory constraints remain relatively underexplored. In this work, we present joint scaling laws for dense and MoE models, incorporating key factors such as the number of active parameters, dataset size, and the number of experts. Our findings provide a principled framework for selecting the optimal MoE configuration under fixed memory and compute budgets. Surprisingly, we show that MoE models can be more memory-efficient than dense models, contradicting conventional wisdom. Extensive empirical validation confirms the theoretical predictions of our scaling laws. These results offer actionable insights for designing and deploying MoE models in practical large-scale training scenarios.
Paperid:3002
Authors:Won-Jun Jang · Hyeon-Seo Park · Si-Hyeon Lee
Abstract:
Federated ensemble distillation addresses client heterogeneity by generating pseudo-labels for an unlabeled server dataset based on client predictions and training the server model using the pseudo-labeled dataset. The unlabeled server dataset can either be pre-existing or generated through a data-free approach. The effectiveness of this approach critically depends on the method of assigning weights to client predictions when creating pseudo-labels, especially in highly heterogeneous settings. Inspired by theoretical results from GANs, we propose a provably near-optimal weighting method that leverages client discriminators trained with a server-distributed generator and local datasets. Our experiments on various image classification tasks demonstrate that the proposed method significantly outperforms baselines. Furthermore, we show that the additional communication cost, client-side privacy leakage, and client-side computational overhead introduced by our method are negligible, both in scenarios with and without a pre-existing server dataset.
Paperid:3003
Authors:Chenxiao Yang · Nati Srebro · David McAllester · Zhiyuan Li
Abstract:
We introduce PENCIL to address the fundamental write-only limitation of CoT by incorporating a reduction mechanism into the generation process, which allows the model to automatically clean up intermediate computations that would otherwise accumulate indefinitely in the context. With the reduction mechanism, the maximal context length during generation is reduced from the time complexity of solving the problem, which is often exponential for hard tasks, to the actual space required, which is polynomial. By managing space efficiently, PENCIL can generate longer thoughts using small memory and thus scale up to larger problems with more inference time. Notably, PENCIL achieves 97% accuracy on the challenging Einstein's puzzle using a 25M-parameter transformer with a 2048 context length and an average inference time of 43 seconds. Theoretically, we show PENCIL can perform universal computation by simulating a Turing machine with optimal time and space complexity.
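A toy version of such a reduction mechanism can be written as a rewriting rule over the token stream: once a piece of scratch work closes, its body is erased from the context. The special-token names and the exact rule here are illustrative, not necessarily the paper's:

```python
def pencil_reduce(tokens, sep="[SEP]", ret="[RETURN]"):
    """Toy reduction in the spirit of PENCIL: scratch work opened by [SEP]
    is erased from the context when the matching [RETURN] is generated,
    so only tokens produced outside the scratch region remain. Supports
    nested scratch regions via a stack of open positions."""
    out, opens = [], []
    for tok in tokens:
        if tok == sep:
            opens.append(len(out))       # remember where scratch work began
        elif tok == ret:
            del out[opens.pop():]        # drop the finished intermediate steps
        else:
            out.append(tok)
    return out

# the context after generation keeps only the answer, not the thought
ctx = pencil_reduce(["Q", "[SEP]", "t1", "t2", "t3", "[RETURN]", "A"])
```

The live context length is bounded by the deepest unreduced scratch region rather than the total number of generated tokens, which is the time-vs-space distinction the abstract describes.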
Paperid:3004
Authors:Chengzu Li · Wenshan Wu · Huanyu Zhang · Yan Xia · Shaoguang Mao · Li Dong · Ivan Vulić · Furu Wei
Abstract:
Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Yet, it struggles in complex spatial reasoning tasks. In contrast, human cognition extends beyond language alone, enabling the remarkable capability to think in both words and images. Inspired by this mechanism, we propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces. To ensure high-quality visualization, we introduce a token discrepancy loss into autoregressive MLLMs. This innovation significantly improves both visual coherence and fidelity. We validate this approach through several dynamic spatial reasoning tasks. Experimental results reveal that MVoT demonstrates competitive performance across tasks. Moreover, it exhibits robust and reliable improvements in the most challenging scenarios where CoT fails. Ultimately, MVoT establishes new possibilities for complex reasoning tasks where visual thinking can effectively complement verbal reasoning.
Paperid:3005
Authors:Zizheng Huang · Haoxing Chen · Jiaqi Li · jun lan · Huijia Zhu · Weiqiang Wang · Limin Wang
Abstract:
Recent Vision Mamba (Vim) models exhibit nearly linear complexity in sequence length, making them highly attractive for processing visual data. However, their training methodologies and potential are still not sufficiently explored. In this paper, we investigate training strategies for Vim and propose Stochastic Layer-Wise Shuffle (SLWS), a novel regularization method that can effectively improve Vim training. Without architectural modifications, this approach enables the non-hierarchical Vim to achieve leading performance on ImageNet-1K compared with counterparts of a similar type. Our method operates through four simple steps per layer: probability allocation to assign layer-dependent shuffle rates, operation sampling via Bernoulli trials, sequence shuffling of input tokens, and order restoration of outputs. SLWS distinguishes itself through three principles: (1) Plug-and-play: no architectural modifications are needed, and it is deactivated during inference. (2) Simple but effective: the four-step process introduces only random permutations and negligible overhead. (3) Intuitive design: shuffling probabilities grow linearly with layer depth, aligning with the hierarchical semantic abstraction in vision models. Our work underscores the importance of tailored training strategies for Vim models and provides a helpful way to explore their scalability.
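The four per-layer steps can be sketched directly; `max_prob` is an assumed hyperparameter of this sketch, and the layer is reduced to pure token bookkeeping (the real method shuffles the inputs to a Mamba block):

```python
import random

def slws_layer(tokens, layer_idx, num_layers, max_prob=0.5, rng=None):
    """One Stochastic Layer-Wise Shuffle step (sketch). Returns the
    (possibly shuffled) tokens plus the inverse permutation needed to
    restore the original order of the layer's outputs."""
    rng = rng or random.Random(0)
    # (1) probability allocation: shuffle rate grows linearly with depth
    p = max_prob * (layer_idx + 1) / num_layers
    # (2) operation sampling via a Bernoulli trial
    if rng.random() >= p:
        return list(tokens), list(range(len(tokens)))   # identity this time
    # (3) sequence shuffling of the input tokens
    perm = list(range(len(tokens)))
    rng.shuffle(perm)
    shuffled = [tokens[i] for i in perm]
    # (4) order restoration: inverse permutation applied to layer outputs
    inverse = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inverse[old_pos] = new_pos
    return shuffled, inverse

toks = list("abcdef")
shuf, inv = slws_layer(toks, layer_idx=11, num_layers=12, rng=random.Random(7))
restored = [shuf[inv[i]] for i in range(len(toks))]
```

Applying the stored inverse permutation always recovers the original order, which is why the method can be deactivated at inference with no architectural change.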
Paperid:3006
Authors:Xin Cheng · Jiabo Ye · Haiyang Xu · Ming Yan · Ji Zhang · Feng Liu · Fei Huang · Lei Feng
Abstract:
Endowing large language models (LLMs) with continual learning (CL) capacities is practically important, as it enables them to dynamically acquire new knowledge over time. Although many effective methods have been proposed for CL of LLMs, they share a common limitation: information leakage (IL), where the task-related information of learned tasks is accessed or reused again. IL not only imposes potential risks on data privacy protection but also significantly hinders the deployment of LLMs in real-world scenarios. To avoid IL while maintaining outstanding CL performance, we propose a novel CL method for LLMs, which first characterizes a parameter-efficient fine-tuning (PEFT) block by a representative feature distribution, and then dynamically selects the appropriate PEFT blocks for each instance based on its similarity with the representative feature distributions. Extensive experiments validate the effectiveness of our method on the CL of LLMs, showcasing its potential to enhance both privacy and adaptability in practical applications.
Paperid:3007
Authors:Zhonglin Cao · Mario Geiger · Allan Costa · Danny Reidenbach · Karsten Kreis · Tomas Geffner · Franco Pellegrini · Guoqing Zhou · Emine Kucukbenli
Abstract:
Fast and accurate generation of molecular conformers is desired for downstream computational chemistry and drug discovery tasks. Currently, training and sampling state-of-the-art diffusion/flow based models for conformer generation require significant computational expense. In this work, we build upon flow-matching and propose two mechanisms for accelerating training and inference of 3D molecular conformer generation. For fast training, we introduce the SO(3)-Averaged Flow training objective, which leads to faster convergence to better generation quality compared to conditional optimal transport flow or Kabsch-aligned flow. For fast inference, we show that reflow methods and distillation of flow-based models enable few-step or even one-step molecular conformer generation with high quality. We demonstrate that a model trained using these two techniques can match the performance of strong baselines with only a fraction of the number of parameters and generation steps. The training techniques proposed in this work show a path towards highly efficient molecular conformer generation with flow-based models.
Paperid:3008
Authors:Junlin Han · Jianyuan Wang · Andrea Vedaldi · Phil Torr · Filippos Kokkinos
Abstract:
Generating high-quality 3D content from text, single images, or sparse view images remains a challenging task with broad applications. Existing methods typically employ multi-view diffusion models to synthesize multi-view images, followed by a feed-forward process for 3D reconstruction. However, these approaches are often constrained by a small and fixed number of input views, limiting their ability to capture diverse viewpoints and, even worse, leading to suboptimal generation results if the synthesized views are of poor quality. To address these limitations, we propose Flex3D, a novel two-stage framework capable of leveraging an arbitrary number of high-quality input views. The first stage consists of a candidate view generation and curation pipeline. In the second stage, the curated views are fed into a Flexible Reconstruction Model (FlexRM), built upon a transformer architecture that can effectively process an arbitrary number of inputs. Through extensive exploration of design and training strategies, we optimize FlexRM to achieve superior performance in both reconstruction and generation tasks. Our results demonstrate that Flex3D achieves state-of-the-art performance, with a user study winning rate of over 92% in 3D generation tasks when compared to several of the latest feed-forward 3D generative models.
Paperid:3009
Authors:Jie Liu · Pan Zhou · Zehao Xiao · Jiayi Shen · Wenzhe Yin · Jan-jakob Sonke · Efstratios Gavves
Abstract:
Interactive 3D segmentation has emerged as a promising solution for generating accurate object masks in complex 3D scenes by incorporating user-provided clicks. However, two critical challenges remain underexplored: (1) effectively generalizing from sparse user clicks to produce accurate segmentations and (2) quantifying predictive uncertainty to help users identify unreliable regions. In this work, we propose \emph{NPISeg3D}, a novel probabilistic framework that builds upon Neural Processes (NPs) to address these challenges. Specifically, NPISeg3D introduces a hierarchical latent variable structure with scene-specific and object-specific latent variables to enhance few-shot generalization by capturing both global context and object-specific characteristics. Additionally, we design a probabilistic prototype modulator that adaptively modulates click prototypes with object-specific latent variables, improving the model’s ability to capture object-aware context and quantify predictive uncertainty. Experiments on four 3D point cloud datasets demonstrate that NPISeg3D achieves superior segmentation performance with fewer clicks while providing reliable uncertainty estimations.
Paperid:3010
Authors:Feng Wu · Tsai Hor Chan · Fuying Wang · Guosheng Yin · Lequan Yu
Abstract:
Various data modalities are common in real-world applications (e.g., EHR, medical images, and clinical notes in healthcare). Thus, it is essential to develop multimodal learning methods to aggregate information from multiple modalities. The main challenge is appropriately aligning and fusing the representations of different modalities into a joint distribution. Existing methods mainly rely on concatenation or the Kronecker product, oversimplifying the interaction structure between modalities and indicating a need to model more complex interactions. Additionally, the joint distribution of latent representations with higher-order interactions is underexplored. The copula is a powerful statistical structure for modelling the interactions between variables, as it bridges the joint distribution and marginal distributions of multiple variables. In this paper, we propose a novel copula-modelling-driven multimodal learning framework, which focuses on learning the joint distribution of various modalities to capture the complex interactions among them. The key idea is to interpret the copula model as a tool to align the marginal distributions of the modalities efficiently. By assuming a Gaussian mixture distribution for each modality and a copula model on the joint distribution, our model can also generate accurate representations for missing modalities. Extensive experiments on public MIMIC datasets demonstrate the superior performance of our model over other competitors. The code is anonymously available at https://anonymous.4open.science/r/CM2-C1FD/README.md.
Paperid:3011
Authors:Zhe Zhang · Mingxiu Cai · Hanxiao Wang · Gaochang Wu · Tianyou Chai · Xiatian Zhu
Abstract:
Unsupervised anomaly detection (UAD) seeks to localize the anomaly mask of an input image with respect to normal samples. Either by reconstructing normal counterparts (reconstruction-based) or by learning an image feature embedding space (embedding-based), existing approaches fundamentally rely on image-level or feature-level matching to derive anomaly scores. Often, such a matching process is inaccurate yet overlooked, leading to sub-optimal detection. To address this issue, we introduce the concept of cost filtering, borrowed from classical matching tasks, such as depth and flow estimation, into the UAD problem. We call this approach CostFilter-AD. Specifically, we first construct a matching cost volume between the input and normal samples, comprising two spatial dimensions and one matching dimension that encodes potential matches. To refine this, we propose a cost volume filtering network, guided by the input observation as an attention query across multiple feature layers, which effectively suppresses matching noise while preserving edge structures and capturing subtle anomalies. Designed as a generic post-processing plug-in, CostFilter-AD can be integrated with either reconstruction-based or embedding-based methods. Extensive experiments on MVTec-AD and VisA benchmarks validate the generic benefits of CostFilter-AD for both single- and multi-class UAD tasks. Code and models will be released.
Paperid:3012
Authors:Chen Xu
Abstract:
We propose a novel method, namely Gaussian Smoothing with a Power-Transformed Objective (GSPTO), that solves global optimization problems in two steps: (1) perform an (exponential) power-$N$ transformation of the not necessarily differentiable objective $f:\mathbb{R}^d\rightarrow \mathbb{R}$ to get $f_N$, and (2) optimize the Gaussian-smoothed $f_N$ with stochastic approximations. Under mild conditions on $f$, for any $\delta>0$, we prove that with a sufficiently large power $N_\delta$, this method converges to a solution in the $\delta$-neighborhood of $f$'s global optimum point, at the iteration complexity of $O(d^4\varepsilon^{-2})$. If we require that $f$ is differentiable and further assume the Lipschitz condition on $f$ and its gradient, the iteration complexity reduces to $O(d^2\varepsilon^{-2})$, which is significantly faster than the standard homotopy method. In most of the experiments performed, our method produces better solutions than other algorithms that also apply the smoothing technique.
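The two steps can be sketched in one dimension with a standard Gaussian-smoothing gradient estimator applied to the exponentially transformed objective. All hyperparameters below are illustrative, and the self-normalization of the estimator is a stabilization choice of this sketch, not necessarily the paper's exact update:

```python
import math
import random

def gsp_step(theta, f, N=8.0, sigma=0.5, lr=0.05, samples=64, rng=None):
    """One stochastic-approximation ascent step on the Gaussian smoothing
    of the power-transformed objective f_N(x) = exp(N * f(x)). Uses the
    smoothing estimator E[f_N(theta + sigma*u) * u] / sigma with
    u ~ N(0,1), self-normalized by E[f_N(theta + sigma*u)] for stability."""
    rng = rng or random.Random(0)
    num = den = 0.0
    for _ in range(samples):
        u = rng.gauss(0.0, 1.0)
        w = math.exp(N * f(theta + sigma * u))   # power-transformed value
        num += w * u
        den += w
    return theta + lr * (num / den) / sigma

# toy non-convex-free case: 1-D objective with global maximum at x = 1
f = lambda x: -(x - 1.0) ** 2
theta = -2.0
rng = random.Random(42)
for _ in range(300):
    theta = gsp_step(theta, f, rng=rng)
```

The exponential transform concentrates the smoothed objective's mass near the global optimum, so the iterates drift from the poor initialization toward x = 1 without ever evaluating a gradient of f.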
Paperid:3013
Authors:Yuhao Sun · Jiacheng Zhang · Zesheng Ye · Chaowei Xiao · Feng Liu
Abstract:
*Diffusion-based purification* (DBP) methods aim to remove adversarial noise from the input sample by first injecting Gaussian noise through a forward diffusion process, and then recovering the clean example through a reverse generative process. In the above process, how much Gaussian noise is injected into the input sample is key to the success of DBP methods, and is controlled by a constant noise level $t^*$ for all samples in existing methods. In this paper, we discover that the optimal $t^*$ for each sample can indeed be different. Intuitively, the cleaner a sample is, the less noise should be injected into it, and vice versa. Motivated by this finding, we propose a new framework, called Sample-specific Score-aware Noise Injection (SSNI). Specifically, SSNI uses a pre-trained score network to estimate how much a data point deviates from the clean data distribution (i.e., score norms). Then, based on the magnitude of the score norms, SSNI applies a reweighting function to adaptively adjust $t^*$ for each sample, achieving sample-specific noise injection. Empirically, incorporating our framework with existing DBP methods results in a notable improvement in both accuracy and robustness on CIFAR-10 and ImageNet-1K, highlighting the necessity of allocating *distinct noise levels to different samples* in DBP methods. Our code is available at: https://anonymous.4open.science/r/SSNI-F746.
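The reweighting step can be sketched as a simple map from a sample's estimated score norm to a per-sample noise level. The linear form and every bound below are illustrative assumptions; the paper's actual reweighting function may differ:

```python
def ssni_noise_level(score_norm, t_min=20, t_max=200,
                     norm_lo=1.0, norm_hi=10.0):
    """Sample-specific noise level in the spirit of SSNI: a sample whose
    score norm says it sits close to the clean data manifold gets a small
    diffusion noise level t*, while a heavily perturbed sample gets a
    large one. Linear interpolation between illustrative bounds."""
    frac = (score_norm - norm_lo) / (norm_hi - norm_lo)
    frac = min(max(frac, 0.0), 1.0)    # clamp to the calibrated range
    return round(t_min + frac * (t_max - t_min))
```

Replacing a single constant `t*` with this per-sample value is the entire interface change SSNI asks of an existing DBP pipeline.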
Paperid:3014
Authors:Qi Yang · Le Yang · Geert Van der Auwera · Zhu Li
Abstract:
Most existing 3D Gaussian Splatting (3DGS) compression schemes focus on producing compact 3DGS representations via implicit data embedding. They have long coding times and highly customized data formats, making widespread deployment difficult. This paper presents a new 3DGS compression framework called HybridGS, which takes advantage of both compact generation and standardized point cloud data encoding. HybridGS first generates compact and explicit 3DGS data. A dual-channel sparse representation is introduced to supervise the primitive position and feature bit depth. It then utilizes a canonical point cloud encoder to perform further data compression and form standard output bitstreams. A simple and effective rate control scheme is proposed to pivot the interpretable data compression scheme. At the current stage, HybridGS does not include any modules aimed at improving 3DGS quality during generation. However, experimental results show that it still provides comparable reconstruction performance against state-of-the-art methods, with evidently higher encoding and decoding speed.
Paperid:3015
Authors:Quan Nguyen · Nishant Mehta · Cristóbal Guzmán
Abstract:
The minimax sample complexity of group distributionally robust optimization (GDRO) has been determined up to a $\log(K)$ factor, where $K$ is the number of groups. In this work, we venture beyond the minimax perspective via a novel notion of sparsity that we call $(\lambda, \beta)$-sparsity. In short, this condition means that at any parameter $\theta$, there is a set of at most $\beta$ groups whose risks at $\theta$ are all at least $\lambda$ larger than the risks of the other groups. To find an $\epsilon$-optimal $\theta$, we show via a novel algorithm and analysis that the $\epsilon$-dependent term in the sample complexity can swap a linear dependence on $K$ for a linear dependence on the potentially much smaller $\beta$. This improvement leverages recent progress in sleeping bandits, showing a fundamental connection between the two-player zero-sum game optimization framework for GDRO and per-action regret bounds in sleeping bandits. We next show an adaptive algorithm which, up to logarithmic factors, obtains a sample complexity bound that adapts to the best $(\lambda, \beta)$-sparsity condition that holds. We also show how to obtain a dimension-free semi-adaptive sample complexity bound with a computationally efficient method. Finally, we demonstrate the practicality of the $(\lambda, \beta)$-sparsity condition and the improved sample efficiency of our algorithms on both synthetic and real-life datasets.
Paperid:3016
Authors:Yurun Yuan · Tengyang Xie
Abstract:
Leveraging more test-time computation has proven to be an effective way to boost the reasoning capabilities of large language models (LLMs). Among various methods, the verify-and-improve paradigm has the advantage of allowing LLMs to dynamically explore solution spaces and incorporate external feedback during inference. However, existing approaches often face limitations like restricted feedback spaces and the absence of dedicated training processes to optimize the collaboration of different parties, leading to suboptimal performance. To address these challenges, we propose DPSDP (Direct Policy Search by Dynamic Programming), a reinforcement learning algorithm that trains a multi-agent LLM system composed of an actor and a critic to iteratively refine responses for reasoning tasks through direct preference learning on self-generated data. We formulate the multi-turn refinement process as a Markov Decision Process and derive a practical algorithm that generalizes to out-of-distribution horizons at test time. In theory, we demonstrate that DPSDP can achieve performance equivalent to any comparator policy covered by the training distribution. We instantiate DPSDP with various base models and evaluate its effectiveness on both in-distribution and out-of-distribution benchmarks. Our experimental results highlight the effectiveness of our method. Specifically, by taking five answer attempts on problems from MATH, Ministral-based models improve their first-turn accuracy from 58.2% to 63.2%, and Llama-3.1-based models improve from 55.8% to 58.4%. We also conduct a comprehensive ablation study, demonstrating the generalizability of out-of-distribution refinement iterations and the advantage over a single-agent LLM.
Paperid:3017
Authors:Xintong Sun · Chi Wei · Minghao Tian · Shiwen Ni
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities, yet ensuring their outputs conform to strict structural or grammatical constraints remains challenging, which is critical in function calling and domain-specific language (DSL) generation. Constrained decoding with a context-free grammar is a flexible approach to guarantee LLMs' adherence to a specific format by dynamically building a token logits mask. However, creating this mask requires checking the validity of every token in the LLM vocabulary at each decoding step, which often incurs significant overhead in existing constrained decoding engines. To address this challenge, we propose $\textbf{ZapFormat}$, a novel $\textbf{dynamic pruning}$ strategy based on the Earley algorithm that identifies and eliminates invalid or redundant Earley states in real time, significantly reducing the memory occupation of the Earley algorithm's states. This further enables us to use a state cache to speed up structured generation across a large number of queries. We implemented ZapFormat in a new constrained decoding engine called Formatron, which also incorporates existing optimizations. Through comprehensive experiments on structured generation tasks, including JSON generation, JSON Schema validation, and semantic parsing, we demonstrate that Formatron not only $\textbf{consistently maintains}$ high-precision compliant outputs but also achieves $\textbf{significant improvements}$ in inference speed of up to 2x compared to state-of-the-art implementations. More importantly, Formatron is generally applicable across various LLM architectures.
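The logits-mask idea behind constrained decoding can be sketched as follows; the toy balanced-parenthesis grammar below stands in for the context-free grammars (and Earley-based checking) the abstract describes, and the function names are illustrative, not Formatron's API.

```python
import math

def build_logits_mask(vocab, is_valid_prefix, prefix):
    """Constrained decoding: at each step, allow only tokens whose
    addition keeps the output a valid prefix under the grammar.
    `is_valid_prefix` stands in for a grammar recognizer (e.g., an
    Earley parser); masked tokens get -inf added to their logits.
    """
    return [0.0 if is_valid_prefix(prefix + tok) else -math.inf
            for tok in vocab]

def balanced_paren_prefix(s):
    """Toy grammar check: is `s` a prefix of some balanced-paren string?"""
    depth = 0
    for c in s:
        depth += 1 if c == "(" else -1
        if depth < 0:
            return False
    return True
```

With an empty prefix, only `"("` survives the mask; after `"("`, both tokens are allowed. Note that the naive version checks every vocabulary token each step, which is exactly the per-step cost the paper's state pruning and caching aim to reduce.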
Paperid:3018
Authors:Haoyang Li · Xin Wang · Zeyang Zhang · Zongyuan Wu · Linxin Xiao · Wenwu Zhu
Abstract:
Self-supervised learning (SSL) on graph-structured data has attracted considerable attention recently. The masked graph autoencoder, a promising generative graph SSL approach that aims to recover masked parts of the input graph data, has shown great success in various downstream graph tasks. However, existing masked graph autoencoders fail to consider that masked edges vary in how difficult they are to recover, and that this difficulty often affects model performance differently, resulting in suboptimal node representations. To tackle this challenge, in this paper we propose a novel curriculum-based self-supervised masked graph autoencoder that captures and leverages the underlying degrees of difficulty of the data dependencies hidden in edges, and designs better mask-reconstruction pretext tasks for learning informative node representations. Specifically, we first design a difficulty measurer to identify the underlying structural degree of difficulty of edges during the masking step. Then, we adopt a self-paced scheduler to determine the order of masking edges, which encourages the graph encoder to learn from easy parts to difficult parts. Finally, the masked edges are gradually incorporated into the reconstruction pretext task, leading to high-quality node representations. Experiments on several real-world node classification and link prediction datasets demonstrate the superiority of our proposed method over state-of-the-art graph self-supervised learning baselines. To the best of our knowledge, this is the first study of curriculum strategies for masked graph autoencoders.
Paperid:3019
Authors:Yunfei Huang · David S. Greenberg
Abstract:
Neural PDE surrogates can improve the cost-accuracy tradeoff of classical solvers, but often generalize poorly to new initial conditions and accumulate errors over time. Physical and symmetry constraints have shown promise in closing this performance gap, but existing techniques for imposing these inductive biases are incompatible with the staggered grids commonly used in computational fluid dynamics. Here we introduce novel input and output layers that respect physical laws and symmetries on the staggered grids, and for the first time systematically investigate how these constraints, individually and in combination, affect the accuracy of PDE surrogates. We focus on two challenging problems: shallow water equations with closed boundaries and decaying incompressible turbulence. Compared to strong baseline surrogates, both types of constraints consistently improve performance across tasks, architectures, autoregressive prediction steps, accuracy measures, and network sizes. Symmetries are more effective, but do not eliminate the need for physical constraints. Doubly-constrained surrogates outperform baselines for the same network and dataset sizes, providing greater improvements than symmetry-based data augmentation or pushforward training, while also benefiting from the pushforward trick. Doubly-constrained surrogates also generalize better to initial conditions and durations beyond the range of the training data.
Paperid:3020
Authors:Zichong Wang · Wenbin Zhang
Abstract:
Graph generation models have shown significant potential across various domains. However, despite their success, these models often inherit societal biases, limiting their adoption in real-world applications. Existing research on fairness in graph generation primarily addresses structural bias, overlooking the critical issue of feature bias. To address this gap, we propose FDGen, a novel approach that defines and mitigates both feature and structural biases in graph generation models. Furthermore, we provide a theoretical analysis of how bias sources in graph data contribute to disparities in graph generation tasks. Experimental results on four real-world datasets demonstrate that FDGen outperforms state-of-the-art methods, achieving notable improvements in fairness while maintaining competitive generation performance.
Paperid:3021
Authors:Qixin Zhang · Wei Huang · Can Jin · Puning Zhao · Yao Shu · Li Shen · Dacheng Tao
Abstract:
Identifying the most representative subset for a close-to-submodular objective while satisfying a predefined partition constraint is a fundamental task with numerous applications in machine learning. However, the existing distorted local-search methods are often hindered by their prohibitive query complexities and the rigid requirement for prior knowledge of difficult-to-obtain structural parameters. To overcome these limitations, we introduce a novel algorithm titled Multinoulli-SCG, which not only is parameter-free, but also achieves the same approximation guarantees as the distorted local-search methods with significantly fewer function evaluations. More specifically, when the objective function is monotone $\alpha$-weakly DR-submodular or $(\gamma,\beta)$-weakly submodular, our Multinoulli-SCG algorithm can attain a value of $(1-e^{-\alpha})\text{OPT}-\epsilon$ or $(\frac{\gamma^{2}(1-e^{-(\beta(1-\gamma)+\gamma^2)})}{\beta(1-\gamma)+\gamma^2})\text{OPT}-\epsilon$ with only $O(1/\epsilon^{2})$ function evaluations, where OPT denotes the optimal value. The core of our \textbf{Multinoulli-SCG} algorithm is an innovative continuous-relaxation framework named Multinoulli Extension (ME), which can effectively convert the discrete subset selection problem subject to partition constraints into a solvable continuous maximization focused on learning the optimal multinoulli priors across the considered partition. In sharp contrast with the well-established multi-linear extension for submodular subset selection, a notable advantage of our proposed ME is its intrinsic capacity to provide a lossless rounding scheme for any set function. Finally, we validate the practical efficacy of our proposed algorithms by applying them to video summarization, Bayesian A-optimal design, and coverage maximization.
Paperid:3022
Authors:Edoardo Cetin · Tianyu Zhao · Yujin Tang
Abstract:
We propose a new finetuning method to provide pretrained large language models (LMs) with the ability to scale test-time compute through the diffusion framework. By increasing the number of diffusion steps, we show our finetuned models achieve monotonically increasing accuracy, directly translating to improved performance across downstream tasks. Furthermore, our finetuned models can expertly answer questions on specific topics by integrating powerful guidance techniques, and autonomously determine the compute required for a given problem by leveraging adaptive ODE solvers. Our method is applicable to any foundation model pre-trained with cross-entropy and does not modify any of its original weights, fully preserving its strong single-step generation capabilities. We show our method can be more effective than, and is fully compatible with, traditional finetuning and search approaches, introducing an orthogonal new direction to unify the strengths of the autoregressive and diffusion frameworks.
Paperid:3023
Authors:Hongyi Wan · Shiyuan Ren · Wei Huang · Miao Zhang · Xiang Deng · Yixin Bao · Liqiang Nie
Abstract:
Continual learning (CL) is crucial for advancing human-level intelligence, but its theoretical understanding, especially regarding factors influencing forgetting, is still relatively limited. This work aims to build a unified theoretical framework for understanding CL using feature learning theory. Different from most existing studies that analyze forgetting under a linear regression model or lazy training, we focus on a more practical two-layer convolutional neural network (CNN) with polynomial ReLU activation for sequential tasks within a signal-noise data model. Specifically, we theoretically reveal how the angle between task signal vectors influences forgetting: acute or small obtuse angles lead to benign forgetting, whereas larger obtuse angles result in harmful forgetting. Furthermore, we demonstrate that the replay method alleviates forgetting by expanding the range of angles corresponding to benign forgetting. Our theoretical results suggest that mid-angle sampling, which selects examples with moderate angles to the prototype, can enhance the replay method's ability to mitigate forgetting. Experiments on synthetic and real-world datasets confirm our theoretical results and highlight the effectiveness of our mid-angle sampling strategy.
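The quantity driving the result, the angle between two task signal vectors, is straightforward to compute; this sketch only illustrates the measured quantity, not the paper's analysis or its benign/harmful cutoff.

```python
import numpy as np

def signal_angle_deg(v1, v2):
    """Angle in degrees between two task signal vectors.

    Per the abstract, acute or small obtuse angles correspond to
    benign forgetting, while larger obtuse angles correspond to
    harmful forgetting (the exact threshold comes from the theory).
    """
    v1, v2 = np.asarray(v1, float), np.asarray(v2, float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # Clip to guard against floating-point drift outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
```

For instance, orthogonal task signals give an angle of 90 degrees, an acute case under the paper's taxonomy.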
Paperid:3024
Authors:Xuehang Guo · Xingyao Wang · Yangyi Chen · Sha Li · Chi Han · Manling Li · Heng Ji
Abstract:
Software engineering (SE) is increasingly collaborative, with developers working together on shared, complex codebases. Effective collaboration in shared environments requires participants, whether humans or AI agents, to stay on the same page as their environment evolves. When a collaborator's understanding diverges from the current state, what we term the out-of-sync challenge, the collaborator's actions may fail. This occurs frequently in SE when developers unknowingly work with outdated codebase versions, leading to failed builds and integration issues. In this work, we introduce SyncMind, a framework that systematically defines the out-of-sync problem faced by large language model (LLM) agents in collaborative software engineering (CSE). Based on SyncMind, we create SyncBench, a benchmark featuring 24,332 instances of agent out-of-sync scenarios in real-world CSE derived from 21 popular GitHub repositories with executable verification tests. Experiments on SyncBench uncover critical insights into existing LLM agents' capabilities and limitations. Besides a substantial performance gap among agents (from ≤ 3.33% for Llama-3.1 agents to ≥ 28.18% for Claude-3.5-Sonnet), their consistently low collaborative initiative (≤ 4.86%) suggests fundamental limitations of existing LLMs in CSE. However, when collaboration occurs, it positively correlates with out-of-sync recovery success. Minimal performance differences in agents' resource-aware out-of-sync recoveries further reveal their significant lack of resource awareness and adaptability, shedding light on future resource-efficient collaborative systems.
Paperid:3025
Authors:Chenyang Zhang · Xiaoyu Zhang · Jian Lou · KAI WU · Zilong Wang · Xiaofeng Chen
Abstract:
Multimodal Retrieval-Augmented Generation (MuRAG) systems have been widely applied to Large Vision-Language Models (LVLMs) to enhance their generation ability. However, the reliance on external multimodal knowledge databases renders MuRAG systems vulnerable to malicious poisoning attacks. In this paper, we introduce PoisonedEye, the first knowledge poisoning attack designed for MuRAG systems. Our attack manipulates the response of the MuRAG system to a target query by injecting only one poison sample into the knowledge database. To construct the poison sample, we identify two key properties of the retrieval and generation process and derive a solution that satisfies both. Besides, we also introduce a class-query-targeted poisoning attack, a more generalized strategy that extends the poisoning effect to an entire class of target queries. Extensive experiments on multiple query datasets, retrievers, and LVLMs demonstrate that our attack is highly effective at compromising MuRAG systems.
Paperid:3026
Authors:Jiaqi Wang · Jingtao Li · Weiming Zhuang · Chen Chen · Lingjuan Lyu · Fenglong Ma
Abstract:
Vision foundation models (FMs) like CLIP have demonstrated exceptional capabilities in visual and linguistic understanding, particularly in zero-shot inference tasks within web-based applications. However, these models struggle when encountering data that deviates significantly from their training samples, necessitating fine-tuning. Traditional fine-tuning approaches are often impractical in centralized settings due to the privacy-sensitive nature of data across web platforms and WoT devices. Federated learning (FL), combined with parameter-efficient fine-tuning (PEFT), offers a potential solution, but current methods face challenges with domain-specific characteristics and out-of-domain generalization in distributed web and WoT systems. To address these challenges, we propose Federated Adapter Generalization (FedAG), a cross-silo federated fine-tuning approach that leverages multiple fine-grained adapters to capture domain-specific knowledge while enhancing out-of-domain generalization. Our method integrates quality-aware in-domain mutual learning and attention-regularized cross-domain learning to effectively handle domain shifts. Extensive experiments with the CLIP model on domain-shifting datasets (ImageCLEF-DA, Office-Home, and DomainNet) showcase the superior performance of FedAG in both in-domain and out-of-domain scenarios. We envision this work as a significant step toward generalizing CLIP for out-of-domain tasks, ensuring privacy-preserving learning and adaptation in web-based applications.
Paperid:3027
Authors:Zehong Ma · Shiliang Zhang · Longhui Wei · Qi Tian
Abstract:
Traditional approaches to adapting multimodal large language models (MLLMs) to new tasks have relied heavily on fine-tuning. This paper introduces Efficient Multi-Modal Long Context Learning (EMLoC), a novel training-free alternative that embeds demonstration examples directly into the model input. EMLoC offers a more efficient, flexible, and scalable solution for task adaptation. Because extremely lengthy inputs introduce prohibitive computational and memory overhead, EMLoC contributes a chunk-wise compression mechanism combined with layer-wise adaptive pruning. It condenses long-context multimodal inputs into compact, task-specific memory representations. By adaptively pruning tokens at each layer under a Jensen-Shannon divergence constraint, our method achieves a dramatic reduction in inference complexity without sacrificing performance. This approach is the first to seamlessly integrate compression and pruning techniques for multi-modal long-context learning, offering a scalable and efficient solution for real-world applications. Extensive experiments on diverse vision-language benchmarks demonstrate that EMLoC achieves performance on par with or superior to naive long-context approaches. Our results highlight the potential of EMLoC as a groundbreaking framework for efficient and flexible adaptation of multi-modal models in resource-constrained environments.
Paperid:3028
Authors:Heyang Zhao · Chenlu Ye · Wei Xiong · Quanquan Gu · Tong Zhang
Abstract:
Recent advances in Reinforcement Learning from Human Feedback (RLHF) have shown that KL-regularization plays a pivotal role in improving the efficiency of RL fine-tuning for large language models (LLMs). Despite its empirical advantage, the theoretical difference between KL-regularized RL and standard RL remains largely under-explored. While there is a recent line of work on the theoretical analysis of KL-regularized objectives in decision making \citep{xiong2024iterative, xie2024exploratory, zhao2024sharp}, these analyses either reduce to the traditional RL setting or rely on strong coverage assumptions. In this paper, we propose an optimism-based KL-regularized online contextual bandit algorithm, and provide a novel analysis of its regret. By carefully leveraging the benign optimization landscape induced by the KL-regularization and the optimistic reward estimation, our algorithm achieves an $\mathcal{O}\big(\eta\log (N_{\mathcal R} T)\cdot d_{\mathcal R}\big)$ logarithmic regret bound, where $\eta, N_{\mathcal R}, T, d_{\mathcal R}$ denote the KL-regularization parameter, the cardinality of the reward function class, the number of rounds, and the complexity of the reward function class, respectively. Furthermore, we extend our algorithm and analysis to reinforcement learning by developing a novel decomposition over transition steps, and obtain a similar logarithmic regret bound.
Paperid:3029
Authors:Xingyi Yang · Constantin Venhoff · Ashkan Khakzar · Christian Schroeder de Witt · Puneet Dokania · Adel Bibi · Phil Torr
Abstract:
Neurons in large language models often exhibit \emph{polysemanticity}, simultaneously encoding multiple unrelated concepts and obscuring interpretability. Instead of relying on post-hoc methods, we present \textbf{MoE-X}, a mixture-of-experts (MoE) language model designed to be \emph{intrinsically} interpretable. Our approach is motivated by the observation that, in language models, wider networks with sparse activations are more likely to capture interpretable factors. However, directly training such large sparse networks is computationally prohibitive. MoE architectures offer a scalable alternative by activating only a subset of experts for any given input, inherently aligning with interpretability objectives. In MoE-X, we establish this connection by rewriting the MoE layer as an equivalent sparse, large MLP. This approach enables efficient scaling of the hidden size while maintaining sparsity. To further enhance interpretability, we enforce sparse activation within each expert and redesign the routing mechanism to prioritize experts with the highest activation sparsity. These designs ensure that only the most salient features are routed to and processed by the experts. We evaluate MoE-X on chess and natural language tasks, showing that it achieves performance comparable to dense models while significantly improving interpretability. MoE-X achieves a perplexity better than GPT-2, with interpretability surpassing even sparse autoencoder (SAE)-based approaches.
Paperid:3030
Authors:Yu-Zhe Shi · Mingchen Liu · Hanlu Ma · Qiao Xu · Huamin Qu · Kun He · Lecheng Ruan · Qining Wang
Abstract:
Industrial designers have long sought a natural and intuitive way to achieve targeted control of prototype models: using simple natural language instructions to configure and adjust the models seamlessly according to their intentions, without relying on complex modeling commands. While Large Language Models have shown promise in this area, their potential for controlling prototype models through language remains partially underutilized. This limitation stems from gaps between designers' languages and modeling languages, including mismatches in abstraction levels, fluctuations in semantic precision, and divergences in lexical scope. To bridge these gaps, we propose an interface architecture that serves as a medium between the two languages. Grounded in design principles derived from a systematic investigation of fast prototyping practices, we devise the interface's operational mechanism and develop an algorithm for its automated domain specification. Both machine-based evaluations and human studies on fast prototyping across various product design domains demonstrate the interface's potential to function as an auxiliary module for Large Language Models, enabling precise and effective targeted control of prototype models.
Paperid:3031
Authors:Zikui Cai · Yaoteng Tan · M. Salman Asif
Abstract:
Machine unlearning methods aim to remove sensitive or unwanted content from trained models, but typically demand extensive model updates at significant computational cost while potentially degrading model performance on both related and unrelated tasks. We propose Single Layer Unlearning Gradient (SLUG) as an efficient method to unlearn targeted information by updating a single critical layer using a one-time gradient computation. SLUG uses layer importance and gradient alignment metrics to identify the optimal layer for targeted information removal while preserving the model utility. We demonstrate the effectiveness of SLUG for CLIP, Stable Diffusion, and vision-language models in removing concrete (e.g., identities and objects) and abstract concepts (e.g., artistic styles). On the UnlearnCanvas benchmark, SLUG achieves comparable unlearning performance to existing methods while requiring significantly less computational resources. Our proposed approach offers a practical solution for targeted unlearning that is computationally efficient and precise. Our code is available at https://anonymous.4open.science/r/SLUG-6CDF.
Paperid:3032
Authors:Bartosz Cywiński · Kamil Deja
Abstract:
Diffusion models, while powerful, can inadvertently generate harmful or undesirable content, raising significant ethical and safety concerns. Recent machine unlearning approaches offer potential solutions but often lack transparency, making it difficult to understand the changes they introduce to the base model. In this work, we introduce SAeUron, a novel method leveraging features learned by sparse autoencoders (SAEs) to remove unwanted concepts in text-to-image diffusion models. First, we demonstrate that SAEs, trained in an unsupervised manner on activations from multiple denoising timesteps of the diffusion model, capture sparse and interpretable features corresponding to specific concepts. Building on this, we propose a feature selection method that enables precise interventions on model activations to block targeted content while preserving overall performance. Evaluation on the competitive UnlearnCanvas benchmark for object and style unlearning highlights SAeUron's state-of-the-art performance. Moreover, we show that with a single SAE, we can remove multiple concepts simultaneously, and that, in contrast to other methods, SAeUron mitigates the possibility of generating unwanted content even under adversarial attack.
Paperid:3033
Authors:Su Hyeong Lee · Manzil Zaheer · Tian Li
Abstract:
Distributed optimization has become the default training paradigm in modern machine learning due to the growing scale of models and datasets. To mitigate communication overhead, local updates are often applied before global aggregation, resulting in a nested optimization approach with inner and outer steps. However, heavy-tailed stochastic gradient noise remains a significant challenge, particularly in attention-based models, hindering effective training. In this work, we propose TailOPT, an efficient framework designed to address heavy-tailed noise by leveraging clipping techniques or adaptive optimization. We establish convergence guarantees for the TailOPT framework under heavy-tailed noise with potentially unbounded gradient variance and local updates. Among its variants, we highlight $Bi^2Clip$, a scalable algorithm performing coordinate-wise clipping at both the inner and outer optimization layers, achieving adaptive-like performance (e.g., Adam) without the cost of maintaining additional gradient statistics. Empirically, TailOPT, including $Bi^2Clip$, demonstrates superior performance on several language tasks and models, outperforming state-of-the-art methods.
Paperid:3034
Authors:Tingle Li · Baihe Huang · Xiaobin Zhuang · Dongya Jia · Jiawei Chen · Yuping Wang · Zhuo Chen · Yuxuan Wang · Gopala Anumanchipalli
Abstract:
Generating accurate sounds for complex audiovisual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an interactive object-aware audio generation model that grounds sound generation in user-selected visual objects within images. Our method integrates object-centric learning into a conditional latent diffusion model, which learns to associate image regions with their corresponding sounds through multi-modal attention. At test time, our model employs image segmentation to allow users to interactively generate sounds at the object level. We theoretically validate that our attention mechanism functionally approximates test-time segmentation masks, ensuring the generated audio aligns with selected objects. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds.
Paperid:3035
Authors:Yihang Wang · Yuying Qiu · Peng Chen · Kai Zhao · Yang Shu · Zhongwen Rao · Lujia Pan · Bin Yang · Chenjuan Guo
Abstract:
With the growing availability of multi-domain time series data, there is an increasing demand for general forecasting models pre-trained on multi-source datasets to support diverse downstream prediction scenarios. Existing time series foundation models primarily focus on scaling up pre-training datasets and model sizes to enhance generalization performance. In this paper, we take a different approach by addressing two critical aspects of general forecasting models: (1) how to derive unified representations from heterogeneous multi-domain time series data, and (2) how to effectively capture domain-specific features to enable adaptive transfer across various downstream scenarios. To address the first aspect, we propose Decomposed Frequency Learning as the pre-training task, which leverages frequency-based masking and reconstruction to decompose coupled semantic information in time series, resulting in unified representations across domains. For the second aspect, we introduce the Time Series Register, which captures domain-specific representations during pre-training and enhances adaptive transferability to downstream tasks. Our model achieves state-of-the-art forecasting performance on seven real-world benchmarks, demonstrating remarkable few-shot and zero-shot capabilities.
Paperid:3036
Authors:Xiang Hu · Zhihao Teng · Jun Zhao · Wei Wu · Kewei Tu
Abstract:
Despite the success of Transformers, handling longer contexts remains challenging due to the limited length generalization and quadratic complexity of self-attention, which often requires post-training with a larger attention window, significantly increasing computational and memory costs. In this paper, we propose a novel attention mechanism based on dynamic context, Grouped Cross Attention (GCA), which can generalize to 1000$\times$ the pre-training context length while maintaining the ability to access distant information with a constant attention window size. For a given input sequence, we split it into chunks and use each chunk to retrieve the top-$k$ relevant past chunks for subsequent text generation. Specifically, unlike most previous works that use an off-the-shelf retriever, our key innovation allows the retriever to learn, in an end-to-end manner, how to retrieve past chunks that better minimize the auto-regressive loss of subsequent tokens, which adapts better to causal language models. Such a mechanism accommodates retrieved chunks with a fixed-size attention window to achieve long-range information access, significantly reducing computational and memory costs during training and inference. Experiments show that GCA-based models achieve near-perfect accuracy in passkey retrieval for 16M context lengths, which is $1000\times$ the training length.
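The retrieval step can be sketched in a few lines; cosine scoring over chunk embeddings here is an illustrative stand-in for the learned retriever that the paper trains end-to-end against the auto-regressive loss.

```python
import numpy as np

def retrieve_topk_chunks(query_emb, past_chunk_embs, k=2):
    """Toy sketch of GCA-style chunk retrieval.

    Score every past chunk embedding against the current chunk and
    keep only the top-k, so cross-attention sees a constant-size
    window regardless of how long the context grows. Cosine
    similarity is an illustrative choice, not the paper's retriever.
    """
    q = np.asarray(query_emb, dtype=float)
    P = np.asarray(past_chunk_embs, dtype=float)
    q = q / np.linalg.norm(q)
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    scores = P @ q
    # Indices of the k highest-scoring past chunks, in chunk order.
    return sorted(np.argsort(scores)[::-1][:k].tolist())
```

Because only $k$ chunks enter the attention window, the per-step attention cost stays constant even as the number of past chunks grows, which is the source of the constant-window property the abstract describes.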
Paperid:3037
Authors:Cristobal Eyzaguirre · Igor Vasiljevic · Achal Dave · Jiajun Wu · Rareș Ambruș · Thomas Kollar · Juan Carlos Niebles · Pavel Tokmakov
Abstract:
We propose a data-driven approach to analyzing query complexity in Video Question Answering (VideoQA). Previous efforts in benchmark design have relied on human expertise to design challenging questions, yet we experimentally show that humans struggle to predict which questions are difficult for machine learning models. Our automatic approach leverages recent advances in code generation for visual question answering, using the complexity of generated code as a proxy for question difficulty. We demonstrate that this measure correlates significantly better with model performance than human estimates. To operationalize this insight, we propose an algorithm for estimating question complexity from code. It identifies fine-grained primitives that correlate with the hardest questions for any given set of models, making it easy to scale to new approaches in the future. Finally, to further illustrate the utility of our method, we extend it to automatically generate complex questions, constructing a new benchmark that is 1.9 times harder than the popular NExT-QA.
Paperid:3038
Authors:Songlin Zhai · Yuan Meng · Yongrui Chen · Yiwei Wang · Guilin Qi
Abstract:
Large Language Models (LLMs) have revolutionized various natural language processing tasks with their remarkable capabilities. However, challenges persist in effectively integrating new knowledge into LLMs without compromising their performance, particularly in the areas of long-term memory management and knowledge updates. In this paper, we introduce a novel memory augmentation framework that conceptualizes memory as a peripheral device (akin to physical RAM), with the LLM serving as the information processor (analogous to a CPU). Drawing inspiration from RAM architecture, we design memory as a sequence of memory banks, each modeled using Kolmogorov-Arnold Networks (KAN) to ensure smooth state transitions. Memory read and write operations are dynamically controlled by query signals derived from the LLMs' internal states, closely mimicking the interaction between a CPU and RAM. Furthermore, a dedicated memory bank is used to generate a mask value that indicates the relevance of the retrieved data, analogous to sign bits in binary coding schemes. The retrieved memory data is then integrated as a prefix to enhance subsequent downstream tasks. Extensive experiments on knowledge-based model editing and long-context question answering validate the effectiveness of the proposed method.
Paperid:3039
Authors:Haoming Yang · Ke Ma · Xiaojun Jia · Yingfei Sun · Qianqian Xu · Qingming Huang
Abstract:
Despite the remarkable performance of Large Language Models (LLMs), they remain vulnerable to jailbreak attacks, which can compromise their safety mechanisms. Existing studies often rely on brute-force optimization or manual design, failing to uncover potential risks in real-world scenarios. To address this, we propose a novel jailbreak attack framework, ICRT, inspired by heuristics and biases in human cognition. Leveraging the simplicity effect, we employ cognitive decomposition to reduce the complexity of malicious prompts. Simultaneously, relevance bias is utilized to reorganize prompts, enhancing semantic alignment and inducing harmful outputs effectively. Furthermore, we introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm by employing ranking aggregation methods such as Elo, HodgeRank, and Rank Centrality to comprehensively quantify the harmfulness of generated content. Experimental results demonstrate that our approach consistently bypasses the safety mechanisms of mainstream LLMs and generates actionable high-risk content. This framework provides detailed insights into the potential risks of jailbreak attacks and contributes to developing more robust defense strategies.
Paperid:3040
Authors:Aaron Havens · Benjamin Kurt Miller · Bing Yan · Carles Domingo i Enrich · Anuroop Sriram · Daniel S. Levine · Brandon Wood · Bin Hu · Brandon Amos · Brian Karrer · Xiang Fu · Guan-Horng Liu · Ricky T. Q. Chen
Abstract:
We introduce Adjoint Sampling, a highly scalable and efficient algorithm for learning diffusion processes that sample from unnormalized densities, or energy functions. It is the first of its kind in allowing significantly more gradient updates than the number of energy evaluations and model samples, allowing us to scale to much larger problem settings than previously explored by similar methods. Our framework is theoretically grounded in stochastic optimal control and provably converges without the need for corrective measures that push samples towards the target distribution such as sequential Monte Carlo. We show how to incorporate key symmetries, as well as periodic boundary conditions, for modeling molecules in both Cartesian and torsional representations. We demonstrate the effectiveness of our approach through extensive experiments on classical energy functions, and further scale up to neural network-based energy models where we perform amortized conformer generation across many molecular systems. To encourage research in developing highly-scalable sampling methods, we open source these challenging benchmarks, where successful methods can directly impact progress in computational chemistry.
Paperid:3041
Authors:MingCai Chen · Baoming Zhang · Zongbo Han · Yuntao Du · Wenyu Jiang · Yanmeng Wang · Shuai Feng · Bingkun BAO
Abstract:
Modern machine learning applications are characterized by the increasing size of deep models and the growing diversity of data modalities. This trend underscores the importance of efficiently adapting pretrained multi-modal models to the test distribution in real time, i.e., multi-modal test-time adaptation. In practice, the magnitudes of multi-modal shifts vary because multiple data sources interact with the impact factor in diverse manners. In this research, we define a new practical scenario as uni-modal distribution shift, where the distribution shift influences only one modality, leaving the others unchanged. Through theoretical and empirical analyses, we demonstrate that the presence of such shift impedes multi-modal fusion and leads to the negative transfer phenomenon in existing test-time adaptation techniques. To flexibly combat this unique shift, we propose a selective adaptation schema that incorporates multiple modality-specific adapters to accommodate potential shifts and a "router" module that determines which modality requires adaptation. Finally, we validate the effectiveness of our proposed method through extensive experimental evaluations.
Paperid:3042
Authors:Rudy Morel · Jiequn Han · Edouard Oyallon
Abstract:
We address the problem of predicting the next state of a dynamical system governed by unknown temporal partial differential equations (PDEs) using only a short trajectory. While standard transformers provide a natural black-box solution to this task, the presence of a well-structured evolution operator in the data suggests a more tailored and efficient approach. Specifically, when the PDE is fully known, classical numerical solvers can evolve the state accurately with only a few parameters. Building on this observation, we introduce DISCO, a model that uses a large hypernetwork to process a short trajectory and generate the parameters of a much smaller operator network, which then predicts the next state through time integration. Our framework decouples dynamics estimation—i.e., DISCovering an evolution operator from a short trajectory—from state prediction—i.e., evolving this operator. Experiments show that pretraining our model on diverse physics datasets achieves state-of-the-art performance while requiring significantly fewer epochs. Moreover, it generalizes well and remains competitive when fine-tuned on downstream tasks.
Paperid:3043
Authors:Yong Liang Goh · Zhiguang Cao · Yining Ma · Jianan Zhou · Mohammed Haroon Dupty · Wee Sun Lee
Abstract:
Recent advances toward foundation models for routing problems have shown great potential of a unified deep model for various VRP variants. However, they overlook the complex real-world customer distributions. In this work, we advance the Multi-Task VRP (MTVRP) setting to the more realistic yet challenging Multi-Task Multi-Distribution VRP (MTMDVRP) setting, and introduce SHIELD, a novel model that leverages both sparsity and hierarchy principles. Building on a deeper decoder architecture, we first incorporate the Mixture-of-Depths (MoD) technique to enforce sparsity. This improves both efficiency and generalization by allowing the model to dynamically select nodes to use or skip each decoder layer, providing the needed capacity to adaptively allocate computation for learning the task/distribution specific and shared representations. We also develop a context-based clustering layer that exploits the presence of hierarchical structures in the problems to produce better local representations. These two designs inductively bias the network to identify key features that are common across tasks and distributions, leading to significantly improved generalization on unseen ones. Our empirical results demonstrate the superiority of our approach over existing methods on 9 real-world maps with 16 VRP variants each.
Paperid:3044
Authors:Erez Peterfreund · Ofir Lindenbaum · Yuval Kluger · Boris Landa
Abstract:
Embedding and visualization techniques are essential for analyzing high-dimensional data, but they often struggle with complex data governed by multiple latent variables, potentially distorting key structural characteristics. This paper considers scenarios where the observed features can be partitioned into mutually exclusive subsets, each capturing a different smooth substructure. In such cases, visualizing the data based on each feature partition can better characterize the underlying processes and structures in the data, leading to improved interpretability. To partition the features, we propose solving an optimization problem that promotes a graph Laplacian-based smoothness in each partition, thus prioritizing partitions with simpler geometric structures. Our approach generalizes traditional embedding and visualization techniques, allowing them to learn multiple embeddings simultaneously. We establish that if several independent or partially dependent manifolds are embedded in distinct feature subsets in high-dimensional space, then our framework can reliably identify the correct subsets with theoretical guarantees. Finally, we demonstrate the effectiveness of our approach in extracting multiple low-dimensional structures and partially independent processes from both simulated and real data.
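The graph Laplacian-based smoothness that the partition objective promotes is the quadratic form $\sum_f f^\top L f$ over feature columns $f$. The sketch below is our own minimal rendering of that quantity, not the paper's optimization problem:

```python
import numpy as np

def laplacian_smoothness(X: np.ndarray, W: np.ndarray) -> float:
    """Sum of f^T L f over the feature columns f of X, where L = D - W is
    the unnormalized graph Laplacian of the sample-affinity matrix W.
    Lower values mean the features vary more smoothly over the graph."""
    L = np.diag(W.sum(axis=1)) - W
    return float(np.trace(X.T @ L @ X))
```

A partition objective would then prefer feature subsets whose columns jointly score low on a graph built from those same features, favoring simpler geometric structure per subset.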
Paperid:3045
Authors:Weihao Tan · Wentao Zhang · Xinrun Xu · Haochong Xia · gang Ding · Boyu Li · Bohan Zhou · Junpeng Yue · Jiechuan Jiang · Yewen Li · Ruyi An · Molei Qin · Chuqiao Zong · Longtao Zheng · YuJie Wu · Xiaoqiang Chai · Yifei Bi · Tianbao Xie · Pengjie Gu · Xiyun Li · Ceyao Zhang · Long Tian · Chaojie Wang · Xinrun Wang · Börje F. Karlsson · Bo An · Shuicheng YAN · Zongqing Lu
Abstract:
Despite their success in specific scenarios, existing foundation agents still struggle to generalize across various virtual scenarios, mainly due to the dramatically different encapsulations of environments with manually designed observation and action spaces. To handle this issue, we propose the General Computer Control (GCC) setting to restrict foundation agents to interact with software through the most unified and standardized interface, i.e., using screenshots as input and keyboard and mouse actions as output. We introduce Cradle, a modular and flexible LMM-powered framework, as a preliminary attempt towards GCC. Enhanced by six key modules, Information Gathering, Self-Reflection, Task Inference, Skill Curation, Action Planning, and Memory, Cradle is able to understand input screenshots and output executable code for low-level keyboard and mouse control after high-level planning and information retrieval, so that Cradle can interact with any software and complete long-horizon complex tasks without relying on any built-in APIs. Experimental results show that Cradle exhibits remarkable generalizability and impressive performance across four previously unexplored commercial video games (Red Dead Redemption 2, Cities: Skylines, Stardew Valley and Dealer's Life 2), five software applications (Chrome, Outlook, Feishu, Meitu and CapCut), and a comprehensive benchmark, OSWorld. With a unified interface to interact with any software, Cradle greatly extends the reach of foundation agents, thus paving the way for generalist agents.
Paperid:3046
Authors:Shijie Li · Weiqiang He · Ruobing Bai · Pan Peng
Abstract:
Hierarchical clustering is a widely used method for unsupervised learning with numerous applications. However, in the application of modern algorithms, the datasets studied are usually large and dynamic. If a hierarchical clustering algorithm is sensitive to small perturbations of the dataset, its usability will be greatly reduced. In this paper, we focus on the hierarchical $k$-median clustering problem, which bridges hierarchical and centroid-based clustering while offering theoretical appeal, practical utility, and improved interpretability. We analyze the average sensitivity of algorithms for this problem by measuring the expected change in the output when a random data point is deleted. We propose an efficient algorithm for hierarchical $k$-median clustering and theoretically prove its low average sensitivity and high clustering quality. Additionally, we show that single linkage clustering and a deterministic variant of the CLNSS algorithm exhibit high average sensitivity, making them less stable. Finally, we validate the robustness and effectiveness of our algorithm through experiments.
Paperid:3047
Authors:Weiwei Li · Haitao Wu · Yunan Lu · Xiuyi Jia
Abstract:
Label distribution learning (LDL) is a powerful learning paradigm that emulates label polysemy by assigning label distributions over the label space. However, existing LDL evaluation metrics struggle to capture meaningful performance differences due to their insensitivity to subtle distributional changes, and existing LDL learning objectives often exhibit biases by disproportionately emphasizing a small subset of samples with extreme predictions. As a result, the LDL metrics lose their discriminability, and the LDL objectives are also at risk of overfitting. In this paper, we propose DeltaLDL, a percentage of predictions that are approximately correct within the context of LDL, as a solution to the above problems. DeltaLDL can serve as a novel evaluation metric, which is parameter-free and reflects more on real performance improvements. DeltaLDL can also serve as a novel learning objective, which is differentiable and encourages most samples to be predicted as approximately correct, thereby mitigating overfitting. Our theoretical analysis and empirical results demonstrate the effectiveness of the proposed solution.
Paperid:3048
Authors:Haitong Ma · Tianyi Chen · Kai Wang · Na Li · Bo Dai
Abstract:
Diffusion policies have achieved superior performance in imitation learning and offline reinforcement learning (RL) due to their rich expressiveness. However, the vanilla diffusion training procedure requires samples from the target distribution, which is impossible in online RL since we cannot sample from the optimal policy, making training diffusion policies highly nontrivial in online RL. Backpropagating the policy gradient through the diffusion process incurs huge computational costs and instability, making it expensive and impractical. To enable efficient diffusion policy training for online RL, we propose Soft Diffusion Actor-Critic (SDAC), exploiting the viewpoint of diffusion models as noise-perturbed energy-based models. The proposed SDAC relies solely on the state-action value function as the energy function to train diffusion policies, bypassing sampling from the optimal policy while maintaining lightweight computations. We conducted comprehensive comparisons on MuJoCo benchmarks. The empirical results show that SDAC outperforms all recent online RL methods with diffusion policies on most tasks, and improves more than 120\% over soft actor-critic on complex locomotion tasks such as Humanoid and Ant.
Paperid:3049
Authors:Dun Ma · Jianguo Chen · Wenguo Yang · Suixiang Gao · Shengminjie Chen
Abstract:
In recent years, the pursuit of higher expressive power in graph neural networks (GNNs) has often led to more complex aggregation mechanisms and deeper architectures. To address these issues, we have identified redundant structures in GNNs, and by pruning them, we propose pruned variants of MP-GNNs, K-Path GNNs, and K-Hop GNNs based on their original architectures. We show that 1) Although some structures are pruned in Pruned MP-GNNs and Pruned K-Path GNNs, their expressive power has not been compromised. 2) K-Hop MP-GNNs and their pruned architecture exhibit equivalent expressiveness on regular and strongly regular graphs. 3) The complexity of pruned K-Path GNNs and pruned K-Hop GNNs is lower than that of MP-GNNs, yet their expressive power is higher. Experimental results validate our refinements, demonstrating competitive performance across benchmark datasets with improved efficiency.
Paperid:3050
Authors:Kevin Galim · Wonjun Kang · Yuchen Zeng · HYUNG IL KOO · Kangwook Lee
Abstract:
Deep State Space Models (SSMs), such as Mamba (Gu & Dao, 2024), have become powerful tools for language modeling, offering high performance and linear scalability with sequence length. However, the application of parameterefficient fine-tuning (PEFT) methods to SSM-based models remains underexplored. We start by investigating two fundamental questions on existing PEFT methods: (i) How do they perform on SSM-based models? (ii) Which parameters should they target for optimal results? Our analysis shows that LoRA and its variants consistently outperform all other PEFT methods. While LoRA is effective for linear projection matrices, it fails on SSM modules—yet still outperforms other methods applicable to SSMs, indicating their limitations. This underscores the need for a specialized SSM tuning approach. To address this, we propose SDT, a PEFT method tailored for SSM modules. Combining SDT for SSMs with LoRA for linear projection matrices, we achieve state-of-the-art performance across extensive experiments.
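For context, LoRA (which the abstract finds effective for linear projection matrices but not for SSM modules) freezes a weight $W$ and learns a low-rank update $W + \frac{\alpha}{r}BA$. The sketch below is a generic LoRA illustration with our own initialization scale; it is not the proposed SDT method:

```python
import numpy as np

def lora_init(d_out: int, d_in: int, r: int, rng: np.random.Generator):
    """A is Gaussian-initialized and B zero-initialized, so the update
    B @ A starts at zero and training begins from the frozen model."""
    A = rng.normal(0.0, 0.02, size=(r, d_in))
    B = np.zeros((d_out, r))
    return A, B

def lora_forward(x, W, A, B, alpha: float, r: int):
    """Forward pass through the adapted layer: x @ (W + (alpha/r) B A)^T."""
    return x @ (W + (alpha / r) * (B @ A)).T
```

Only $A$ and $B$ are trained, so the number of tunable parameters scales with $r(d_{\text{in}} + d_{\text{out}})$ rather than $d_{\text{in}} d_{\text{out}}$.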
Paperid:3051
Authors:Ming Li · Yukang Cheng · Lu Bai · Feilong Cao · Ke Lv · Jiye Liang · Pietro Lió
Abstract:
The growing demand for personalized learning underscores the importance of accurately predicting students' future performance to support tailored education and optimize instructional strategies. Traditional approaches predominantly focus on temporal modeling using historical response records and learning trajectories. While effective, these methods often fall short in capturing the intricate interactions between students and learning content, as well as the subtle semantics of these interactions. To address these gaps, we present EduLLM, the first framework to leverage large language models in combination with hypergraph learning for student performance prediction. The framework incorporates FraS-HNN ($\underline{\mbox{Fra}}$melet-based $\underline{\mbox{S}}$igned $\underline{\mbox{H}}$ypergraph $\underline{\mbox{N}}$eural $\underline{\mbox{N}}$etworks), a novel spectral-based model for signed hypergraph learning, designed to model interactions between students and multiple-choice questions. In this setup, students and questions are represented as nodes, while response records are encoded as positive and negative signed hyperedges, effectively capturing both structural and semantic intricacies of personalized learning behaviors. FraS-HNN employs framelet-based low-pass and high-pass filters to extract multi-frequency features. EduLLM integrates fine-grained semantic features derived from LLMs, synergizing with signed hypergraph representations to enhance prediction accuracy. Extensive experiments conducted on multiple educational datasets demonstrate that EduLLM significantly outperforms state-of-the-art baselines, validating the novel integration of LLMs with FraS-HNN for signed hypergraph learning.
Paperid:3052
Authors:Xinyu Guan · Li Lyna Zhang · Yifei Liu · Ning Shang · Youran Sun · Yi Zhu · Fan Yang · Mao Yang
Abstract:
We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising ``deep thinking'' through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids na\"ive step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8\% to 90.0\%, surpassing o1-preview by +4.5\%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3\% (8/15) of problems, ranking among the top 20\% of the brightest high school math students.
Paperid:3053
Authors:Roman Plaud · Alexandre Perez-Lebel · Matthieu Labeau · Antoine Saillenfest · Thomas Bonald
Abstract:
Hierarchical classification offers an approach to incorporate the concept of mistake severity by leveraging a structured, labeled hierarchy. However, decoding in such settings frequently relies on heuristic decision rules, which may not align with task-specific evaluation metrics. In this work, we propose a framework for the optimal decoding of an output probability distribution with respect to a target metric. We derive optimal decision rules for increasingly complex prediction settings, providing universal algorithms when candidates are limited to the set of nodes. In the most general case of predicting a *subset of nodes*, we focus on rules dedicated to the hierarchical $\mathrm{hF}_{\beta}$ scores, tailored to hierarchical settings. To demonstrate the practical utility of our approach, we conduct extensive empirical evaluations, showcasing the superiority of our proposed optimal strategies, particularly in underdetermined scenarios. These results highlight the potential of our methods to enhance the performance and reliability of hierarchical classifiers in real-world applications.
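The hierarchical $\mathrm{hF}_{\beta}$ score in question is conventionally computed on ancestor-augmented node sets. Below is a minimal sketch of that standard definition (the paper's contribution is the optimal decoding rules for this metric, not the formula itself):

```python
def h_fbeta(pred_aug: set, true_aug: set, beta: float = 1.0) -> float:
    """Hierarchical F_beta: precision and recall are computed on the
    predicted and true label sets augmented with all their ancestors
    in the hierarchy, so mistakes near the true label cost less."""
    inter = len(pred_aug & true_aug)
    if inter == 0:
        return 0.0
    hp = inter / len(pred_aug)   # hierarchical precision
    hr = inter / len(true_aug)   # hierarchical recall
    b2 = beta * beta
    return (1 + b2) * hp * hr / (b2 * hp + hr)
```

Because both sets include ancestors, predicting a sibling of the true leaf still shares the common ancestors and scores above zero, which is how the metric encodes mistake severity.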
Paperid:3054
Authors:Noam Levi
Abstract:
Neural scaling laws have garnered significant interest due to their ability to predict model performance as a function of increasing parameters, data, and compute. In this work, we propose a simple statistical ansatz based on memorization to study scaling laws in the context of inference, specifically how performance improves with multiple inference attempts. We explore the coverage, or pass@k metric, which measures the chance of success over repeated attempts and provide a motivation for the observed functional form of the inference scaling behavior of the coverage in large language models (LLMs) on reasoning tasks. We then define an "inference loss", which exhibits a power law decay as the number of trials increases, and connect this result with prompting costs. We further test the universality of our construction by conducting experiments on a simple generative model, and find that our predictions are in agreement with the empirical coverage curves in a controlled setting. Our simple framework sets the ground for incorporating inference scaling with other known scaling laws.
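The coverage metric discussed above is usually estimated with the standard unbiased pass@k estimator over $n$ sampled attempts with $c$ successes, shown here for reference rather than as the paper's construction:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts drawn without
    replacement from n total attempts (c of them correct) succeeds:
    1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so success is guaranteed
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

Plotting this estimate against $k$ yields the empirical coverage curves whose functional form the ansatz above seeks to explain.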
Paperid:3055
Authors:Xiuheng Wang · Ricardo Borsoi · Cédric Richard · Ali Sayed
Abstract:
Online distributed optimization is particularly useful for solving optimization problems with streaming data collected by multiple agents over a network. When the solutions lie on a Riemannian manifold, such problems become challenging to solve, particularly when efficiency and continuous adaptation are required. This work tackles these challenges and devises a diffusion adaptation strategy for decentralized optimization over general manifolds. A theoretical analysis shows that the proposed algorithm is able to approach network agreement after sufficient iterations, which allows a non-asymptotic convergence result to be derived. We apply the algorithm to the online decentralized principal component analysis problem and Gaussian mixture model inference. Experimental results with both synthetic and real data illustrate its performance.
Paperid:3056
Authors:Sepehr Elahi · Paula Mürmann · Patrick Thiran
Abstract:
The Susceptible-Infected-Susceptible (SIS) model is a widely used model for the spread of information and infectious diseases, particularly non-immunizing ones, on a graph. Given a highly contagious disease, a natural question is how to best vaccinate individuals to minimize the disease's extinction time. While previous works showed that the problem of optimal vaccination is closely linked to the NP-hard Spectral Radius Minimization (SRM) problem, they assumed that the graph is known, which is often not the case in practice. In this work, we consider the problem of minimizing the extinction time of an outbreak modeled by an SIS model where the graph on which the disease spreads is unknown and only the infection states of the vertices are observed. To this end, we split the problem into two: learning the graph and determining effective vaccination strategies. We propose a novel inclusion-exclusion-based learning algorithm and, unlike previous approaches, establish its sample complexity for graph recovery. We then detail an optimal algorithm for the SRM problem and prove that its running time is polynomial in the number of vertices for graphs with bounded treewidth. This is complemented by an efficient and effective polynomial-time greedy heuristic for any graph. Finally, we present experiments that numerically validate our learning and vaccination algorithms.
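Since the SIS epidemic threshold scales with the spectral radius of the adjacency matrix, one simple greedy heuristic removes (vaccinates) the vertex whose deletion shrinks the radius most. The sketch below only illustrates this idea; the specifics of the paper's optimal SRM algorithm and its heuristic are not reproduced here:

```python
import numpy as np

def spectral_radius(adj: np.ndarray) -> float:
    """Largest eigenvalue magnitude of the adjacency matrix."""
    return float(max(abs(np.linalg.eigvals(adj))))

def greedy_vaccinate(adj: np.ndarray, budget: int):
    """Greedily remove `budget` vertices, each time picking the one whose
    removal yields the smallest spectral radius of the remaining graph."""
    A = adj.astype(float).copy()
    removed = []
    for _ in range(budget):
        best_v, best_r = None, None
        for v in range(A.shape[0]):
            if v in removed:
                continue
            B = A.copy()
            B[v, :] = 0.0  # vaccinating v disconnects all its edges
            B[:, v] = 0.0
            r = spectral_radius(B)
            if best_r is None or r < best_r:
                best_v, best_r = v, r
        removed.append(best_v)
        A[best_v, :] = 0.0
        A[:, best_v] = 0.0
    return removed, spectral_radius(A)
```

On a star graph, for example, the heuristic correctly vaccinates the hub, collapsing the spectral radius to zero with a single removal.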
Paperid:3057
Authors:Arnob Ghosh · Mehrdad Moharrami
Abstract:
We consider the setup where the agent seeks to maximize the overall expected reward subject to the constraint that the entropic risk associated with the total utility is above a certain threshold. First, we observe that the traditional primal-dual-based approach might not work to obtain the regret and violation bound, unlike the risk-neutral setup, as the value iteration-based approach with respect to the combined state-action value function does not provide the optimal solution. We thus consider an augmented CMDP where we augment the state space with a budget. We then optimize the initial budget with appropriate utility functions to obtain the entropic risk measure for the utility. We then propose a primal-dual-based algorithm for the augmented problem. Compared to the standard primal-dual-based approach for the risk-neutral CMDP setting, we also introduce a regularized dual update, as strong duality does not hold for risk-sensitive constraints unlike the risk-neutral constraint. We show that our proposed algorithm achieves $\tilde{\mathcal{O}}(\sqrt{|S|^2AH^4K})$ regret and $\tilde{\mathcal{O}}\left(\dfrac{\exp(|\alpha|H)-1}{|\alpha|H}\sqrt{|S|^2AH^4}K^{3/4}\right)$ violation where $|S|$ is the cardinality of the state-space, $H$ is the length of each episode, $K$ is the number of episodes, $\alpha<0$ is the risk-aversion parameter. *To the best of our knowledge this is the first sublinear regret and violation result for the risk-sensitive CMDP problem.*
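For reference, the entropic risk of a utility $U$ at risk parameter $\alpha$ is $\frac{1}{\alpha}\log \mathbb{E}\left[e^{\alpha U}\right]$; for $\alpha < 0$ it penalizes variability and lower-bounds the mean. A sample-based sketch of this definition (illustration only, not the paper's augmented-CMDP algorithm):

```python
import math

def entropic_risk(utilities, alpha: float) -> float:
    """(1/alpha) * log of the empirical mean of exp(alpha * U).
    With alpha < 0 (risk-averse), spread in the utilities pulls the
    value below the plain sample mean; for a constant U it equals U."""
    n = len(utilities)
    return math.log(sum(math.exp(alpha * u) for u in utilities) / n) / alpha
```

The exponential dependence on $\alpha H$ in the violation bound above reflects the same $e^{\alpha U}$ term appearing inside the risk measure.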
Paperid:3058
Authors:Ke Kaiqiang · qian lin · Zongkai Liu · Shenghong He · Chao Yu
Abstract:
Offline goal-conditioned reinforcement learning (GCRL) learns a goal-conditioned value function to train policies for diverse goals with pre-collected datasets. Hindsight experience replay addresses the issue of sparse rewards by treating intermediate states as goals but fails to complete goal-stitching tasks where achieving goals requires stitching different trajectories. While cross-trajectory sampling is a potential solution that associates states and goals belonging to different trajectories, we demonstrate that this direct method degrades performance in goal-conditioned tasks due to the overestimation of values on unconnected pairs. To this end, we propose Conservative Goal-Conditioned Implicit Value Learning (CGCIVL), a novel algorithm that introduces a penalty term to penalize value estimation for unconnected state-goal pairs and leverages the quasimetric framework to accurately estimate values for connected pairs. Evaluations on OGBench, a benchmark for offline GCRL, demonstrate that CGCIVL consistently surpasses state-of-the-art methods across diverse tasks.
Paperid:3059
Authors:ji deng · Zhao Li · Ji Zhang · Jun Gao
Abstract:
Macro placement, which involves optimizing the positions of modules, is a critical phase in modern integrated circuit design and significantly influences chip performance. The growing complexity of integrated circuits demands increasingly sophisticated placement solutions. Existing approaches have evolved along two primary paths (i.e., constructive and adjustment methods), but they face significant practical limitations that affect real-world chip design. Recent hybrid frameworks such as WireMask-EA have attempted to combine these strategies, but significant technical barriers still remain, including the computational overhead from separated layout adjustment and reconstruction that often require complete layout rebuilding, the inefficient exploration of design spaces due to random mutation operations, and the computational complexity of mask-based construction methods that limit scalability. To overcome these limitations, we introduce EGPlace, a novel evolutionary optimization framework that combines guided mutation strategies with efficient layout reconstruction. EGPlace introduces two key innovations: a greedy repositioning-guided mutation operator that systematically identifies and optimizes critical layout regions, and an efficient mask computation algorithm that accelerates layout evaluation. Our extensive evaluation using ISPD2005 and Ariane RISC-V CPU benchmarks demonstrates that EGPlace reduces wirelength by 10.8\% and 9.3\% compared to WireMask-EA and the state-of-the-art reinforcement learning-based constructive method EfficientPlace, respectively, while achieving speedups of 7.8$\times$ and 2.8$\times$ over these methods.
Paperid:3060
Authors:Sen Peng · Mingyue Wang · Jianfei He · Jijia Yang · Xiaohua Jia
Abstract:
Latent diffusion models have recently demonstrated superior capabilities in many downstream image synthesis tasks. However, customization of latent diffusion models using unauthorized data can severely compromise the privacy and intellectual property rights of data owners. Adversarial examples as protective perturbations have been developed to defend against unauthorized data usage by introducing imperceptible noise to customization samples, preventing diffusion models from effectively learning them. In this paper, we first reveal that the primary reason adversarial examples are effective as protective perturbations in latent diffusion models is the distortion of their latent representations, as demonstrated through qualitative and quantitative experiments. We then propose Contrastive Adversarial Training (CAT) utilizing adapters as an adaptive attack against these protection methods, highlighting their lack of robustness. Extensive experiments demonstrate that our CAT method significantly reduces the effectiveness of protective perturbations in customization configurations, urging the community to reconsider and enhance the robustness of existing protective perturbation methods.
Paperid:3061
Authors:Ce Zhang · Yixin Han · Yafei Wang · Xiaodong Yan · Linglong Kong · Ting Li · Bei Jiang
Abstract:
Randomized response (RR) mechanisms constitute a fundamental and effective technique for ensuring label differential privacy (LabelDP). However, existing RR methods primarily focus on the response labels while overlooking the influence of covariates and often do not fully address optimality. To address these challenges, this paper explores optimal LabelDP procedures using RR mechanisms, focusing on achieving optimal estimation and inference in binary response models. We first analyze the asymptotic behaviors of RR binary response models and then optimize the procedure by maximizing the trace of the Fisher Information Matrix within the $\varepsilon$- and $(\varepsilon,\delta)$-LabelDP constraints. Our theoretical results indicate that the proposed methods achieve optimal LabelDP guarantees while maintaining statistical accuracy in binary response models under mild conditions. Furthermore, we develop private confidence intervals with nominal coverage for statistical inference. Extensive simulation studies and real-world applications confirm that our methods outperform existing approaches in terms of precise estimation, privacy protection, and reliable inference.
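The classical binary randomized response mechanism underlying LabelDP keeps the true label with probability $e^{\varepsilon}/(1+e^{\varepsilon})$ and flips it otherwise, and downstream estimates can be debiased accordingly. A standard textbook sketch (the paper's covariate-aware, Fisher-information-optimal design goes beyond this):

```python
import math
import random

def rr_label(y: int, epsilon: float, rng: random.Random) -> int:
    """epsilon-LabelDP randomized response for a binary label:
    keep y with probability e^eps / (1 + e^eps), flip otherwise."""
    p_keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return y if rng.random() < p_keep else 1 - y

def debias_mean(noisy_mean: float, epsilon: float) -> float:
    """Invert E[noisy] = (2p - 1) * m + (1 - p) to recover the
    true label mean m from the mean of the randomized labels."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return (noisy_mean - (1.0 - p)) / (2.0 * p - 1.0)
```

Smaller $\varepsilon$ pushes the keep probability toward $1/2$, giving stronger privacy at the cost of a larger debiasing variance.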
Paperid:3062
Authors:Hong-Ming Chiu · Hao Chen · Huan Zhang · Richard Zhang
Abstract:
Neural network verifiers based on linear bound propagation scale impressively to massive models but can be surprisingly loose when neuron coupling is crucial. Conversely, semidefinite programming (SDP) verifiers capture inter-neuron coupling naturally, but their cubic complexity restricts them to only small models. In this paper, we propose SDP-CROWN, a novel hybrid verification framework that combines the tightness of SDP relaxations with the scalability of bound-propagation verifiers. At the core of SDP-CROWN is a new linear bound---derived via SDP principles---that explicitly captures $\ell_{2}$-norm-based inter-neuron coupling while adding only one extra parameter per layer. This bound can be integrated seamlessly into any linear bound-propagation pipeline, preserving the inherent scalability of such methods yet significantly improving tightness. In theory, we prove that our inter-neuron bound can be up to a factor of $\sqrt{n}$ tighter than traditional per-neuron bounds. In practice, when incorporated into the state-of-the-art $\alpha$-CROWN verifier, we observe markedly improved verification performance on large models with up to 65 thousand neurons and 2.47 million parameters, achieving tightness that approaches that of costly SDP-based methods.
Paperid:3063
Authors:Wenjie Wu · Dexuan Huo · Hong Chen
Abstract:
Spiking Neural Networks (SNNs) have demonstrated remarkable potential across many domains, including computer vision and natural language processing, owing to their energy efficiency and biological plausibility. However, their application in long-term prediction tasks remains underexplored, which is primarily due to two critical challenges: (1) current SNN encoding methods are unable to effectively encode long temporal information, leading to increased computational complexity and energy consumption; (2) though Transformer-based models have achieved state-of-the-art accuracy in temporal prediction tasks, the absence of proper positional encoding for spiking self-attention restricts Spiking Transformer from effectively utilizing positional information, resulting in performance degradation. To address these challenges, we introduce an attention-free framework, **Spik**ing **F**ourier Network (**SpikF**), that encodes input sequences in patches and employs an innovative frequency domain selection mechanism to effectively utilize the sequential properties of time-series data. Extensive evaluations on eight well-established long-term prediction datasets demonstrate that SpikF achieves an average $1.9\%$ reduction in Mean Absolute Error (MAE) compared to state-of-the-art models, while lowering energy consumption by $75.05\%$.
Paperid:3064
Authors:Thomas Pouplin · Katarzyna Kobalczyk · Hao Sun · M van der Schaar
Abstract:
Developing autonomous agents capable of performing complex, multi-step decision-making tasks specified in natural language remains a significant challenge, particularly in realistic settings where labeled data is scarce and real-time experimentation is impractical. Existing reinforcement learning (RL) approaches often struggle to generalize to unseen goals and states, limiting their applicability. In this paper, we introduce $\textit{TEDUO}$, a novel training pipeline for offline language-conditioned policy learning in symbolic environments. Unlike conventional methods, $\textit{TEDUO}$ operates on readily available, unlabeled datasets and addresses the challenge of generalization to previously unseen goals and states. Our approach harnesses large language models (LLMs) in a dual capacity: first, as automatization tools augmenting offline datasets with richer annotations, and second, as generalizable instruction-following agents. Empirical results demonstrate that $\textit{TEDUO}$ achieves data-efficient learning of robust language-conditioned policies, accomplishing tasks beyond the reach of conventional RL frameworks or out-of-the-box LLMs alone.
Paperid:3065
Authors:Mikael Møller Høgsgaard · Andrea Paudice
Abstract:
The Median of Means (MoM) is a mean estimator that has gained popularity in the context of heavy-tailed data. In this work, we analyze its performance in the task of simultaneously estimating the mean of each function in a class $\mathcal{F}$ when the data distribution possesses only the first $p$ moments for $p \in (1,2]$. We prove a new sample complexity bound using a novel symmetrization technique that may be of independent interest. Additionally, we present applications of our result to $k$-means clustering with unbounded inputs and linear regression with general losses, improving upon existing works.
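For reference, the basic MoM estimator analyzed in this line of work splits the sample into $k$ equal-size blocks and returns the median of the block means; a minimal sketch (the paper's uniform-over-$\mathcal{F}$ analysis is far beyond this):

```python
import statistics

def median_of_means(xs, k: int) -> float:
    """Median-of-Means: partition the sample into k equal-size blocks
    (dropping any remainder), average each block, and return the median
    of the block means. Robust to heavy tails and a few outliers."""
    m = len(xs) // k  # block size
    block_means = [sum(xs[i * m:(i + 1) * m]) / m for i in range(k)]
    return statistics.median(block_means)
```

Unlike the empirical mean, a single extreme observation corrupts at most one block, so it moves the median of block means very little.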
Paperid:3066
Authors:Zheyang Xiong · Jack Cai · John Cooper · Albert Ge · Vasilis Papageorgiou · Zack Sifakis · Angeliki Giannou · Ziqian Lin · Liu Yang · Saurabh Agarwal · Grigorios Chrysos · Samet Oymak · Kangwook Lee · Dimitris Papailiopoulos
Abstract:
Large Language Models (LLMs) have demonstrated remarkable in-context learning (ICL) capabilities. In this study, we explore a surprising phenomenon related to ICL: LLMs can perform multiple, computationally distinct ICL tasks simultaneously, during a single inference call, a capability we term "task superposition". We provide empirical evidence of this phenomenon across various LLM families and scales and show that this phenomenon emerges even if we train the model to in-context learn one task at a time. We offer theoretical explanations showing that this capability is well within the expressive power of transformers. We also explore how LLMs internally compose task vectors during superposition. Furthermore, we show that larger models can solve more ICL tasks in parallel, and better calibrate their output distribution. Our findings offer insights into the latent capabilities of LLMs, further substantiate the perspective of "LLMs as superposition of simulators", and raise questions about the mechanisms enabling simultaneous task execution.
Paperid:3067
Authors:Harit Vishwakarma · Alan Mishler · Thomas Cook · Niccolo Dalmasso · Natraj Raman · Sumitra Ganesh
Abstract:
Large language models (LLMs) are empowering decision-making in several applications, including tool or API usage and answering multiple-choice questions (MCQs). However, incorrect outputs pose significant risks in high-stakes domains like healthcare and finance. To quantify LLM uncertainty and thereby mitigate these risks, recent works employ conformal prediction (CP), a model- and distribution-agnostic framework that uses LLM outputs to generate a \emph{prediction set} containing the true answer with high probability. Leveraging CP, we propose \emph{conformal revision of questions} (CROQ), which revises the question by narrowing down the available choices to those in the prediction set and asking the LLM the revised question. We expect LLMs to be more accurate on revised questions with fewer choices. Furthermore, we expect CROQ to be effective when the prediction sets from CP are small. Commonly used logit scores often lead to large sets, diminishing CROQ's effectiveness. To overcome this, we propose CP-OPT, an optimization framework to learn scores that minimize set sizes while maintaining coverage. Our extensive experiments on MMLU, ToolAlpaca, and TruthfulQA datasets with multiple LLMs show that CROQ improves accuracy over the standard inference, with more pronounced gains when paired with CP-OPT.
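Prediction sets of the kind CROQ consumes can be built with standard split conformal calibration; a minimal sketch using the common nonconformity score $1 - p(\text{true label})$ (an assumption for illustration; the paper's learned CP-OPT scores differ):

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha: float) -> float:
    """Split conformal calibration: score each calibration example by
    1 - p(true label), then take the ceil((n+1)(1-alpha))-th smallest
    score as the threshold, guaranteeing ~(1-alpha) marginal coverage."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = min(n - 1, math.ceil((n + 1) * (1.0 - alpha)) - 1)
    return scores[k]

def prediction_set(probs, qhat: float):
    """All labels whose nonconformity score falls within the threshold."""
    return [y for y, p in enumerate(probs) if 1.0 - p <= qhat]
```

CROQ would then re-ask the MCQ restricted to the labels returned by `prediction_set`.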
Paperid:3068
Authors:Kaiwen Tang · Zhanglu Yan · Weng-Fai Wong
Abstract:
For reasons such as privacy, there are use cases for language models at the edge. This has given rise to small language models targeted for deployment in resource-constrained devices where energy efficiency is critical. Spiking neural networks (SNNs) offer a promising solution due to their energy efficiency, and there are already works on realizing transformer-based models on SNNs. However, key operations like softmax and layer normalization (LN) are difficult to implement on neuromorphic hardware, and many of these early works sidestepped them. To address these challenges, we introduce Sorbet, a transformer-based spiking language model that is more neuromorphic hardware-compatible. Sorbet incorporates a novel shifting-based softmax called PTsoftmax and a BitShifting-based PowerNorm (BSPN), both designed to replace the respective energy-intensive operations. By leveraging knowledge distillation and model quantization, Sorbet achieved a highly compressed binary weight model that maintains competitive performance while achieving $27.16\times$ energy savings compared to BERT. We validate Sorbet through extensive testing on the GLUE benchmark and a series of ablation studies, demonstrating its potential as an energy-efficient solution for language model inference.
Paperid:3069
Authors:Ruibo Chen · Yihan Wu · Junfeng Guo · Heng Huang
Abstract:
Watermarking techniques offer a promising way to identify machine-generated content by embedding covert information into content generated by language models (LMs). However, the robustness of such watermarking schemes has not been well explored. In this paper, we present De-mark, an advanced framework designed to remove n-gram-based watermarks effectively. Our method utilizes a novel querying strategy, termed random selection probing, which aids in assessing the strength of the watermark and identifying the red-green list within the n-gram watermark. Experiments on popular LMs, such as Llama3 and ChatGPT, demonstrate the efficiency and effectiveness of De-mark in watermark removal and exploitation tasks.
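The red-green list that De-mark probes is, in published n-gram watermark schemes, typically a hash-seeded pseudorandom partition of the vocabulary; a minimal illustrative sketch (an assumed generic construction, not the exact scheme attacked in the paper):

```python
import hashlib
import random

def green_list(prefix_ngram, vocab_size: int, gamma: float = 0.5):
    """Hash the preceding n-gram to seed a PRNG, then mark a fixed gamma
    fraction of token ids as 'green'. Watermarked generation over-samples
    green tokens; detection counts the green fraction in a text."""
    digest = hashlib.sha256(repr(prefix_ngram).encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])
```

Because the partition is a deterministic function of the preceding n-gram, repeated queries with controlled prefixes can expose it, which is the attack surface De-mark exploits.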
Paperid:3070
Authors:Sophia Hager · Aleem Khan · Andrew Wang · Nicholas Andrews
Abstract:
Most successful applications of deep learning involve similar training and test conditions. However, tasks such as biological sequence design involve searching for sequences that improve desirable properties beyond previously known values, which requires novel hypotheses that extrapolate beyond training data. In these settings, extrapolation may be achieved by using random search methods such as Markov chain Monte Carlo (MCMC), which, given an initial state, sample local transformations based on a proposal distribution. However, even with a well-designed proposal, MCMC may struggle to explore structured state spaces efficiently. Rather than relying on stochastic search, it would be desirable to have a model that greedily optimizes the properties of interest, successfully extrapolating in as few steps as possible. We propose to learn such a model from the Markov chains resulting from MCMC search. Specifically, our approach uses selected states from Markov chains as a source of training data for an autoregressive inference network, which is then able to efficiently generate novel sequences at test time that extrapolate along the sequence-level properties of interest. The proposed approach is validated on three problems: protein sequence design, text sentiment control, and text anonymization. We find that the learned inference network confers many of the same (and sometimes better) generalization benefits compared to the slow sampling process, but with the additional benefit of high sample efficiency.
Paperid:3071
Authors:Rui Min · Tianyu Pang · Chao Du · Qian Liu · Minhao Cheng · Min Lin
Abstract:
Chatbot Arena is an open platform for evaluating LLMs by pairwise battles, in which users vote for their preferred response from two randomly sampled anonymous models. While Chatbot Arena is widely regarded as a reliable LLM ranking leaderboard, we show that crowdsourced voting can be *rigged* to improve (or decrease) the ranking of a target model $m\_{t}$. We first introduce a straightforward **target-only rigging** strategy that focuses on new battles involving $m\_{t}$, identifying it via watermarking or a binary classifier, and exclusively voting for $m\_{t}$ wins. However, this strategy is practically inefficient because there are over $190$ models on Chatbot Arena and on average only about 1% of new battles will involve $m\_{t}$. To overcome this, we propose an **omnipresent rigging** strategy, which exploits the fact that, under the Elo rating mechanism of Chatbot Arena, any new vote on a battle can influence the ranking of the target model $m\_{t}$, even if $m\_{t}$ is not directly involved in the battle. We conduct experiments on around *1.7 million* historical votes from the Chatbot Arena Notebook, showing that the omnipresent rigging strategy can improve model rankings by rigging only *hundreds of* new votes. While we have evaluated several defense mechanisms, our findings highlight the importance of continued efforts to prevent vote rigging.
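The leverage behind omnipresent rigging comes from the rating update itself: every vote moves both participants' ratings, which can reorder the whole leaderboard. A standard logistic Elo update (illustrative textbook form with an assumed K-factor; Chatbot Arena's production computation may differ) looks like:

```python
def elo_update(r_winner: float, r_loser: float, k: float = 4.0):
    """One Elo update: expected win probability is logistic in the
    rating gap (400-point scale), and each vote shifts both ratings
    by at most k in opposite directions."""
    expected = 1.0 / (1.0 + 10.0 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected)
    return r_winner + delta, r_loser - delta
```

A vote between two non-target models still perturbs their ratings, and hence their ordering relative to the target, which is what the omnipresent strategy systematically exploits.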
Paperid:3072
Authors:Yongjae Shin · Jeonghye Kim · Whiyoung Jung · Sunghoon Hong · Deunsol Yoon · Youngsoo Jang · Geon-Hyeong Kim · Jongseong Chae · Youngchul Sung · Kanghoon Lee · Woohyung Lim
Abstract:
Reinforcement Learning (RL) has achieved notable success in tasks requiring complex decision-making. Offline RL focuses on training agents using fixed datasets, thereby avoiding the risks and costs that stem from online interactions. However, offline RL is inherently limited by the dataset's quality, which can restrict an agent’s performance. Offline-to-online RL aims to bridge the gap between the cost-efficiency of offline RL and the performance potential of online RL by pre-training an agent offline before fine-tuning it through online interactions. Despite its promise, recent studies show that offline pre-trained agents often underperform during online fine-tuning due to inaccurate value functions, with random initialization proving more effective in certain cases. In this work, we propose a novel method, Online Pre-Training for Offline-to-Online RL (OPT), to address the issue of inaccurate value estimation in offline pre-trained agents. OPT introduces a new learning phase, Online Pre-Training, which allows the training of a new value function that enhances the subsequent fine-tuning process. Implementation of OPT on TD3 and SPOT demonstrates an average 30\% improvement in performance across D4RL environments, such as MuJoCo, Antmaze, and Adroit.
Paperid:3073
Authors:Bardh Prenkaj · Efstratios Zaradoukas · Gjergji Kasneci
Abstract:
Counterfactual explainability seeks to uncover model decisions by identifying minimal changes to the input that alter the predicted outcome. This task becomes particularly challenging for graph data due to the need to preserve structural integrity and semantic meaning. Unlike prior approaches that rely on forward perturbation mechanisms, we introduce Graph Inverse Style Transfer (GIST), the first framework to reimagine graph counterfactual generation as a backtracking process, leveraging spectral style transfer. By aligning the global structure with the original input spectrum and preserving local content faithfulness, GIST produces valid counterfactuals as interpolations between the input style and counterfactual content. Tested on 8 binary and multi-class graph classification benchmarks, GIST achieves a remarkable +7.6% improvement in the validity of produced counterfactuals and significant gains (+45.5%) in faithfully explaining the true class distribution. Additionally, GIST's backtracking mechanism effectively mitigates overshooting the underlying predictor's decision boundary, minimizing the spectral differences between the input and the counterfactuals. These results challenge traditional forward perturbation methods, offering a novel perspective that advances graph explainability.
Paperid:3074
Authors:Nathaniel Lahn · Sharath Raghvendra · Emma Saarinen · Pouyan Shirzadian
Abstract:
Given a distance function $\mathrm{d}(\cdot,\cdot)$, the $p$-Wasserstein distance is the $p$th root of the cost of the cheapest transport plan, where the cost of transporting a unit mass between points $a$ and $b$ is given by $\mathrm{d}(a,b)^p$. Despite its strong theoretical properties, $p$-Wasserstein distance, for $p \ge 2$, has two drawbacks that limit its practical use: It is very sensitive to noise and lacks highly scalable algorithms. In this paper, we present new algorithms for the $p$-Wasserstein distance. First, when $\mathrm{d}(\cdot,\cdot)$ is a metric, for any fixed $p$, we present a simple and novel $O(\log n)$-approximation algorithm to compute the $p$-Wasserstein distance between any two discrete distributions with support size $n$ in $O(n^2 \log U \log \Delta \log n)$ time; here, $\log U$ is the number of bits required to represent the probability values and $\Delta$ is the ratio of the largest to the smallest edge costs, given by $\mathrm{d}(\cdot,\cdot)$. This is the first near-quadratic time relative approximation algorithm for the $p$-Wasserstein distance with an approximation factor that is sub-linear in $n$. We use $p$ hierarchically well-separated trees to define a distance that approximates the $p$-Wasserstein cost within a factor of $O(\log n)$. We then design a simple dynamic weighted bichromatic closest pair data structure with respect to the new distance and combine it with the classical primal-dual framework for optimal transport to obtain our algorithm. Second, for any arbitrary distance $\mathrm{d}(\cdot,\cdot)$, we show that a recent noise-resistant variant of the $p$-Wasserstein distance called the $p$-RPW distance can be approximated within an additive error of $\delta$ in $O(n^2/\delta^3)$ time. This is in contrast to the $p$-Wasserstein distance, where all existing combinatorial approaches require $\Omega(n^2/\delta^p)$ time.
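For intuition about the quantity being approximated: in one dimension with equal-size empirical supports, the optimal plan simply matches sorted samples, so the $p$-Wasserstein distance has a closed form (a textbook special case, not the paper's algorithm):

```python
def p_wasserstein_1d(xs, ys, p: float) -> float:
    """p-Wasserstein distance between two equal-size empirical
    distributions on the real line: the optimal transport plan matches
    sorted samples, so the distance is the l_p mean of sorted gaps."""
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    cost = sum(abs(a - b) ** p for a, b in zip(xs, ys)) / len(xs)
    return cost ** (1.0 / p)
```

The sensitivity the abstract mentions is visible here: for large $p$, a single widely separated pair of samples dominates the $\ell_p$ mean.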
Paperid:3075
Authors:Zhongtian Ma · Qiaosheng Zhang · Bocheng Zhou · Yexin Zhang · Shuyue Hu · Zhen Wang
Abstract:
Despite the growing popularity of graph attention mechanisms, their theoretical understanding remains limited. This paper aims to explore the conditions under which these mechanisms are effective in node classification tasks through the lens of Contextual Stochastic Block Models (CSBMs). Our theoretical analysis reveals that incorporating graph attention mechanisms is *not universally beneficial*. Specifically, by appropriately defining *structure noise* and *feature noise* in graphs, we show that graph attention mechanisms can enhance classification performance when structure noise exceeds feature noise. Conversely, when feature noise predominates, simpler graph convolution operations are more effective. Furthermore, we examine the over-smoothing phenomenon and show that, in the high signal-to-noise ratio (SNR) regime, graph convolutional networks suffer from over-smoothing, whereas graph attention mechanisms can effectively resolve this issue. Building on these insights, we propose a novel multi-layer Graph Attention Network (GAT) architecture that significantly outperforms single-layer GATs in achieving *perfect node classification* in CSBMs, relaxing the SNR requirement from $\omega(\sqrt{\log n})$ to $\omega(\sqrt{\log n} / \sqrt[3]{n})$. To our knowledge, this is the first study to delineate the conditions for perfect node classification using multi-layer GATs. Our theoretical contributions are corroborated by extensive experiments on both synthetic and real-world datasets, highlighting the practical implications of our findings.
Paperid:3076
Authors:Xize Liang · Chao Chen · Shuang Qiu · Jie Wang · Yue Wu · Zhihang Fu · Hanzhu Chen · Feng Wu · Jieping Ye
Abstract:
The prevalent noise in the preference data unavoidably poses significant challenges to the preference alignment of large language models (LLMs). Existing efforts for this problem either marginally alleviate the impact of noise without noise reduction, or rely on external LLMs that incur substantial computational costs. To address these challenges, we propose RObust Preference Optimization (ROPO), an iterative alignment approach that integrates noise-tolerance and noise filtering without the aid of external models. Specifically, ROPO first formulates the training process with adaptive noise reduction as an optimization problem, which can be efficiently solved in an iterative paradigm. Then, to equip this solving process with noise-tolerance and noise-identification capabilities, we derive a robust loss that suppresses the gradients from samples with high uncertainty. We demonstrate both empirically and theoretically that the derived loss is key to the noise-tolerance and effective filtering of noisy samples. The derived loss further inspires a robustness-guided rejection sampling technique to compensate for the potential important information in discarded queries. Extensive experiments on several widely-used datasets and model architectures demonstrate that ROPO significantly outperforms all baselines under four practical noise settings and the random symmetric noise, with its advantage increasing as the noise rate increases.
Paperid:3077
Authors:Yun Qu · Cheems Wang · Yixiu Mao · Yiqin Lv · Xiangyang Ji
Abstract:
Task-robust adaptation is a longstanding pursuit in sequential decision-making. Some risk-averse strategies, e.g., the conditional value-at-risk principle, are incorporated in domain randomization or meta reinforcement learning to prioritize difficult tasks in optimization, which demand costly intensive evaluations. The efficiency issue prompts the development of robust active task sampling to train adaptive policies, where risk-predictive models can surrogate policy evaluation. This work characterizes robust active task sampling as a secret Markov decision process, posits theoretical and practical insights, and constitutes robustness concepts in risk-averse scenarios. Importantly, we propose an easy-to-implement method, referred to as Posterior and Diversity Synergized Task Sampling (PDTS), to accommodate fast and robust sequential decision-making. Extensive experiments show that PDTS unlocks the potential of robust active task sampling, significantly improves the zero-shot and few-shot adaptation robustness in challenging tasks, and even accelerates the learning process under certain scenarios.
Paperid:3078
Authors:Qilong Wu · Yiyang Shao · Jun Wang · Xiaobo Sun
Abstract:
Leveraging high-quality joint representations from multimodal data can greatly enhance model performance in various machine-learning based applications. Recent multimodal learning methods, based on the multimodal information bottleneck (MIB) principle, aim to generate optimal MIB with maximal task-relevant information and minimal superfluous information via regularization. However, these methods often set regularization weights in an \textit{ad hoc} manner and overlook imbalanced task-relevant information across modalities, limiting their ability to achieve optimal MIB. To address this gap, we propose a novel multimodal learning framework, Optimal Multimodal Information Bottleneck (OMIB), whose optimization objective guarantees the achievability of optimal MIB by setting the regularization weight within a theoretically derived bound. OMIB further addresses imbalanced task-relevant information by dynamically adjusting regularization weights per modality, ensuring the inclusion of all task-relevant information. Moreover, we establish a solid information-theoretical foundation for OMIB's optimization and implement it under the variational approximation framework for computational efficiency. Finally, we empirically validate the OMIB’s theoretical properties on synthetic data and demonstrate its superiority over the state-of-the-art benchmark methods in various downstream tasks. The codes are available in Supplementary Material.
Paperid:3079
Authors:Perampalli Shravan Nayak · Xiangru Jian · Kevin Qinghong Lin · Juan A. Rodriguez · Montek Kalsi · Nicolas Chapados · M. Özsu · Aishwarya Agrawal · David Vazquez · Christopher Pal · Perouz Taslakian · Spandana Gella · Sai Rajeswar Mudumba
Abstract:
Developing autonomous agents that can navigate diverse Graphical User Interfaces (GUIs) and solve complex tasks is essential for understanding and accelerating human workflows. Currently, most top-performing models and benchmarks primarily focus on either web or mobile platforms, while desktop platforms -- equally prevalent in user interfaces -- are often overlooked due to licensing issues and the high cost of data collection. The complexity and variability of desktop GUIs present a major challenge in automation, making high-quality datasets an important step toward addressing these capabilities. In this work, we introduce UI-Vision, a comprehensive desktop-centric and license-permissive GUI understanding benchmark designed to tackle these challenges. UI-Vision features: (i) high-quality action trajectories recorded through human demonstrations and refined via expert human annotators; (ii) a wide range of annotations, including bounding boxes, action trajectories, and layout information; and (iii) significant diversity, spanning 83 software applications. Our evaluation reveals significant limitations in the ability of current models to effectively handle desktop environments, highlighting the considerable progress needed to achieve truly autonomous GUI agents capable of generalizing from visual cues. To support the broader research community, all our data and resources will be made fully open-source.
Paperid:3080
Authors:Qinglong Liu · Cong Xu · Wenhao Jiang · Kaixuan Wang · Lin Ma · Haifeng Li
Abstract:
Real-world time series inherently exhibit significant non-stationarity, posing substantial challenges for forecasting. To address this issue, this paper proposes a novel prediction framework, TimeStacker, designed to overcome the limitations of existing models in capturing the characteristics of non-stationary signals. By employing a unique stacking mechanism, TimeStacker effectively captures global signal features while thoroughly exploring local details. Furthermore, the framework integrates a frequency-based self-attention module, significantly enhancing its feature modeling capabilities. Experimental results demonstrate that TimeStacker achieves outstanding performance across multiple real-world datasets, including those from the energy, finance, and weather domains. It not only delivers superior predictive accuracy but also exhibits remarkable advantages with fewer parameters and higher computational efficiency.
Paperid:3081
Authors:Jingrong Xie · Han Ji · Yanan Sun
Abstract:
Automated architecture design methods, especially neural architecture search, have attracted increasing attention. However, these methods naturally need to evaluate numerous candidate architectures during the search process, making them computationally expensive and time-consuming. In this paper, we propose a prior knowledge guided neural architecture generation method to generate high-performance architectures without any search and evaluation process. Specifically, in order to identify valuable prior knowledge for architecture generation, we first quantify the contribution of each component within an architecture to its overall performance. Subsequently, a diffusion model guided by prior knowledge is presented, which can easily generate high-performance architectures for different computation tasks. Extensive experiments on new search spaces demonstrate that our method achieves superior accuracy over state-of-the-art methods. For example, we only need $0.004$ GPU Days to generate architecture with $76.1\%$ top-1 accuracy on ImageNet and $97.56\%$ on CIFAR-10. Furthermore, we can find competitive architecture for more unseen search spaces, such as TransNAS-Bench-101 and NATS-Bench, which demonstrates the broad applicability of the proposed method.
Paperid:3082
Authors:Anders Aamand · Justin Chen · Siddharth Gollapudi · Sandeep Silwal · Hao WU
Abstract:
We design improved approximation algorithms for NP-hard graph problems by incorporating predictions (e.g., learned from past data). Our prediction model builds upon and extends the $\varepsilon$-prediction framework by Cohen-Addad, d'Orsi, Gupta, Lee, and Panigrahi (NeurIPS 2024). We consider an edge-based version of this model, where each edge provides two bits of information, corresponding to predictions about whether each of its endpoints belong to an optimal solution. Even with weak predictions where each bit is only $\varepsilon$-correlated with the true solution, this information allows us to break approximation barriers in the standard setting. We develop algorithms with improved approximation ratios for MaxCut, Vertex Cover, Set Cover, and Maximum Independent Set problems (among others). Across these problems, our algorithms share a unifying theme, where we separately satisfy constraints related to high degree vertices (using predictions) and low-degree vertices (without using predictions) and carefully combine the answers.
Paperid:3083
Authors:Marvin F. da Silva · Felix Dangel · Sageev Oore
Abstract:
The concept of sharpness has been successfully applied to traditional architectures like MLPs and CNNs to predict their generalization. For transformers, however, recent work reported weak correlation between flatness and generalization. We argue that existing sharpness measures fail for transformers, because they have much richer symmetries in their attention mechanism that induce directions in parameter space along which the network or its loss remain identical. We posit that sharpness must account fully for these symmetries, and thus we redefine it on a quotient manifold that results from quotienting out the transformer symmetries, thereby removing their ambiguities. Leveraging tools from Riemannian geometry, we propose a fully general notion of sharpness, in terms of a geodesic ball on the symmetry-corrected quotient manifold. In practice, we need to resort to approximating the geodesics. Doing so up to first order yields existing adaptive sharpness measures, and we demonstrate that including higher-order terms is crucial to recover correlation with generalization. We present results on diagonal networks with synthetic data, and show that our geodesic sharpness reveals strong correlation for real-world transformers on both text and image classification tasks.
Paperid:3084
Authors:Wenhao SUN · Rong-Cheng Tu · Jingyi Liao · Zhao Jin · Dacheng Tao
Abstract:
Diffusion Transformers (DiTs) have proven effective in generating high-quality videos but are hindered by high computational costs. Existing video diffusion sampling acceleration methods often rely on costly fine-tuning or exhibit limited generalization capabilities. We propose Asymmetric Reduction and Restoration (AsymRnR), a training-free and model-agnostic method to accelerate video DiTs. It builds on the observation that redundancies of feature tokens in DiTs vary significantly across different model blocks, denoising steps, and feature types. Our AsymRnR asymmetrically reduces redundant tokens in the attention operation, achieving acceleration with negligible degradation in output quality and, in some cases, even improving it. We also tailored a reduction schedule to distribute the reduction across components adaptively. To further accelerate this process, we introduce a matching cache for more efficient reduction. Backed by theoretical foundations and extensive experimental validation, AsymRnR integrates into state-of-the-art video DiTs and offers substantial speedup.
Paperid:3085
Authors:Jialin Zhao · Yingtao Zhang · Carlo Cannistraci
Abstract:
The rapid growth of Large Language Models has driven demand for effective model compression techniques to reduce memory and computation costs. Low-rank pruning has gained attention for its tensor coherence and GPU compatibility across all densities. However, low-rank pruning has struggled to match the performance of semi-structured pruning, often doubling perplexity (PPL) at similar densities. In this paper, we propose Pivoting Factorization (**PIFA**), a novel **lossless** meta low-rank representation that unsupervisedly learns a **compact** form of any low-rank representation, effectively eliminating redundant information. PIFA identifies pivot rows (linearly independent rows) and expresses non-pivot rows as linear combinations, achieving an additional **24.2\%** memory savings and **24.6\%** faster inference over low-rank layers at $r/d = 0.5$, thereby significantly enhancing performance at the same density. To mitigate the performance degradation caused by low-rank pruning, we introduce a novel, retraining-free low-rank reconstruction method that minimizes error accumulation (**M**). **MPIFA**, combining M and PIFA into an end-to-end framework, significantly outperforms existing low-rank pruning methods and, for the first time, achieves performance comparable to semi-structured pruning, while surpassing it in GPU efficiency and compatibility.
Paperid:3086
Authors:Haorun Cai · Han-Jia Ye
Abstract:
Deep tabular models have demonstrated remarkable success on i.i.d. data, excelling in a variety of structured data tasks. However, their performance often deteriorates under temporal distribution shifts, where trends and periodic patterns are present in the evolving data distribution over time. In this paper, we explore the underlying reasons for this failure in capturing temporal dependencies. We begin by investigating the training protocol, revealing a key issue in how the data is split for model training and validation. While existing approaches typically use temporal ordering for splitting, we show that even a random split significantly improves model performance. By reducing both training lag and validation bias to achieve better generalization, our proposed splitting protocol offers substantial improvements across a variety of methods. Furthermore, we analyze how temporal data affects deep tabular representations, uncovering that these models often fail to capture crucial periodic and trend information. To address this gap, we introduce a plug-and-play temporal embedding based on Fourier series expansion to learn and incorporate temporal patterns, offering an adaptive approach to handling temporal shifts. Our experiments demonstrate that this temporal embedding, combined with the improved splitting strategy, provides a more effective and robust framework for learning from temporal tabular data.
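A minimal version of such a Fourier-series embedding maps a timestamp to sine/cosine features of a known period (an illustration of the general technique with hypothetical names; the paper's design may differ):

```python
import numpy as np

def fourier_time_embedding(t, period, k=2):
    """Map timestamps to 2k Fourier features sin/cos(2*pi*n*t/period),
    giving a model an explicit handle on periodic structure."""
    t = np.asarray(t, dtype=float)[:, None]
    n = np.arange(1, k + 1)[None, :]          # harmonics 1..k
    ang = 2.0 * np.pi * n * t / period
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=1)

emb = fourier_time_embedding([0.0, 6.0, 12.0, 18.0], period=24.0, k=2)
assert emb.shape == (4, 4)
# The embedding repeats with the period: t=0 and t=24 coincide.
assert np.allclose(fourier_time_embedding([0.0], 24.0),
                   fourier_time_embedding([24.0], 24.0))
```

Such features can be concatenated to the tabular inputs, letting the downstream network learn periodic effects without discovering the period itself.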
Paperid:3087
Authors:Vladimir Zaigrajew · Hubert Baniecki · Przemyslaw Biecek
Abstract:
Sparse autoencoders (SAEs) are useful for detecting and steering interpretable features in neural networks, with particular potential for understanding complex multimodal representations. Given their ability to uncover interpretable features, SAEs are particularly valuable for analyzing large-scale vision-language models (e.g., CLIP and SigLIP), which are fundamental building blocks in modern systems yet remain challenging to interpret and control. However, current SAE methods are limited in optimizing both reconstruction quality and sparsity simultaneously, as they rely on either activation suppression or rigid sparsity constraints. To this end, we introduce Matryoshka SAE (MSAE), a new architecture that learns hierarchical representations at multiple granularities simultaneously, enabling direct optimization of both metrics without compromise. MSAE establishes a new state-of-the-art Pareto frontier between reconstruction quality and sparsity for CLIP, achieving 0.99 cosine similarity and less than 0.1 fraction of variance unexplained while maintaining ~80% sparsity. Finally, we demonstrate the utility of MSAE as a tool for interpreting and controlling CLIP by extracting over 120 semantic concepts from its representation to perform concept-based similarity search and bias analysis in downstream tasks like CelebA.
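The nested-granularity idea can be caricatured as summing reconstruction losses over latent prefixes, so each coarse prefix must reconstruct on its own (a toy sketch assuming a linear decoder `D`; not the authors' exact objective):

```python
import numpy as np

def matryoshka_recon_loss(x, z, D, prefixes=(2, 4, 8)):
    """Sum of reconstruction errors when decoding from nested prefixes
    of the latent code z with decoder rows D (Matryoshka-style)."""
    return sum(np.mean((x - z[:, :m] @ D[:m]) ** 2) for m in prefixes)

# With an identity decoder, the full prefix reconstructs perfectly,
# while short prefixes pay for the latent dimensions they drop.
x = np.arange(24.0).reshape(3, 8)
z, D = x.copy(), np.eye(8)
assert matryoshka_recon_loss(x, z, D, prefixes=(8,)) == 0.0
assert matryoshka_recon_loss(x, z, D, prefixes=(2, 8)) > 0.0
```

Training against all prefixes at once is what lets one model serve several sparsity/fidelity operating points.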
Paperid:3088
Authors:YUTONG WU · Jie Zhang · Yiming Li · Chao Zhang · Qing Guo · Han Qiu · Nils Lukas · Tianwei Zhang
Abstract:
Vision Language Model (VLM) agents are stateful, autonomous entities capable of perceiving and interacting with their environments through vision and language. Multi-agent systems comprise specialized agents who collaborate to solve a (complex) task. A core security property is robustness, stating that the system maintains its integrity during adversarial attacks. Multi-agent systems lack robustness, as a successful exploit against one agent can spread and infect other agents to undermine the entire system's integrity. We propose a defense, Cowpox, to provably enhance the robustness of a multi-agent system via a distributed mechanism that improves the recovery rate of agents by limiting the expected number of infections of other agents. The core idea is to generate and distribute a special cure sample that immunizes an agent against the attack before exposure. We demonstrate the effectiveness of Cowpox empirically and provide theoretical robustness guarantees.
Paperid:3089
Authors:Max Milkert · David Hyde · Forrest Laine
Abstract:
In a neural network with ReLU activations, the number of piecewise linear regions in the output can grow exponentially with depth. However, this is highly unlikely to happen when the initial parameters are sampled randomly, which therefore often leads to the use of networks that are unnecessarily large. To address this problem, we introduce a novel parameterization of the network that restricts its weights so that a depth $d$ network produces exactly $2^d$ linear regions at initialization and maintains those regions throughout training under the parameterization. This approach allows us to learn approximations of convex, one-dimensional functions that are several orders of magnitude more accurate than their randomly initialized counterparts. We further demonstrate how to extend our approach to multidimensional and non-convex functions, allowing it to replace the dense layers in other networks; preliminary improvements are shown for image classification on CIFAR-10 and ImageNet.
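The exponential-region phenomenon the abstract refers to is classically illustrated by composing a two-ReLU "tent" map, which doubles the number of linear pieces at every layer (a standard construction shown for intuition, not the paper's parameterization):

```python
import numpy as np

def tent(x):
    """Two-ReLU tent map on [0,1]: 2x for x <= 1/2, 2(1-x) for x > 1/2."""
    relu = lambda v: np.maximum(v, 0.0)
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def sawtooth(x, d):
    """Depth-d composition: a piecewise-linear map with 2^d pieces."""
    for _ in range(d):
        x = tent(x)
    return x

# Count linear pieces by counting slope changes on a dyadic grid,
# chosen so every breakpoint k/2^d lands exactly on a grid point.
d = 4
xs = np.linspace(0.0, 1.0, 4097)
slopes = np.round(np.diff(sawtooth(xs, d)) / np.diff(xs), 6)
pieces = 1 + int(np.count_nonzero(np.diff(slopes)))
assert pieces == 2 ** d
```

A randomly initialized network of the same depth almost never realizes this many regions, which is the gap the proposed parameterization closes.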
Paperid:3090
Authors:Chenlu Ye · Yujia Jin · Alekh Agarwal · Tong Zhang
Abstract:
Typical contextual bandit algorithms assume that the rewards at each round lie in some fixed range $[0, R]$, and their regret scales polynomially with this reward range $R$. However, many practical scenarios naturally involve heavy-tailed rewards or rewards where the worst-case range can be substantially larger than the variance. In this paper, we develop an algorithmic approach building on Catoni's estimator from robust statistics, and apply it to contextual bandits with general function approximation. When the variance of the reward at each round is known, we use a variance-weighted regression approach and establish a regret bound that depends only on the cumulative reward variance and logarithmically on the reward range $R$ as well as the number of rounds $T$. For the unknown-variance case, we further propose a careful peeling-based algorithm and remove the need for cumbersome variance estimation. With additional dependence on the fourth moment, our algorithm also enjoys a variance-based bound with logarithmic reward-range dependence. Moreover, we demonstrate the optimality of the leading-order term in our regret bound through a matching lower bound.
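For reference, Catoni's estimator replaces the sample mean by the root of a damped influence function, which tames heavy tails; a minimal bisection sketch (the parameter `alpha` and the toy data are illustrative choices, not the paper's settings):

```python
import math

def catoni_psi(x):
    """Catoni's influence function: linear near 0, logarithmic in the tails."""
    return math.copysign(math.log(1.0 + abs(x) + x * x / 2.0), x)

def catoni_mean(xs, alpha=0.5, tol=1e-10):
    """Root theta of sum_i psi(alpha * (x_i - theta)) = 0 via bisection.
    The sum is decreasing in theta, so the root is unique."""
    f = lambda t: sum(catoni_psi(alpha * (x - t)) for x in xs)
    lo, hi = min(xs), max(xs)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return (lo + hi) / 2.0

# One huge outlier barely moves the estimate, unlike the sample mean.
data = [1.0] * 99 + [1000.0]
est = catoni_mean(data)
assert abs(est - 1.0) < 0.5
assert abs(sum(data) / len(data) - 1.0) > 9.0
```

The logarithmic growth of `catoni_psi` is what converts a polynomial dependence on the reward range into a logarithmic one.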
Paperid:3091
Authors:Qichao Wang · Ziqiao Meng · Wenqian Cui · Yifei Zhang · Pengcheng Wu · Bingzhe Wu · Irwin King · Liang Chen · Peilin Zhao
Abstract:
Drawing inspiration from the impressive capabilities of GPT-4, the idea of equipping speech language models to engage in natural, fluid spoken interactions with human users has gained considerable interest. A range of speech language models have recently been proposed, achieving initial success in facilitating seamless interactions. However, current models have yet to fully leverage dual-channel speech data, which inherently encapsulate human conversation patterns. In this study, we conduct a systematic exploration of dual-channel speech data usage within the context of contemporary large language models (LLMs), introducing a novel generative model for dual-channel speech learning with decoder-only architectures for the first time. We assess our methods using existing benchmarks, and empirical results show that dual-channel speech learning can enhance the conversational abilities of LLMs in numerous ways. Furthermore, in comparison to existing dual-channel modeling strategies, our innovative approaches offer significantly higher training and inference efficiency.
Paperid:3092
Authors:Mark Schoene · Babak Rahmani · Heiner Kremer · Fabian Falck · Hitesh Ballani · Jannes Gladrow
Abstract:
State-space models (SSMs) and transformers dominate the language modeling landscape. However, they are constrained to a lower computational complexity than classical recurrent neural networks (RNNs), limiting their expressivity. In contrast, RNNs lack parallelization during training, raising fundamental questions about the trade-off between parallelization and expressivity. We propose implicit SSMs, which iterate a transformation until convergence to a fixed point. Theoretically, we show that implicit SSMs implement the non-linear state transitions of RNNs. Empirically, we find that only approximate fixed-point convergence suffices, enabling the design of a scalable training curriculum that largely retains parallelization, with full convergence required only for a small subset of tokens. Our approach demonstrates superior state-tracking capabilities on regular languages, surpassing transformers and SSMs. We further scale implicit SSMs to natural language reasoning tasks and pretraining of large-scale language models up to 1.3B parameters on 207B tokens, representing, to our knowledge, the largest implicit model trained to date. Notably, our implicit models outperform their explicit counterparts on standard benchmarks.
Paperid:3093
Authors:Mouxiang Chen · Lefei Shen · Zhuo Li · Xiaoyun Wang · Jianling Sun · Chenghao Liu
Abstract:
Foundation models have emerged as a promising approach in time series forecasting (TSF). Existing approaches either repurpose large language models (LLMs) or build large-scale time series datasets to develop TSF foundation models for universal forecasting. However, these methods face challenges due to the severe cross-domain gap or in-domain heterogeneity. This paper explores a new road to building a TSF foundation model from rich, high-quality natural images. Our key insight is that a visual masked autoencoder, pre-trained on the ImageNet dataset, can naturally be a numeric series forecaster. By reformulating TSF as an image reconstruction task, we bridge the gap between image pre-training and TSF downstream tasks. Surprisingly, without further adaptation in the time series domain, the proposed VisionTS could achieve better zero-shot forecast performance than existing TSF foundation models. With fine-tuning for one epoch, VisionTS could further improve the forecasting and achieve state-of-the-art performance in most cases. Extensive experiments reveal intrinsic similarities between images and real-world time series, suggesting that visual models may offer a "free lunch" for TSF and highlight the potential for future cross-modality research. Our code is available in the Supplementary Material.
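The image-reconstruction reformulation can be pictured by folding a 1D series into a (cycles × period) grid, so that forecasting becomes inpainting the masked bottom rows of an "image"; a toy sketch of the folding step (the function name and masking details are assumptions):

```python
import numpy as np

def series_to_image(x, period):
    """Fold a 1D series into a 2D (cycles x period) grid so a visual
    masked autoencoder can treat forecasting as image inpainting."""
    n = (len(x) // period) * period   # drop a ragged final cycle
    return np.asarray(x[:n], dtype=float).reshape(-1, period)

img = series_to_image(np.arange(13.0), period=4)
assert img.shape == (3, 4)
assert np.allclose(img[1], [4.0, 5.0, 6.0, 7.0])
```

Each row is one seasonal cycle, so vertical structure in the "image" encodes the periodicity that the pre-trained visual model can exploit.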
Paperid:3094
Authors:Si-Yang Liu · Han-Jia Ye
Abstract:
TabPFN has emerged as a promising in-context learning model for tabular data, capable of directly predicting the labels of test samples given labeled training examples. It has demonstrated competitive performance, particularly on small-scale classification tasks. However, despite its effectiveness, TabPFN still requires further refinement in several areas, including handling high-dimensional features, aligning with downstream datasets, and scaling to larger datasets. In this paper, we revisit existing variants of TabPFN and observe that most approaches focus on reducing either bias or variance, often neglecting the other side, while also increasing inference overhead. To fill this gap, we propose Beta (Bagging and Encoder-based Fine-tuning for TabPFN Adaptation), a novel and effective method designed to minimize both bias and variance. To reduce bias, we introduce a lightweight encoder to better align downstream tasks with the pre-trained TabPFN. By increasing the number of encoders in a lightweight manner, Beta mitigates variance, thereby further improving the model’s performance. Additionally, bootstrapped sampling is employed to further reduce the impact of data perturbations on the model, all while maintaining computational efficiency during inference. Our approach enhances TabPFN’s ability to handle high-dimensional data and scale to larger datasets. Experimental results on over 200 benchmark classification datasets demonstrate that Beta either outperforms or matches state-of-the-art methods.
Paperid:3095
Authors:Seungjun Shin · Jaehoon Oh · Dokwan Oh
Abstract:
Attention mechanisms are central to the success of large language models (LLMs), enabling them to capture intricate token dependencies and implicitly assign importance to each token. Recent studies have revealed the sink token, which receives disproportionately high attention despite its limited semantic role. In this paper, we first examine the relationship between the sink token and other tokens, moving beyond attention to explore their similarity in hidden states as a function of layer depth. We observe that as the layers get deeper, the cosine similarity between the normalized hidden states of the sink token and those of other tokens increases, while the normalized hidden states of the sink token themselves exhibit negligible changes. This implies that other tokens are consistently directed toward the sink token throughout the layers. Next, we propose a dynamic token selection method, called OrthoRank, that uses these findings to select important tokens. Specifically, at a given layer, we define token importance by the speed at which a token moves toward the sink token. This is converted into orthogonality with the sink token, meaning that tokens more orthogonal to the sink token are assigned greater importance. Finally, through extensive experiments, we demonstrate that our method achieves lower perplexity and higher zero-shot accuracy than layer pruning methods at the same sparsity ratio with comparable throughput, while also achieving superior performance on LongBench.
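The orthogonality-based importance score can be sketched as follows (a toy distillation of the stated idea; the paper's exact score and normalization may differ):

```python
import numpy as np

def orthorank_scores(hidden, sink_idx=0):
    """Score tokens by how orthogonal their normalized hidden state is
    to the sink token's: more orthogonal -> more important."""
    H = hidden / np.linalg.norm(hidden, axis=1, keepdims=True)
    cos = H @ H[sink_idx]
    return 1.0 - np.abs(cos)            # the sink itself scores 0

hidden = np.array([[1.0, 0.0],   # sink token
                   [3.0, 0.0],   # parallel to the sink -> unimportant
                   [0.0, 2.0],   # orthogonal to the sink -> important
                   [1.0, 1.0]])  # in between
s = orthorank_scores(hidden)
assert np.isclose(s[0], 0.0) and s[2] > s[3] > s[1]
```

Tokens with the lowest scores are the ones already collapsing onto the sink direction and are the natural candidates to drop.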
Paperid:3096
Authors:Leyan Pan · Vijay Ganesh · Jacob Abernethy · Chris Esposo · Wenke Lee
Abstract:
We formally study the logical reasoning capabilities of decoder-only Transformers in the context of the boolean satisfiability (SAT) problem. First, we prove by construction that decoder-only Transformers can decide 3-SAT, in a non-uniform model of computation, using backtracking and deduction via Chain-of-Thought (CoT). Second, we implement our construction as a PyTorch model with a tool (PARAT) that we designed to empirically demonstrate its correctness and investigate its properties. Third, rather than \textit{programming} a transformer to reason, we evaluate empirically whether it can be \textit{trained} to do so by learning directly from algorithmic traces (``reasoning paths'') of our theoretical construction. The trained models demonstrate strong out-of-distribution generalization on problem sizes seen during training, but exhibit limited length generalization, consistent with the implications of our theoretical result.
Paperid:3097
Authors:Wonjun Lee · Doehyeon Lee · Eugene Choi · Sangyoon Yu · Ashkan Yousefpour · Haon Park · Bumsub Ham · Suhyun Kim
Abstract:
Current Vision Language Models (VLMs) remain vulnerable to malicious prompts that induce harmful outputs. Existing safety benchmarks for VLMs primarily rely on automated evaluation methods, but these methods struggle to detect implicit harmful content or produce inaccurate evaluations. As a result, we found that existing benchmarks exhibit low levels of harmfulness, ambiguous data, and limited diversity in image-text pair combinations. To address these issues, we propose the ELITE benchmark, a high-quality safety evaluation benchmark for VLMs, underpinned by our enhanced evaluation method, the ELITE evaluator. The ELITE evaluator explicitly incorporates a toxicity score to accurately assess harmfulness in multimodal contexts, where VLMs often provide specific, convincing, but harmless descriptions of images. We filter out ambiguous and low-quality image-text pairs from existing benchmarks using the ELITE evaluator and generate diverse combinations of safe and unsafe image-text pairs. Our experiments demonstrate that the ELITE evaluator achieves superior alignment with human evaluations compared to prior automated methods, and that the ELITE benchmark offers enhanced quality and diversity. By introducing ELITE, we pave the way for safer, more robust VLMs, contributing essential tools for evaluating and mitigating safety risks in real-world applications.
Paperid:3098
Authors:Chenrui Wu · Haishuai Wang · Xiang Zhang · Chengqi Zhang · Jiajun Bu
Abstract:
Time series analysis is crucial across various fields such as energy, environment, transportation, finance, and health. Deep learning has significantly advanced this field; in particular, the Time Series Foundation Model (TSFM) excels in multiple domains due to extensive pretraining. In this work, we focus on TSFM's challenges in medical practice: limited computing resources and medical data privacy. TSFM variants include fine-tuned models and those pre-trained for rapid deployment on diverse data. Hospitals may lack the computing resources to train on physiological signals locally, and generalized TSFMs still trail task-specific methods on private, imbalanced local data. To address this, we propose PhysioPFM, a framework for efficiently personalizing TSFMs. Our approach involves low-rank pre-training on public datasets, generator training on the trained LoRA weights, and efficient weight generation from local data. Experimental results demonstrate that integrating the generated models with a TSFM enhances performance and transferability while reducing the need for additional training on sensitive data.
Paperid:3099
Authors:Yicheng Pan · Renjie Chen · Pengyu Long · Bingchen Fan
Abstract:
Overlap and hierarchy are two prevalent phenomena in clustering, and they usually coexist in a single system. There are several studies on each of them separately, but it remains unclear how to characterize and evaluate the hybrid structures. To address this issue, we initiate the study of hierarchical overlapping clustering on graphs by introducing a new cost function for it. We show the rationality of our cost function via several intuitive properties, and develop an approximation algorithm that achieves a constant approximation factor for its dual version. Our algorithm is a recursive process of overlapping bipartition based on local search, which makes a sped-up version of it extremely scalable. Our experiments demonstrate that the sped-up algorithm performs significantly better than all baseline methods in both effectiveness and scalability on synthetic and real datasets.
Paperid:3100
Authors:Jiancong Xiao · Bojian Hou · Zhanliang Wang · Ruochen Jin · Qi Long · Weijie Su · Li Shen
Abstract:
One of the key technologies for the success of Large Language Models (LLMs) is preference alignment. However, a notable side effect of preference alignment is poor calibration: while the pretrained models are typically well-calibrated, LLMs tend to become poorly calibrated after alignment with human preferences. In this paper, we investigate why preference alignment affects calibration and how to address this issue. For the first question, we observe that the preference collapse issue in alignment undesirably generalizes to the calibration scenario, causing LLMs to exhibit overconfidence and poor calibration. To address this, we demonstrate the importance of fine-tuning with domain-specific knowledge to alleviate the overconfidence issue. To further analyze whether this affects the model's performance, we categorize models into two regimes: calibratable and non-calibratable, defined by bounds of Expected Calibration Error (ECE). In the calibratable regime, we propose a calibration-aware fine-tuning approach to achieve proper calibration without compromising LLMs' performance. However, as models are further fine-tuned for better performance, they enter the non-calibratable regime. For this case, we develop an EM-algorithm-based ECE regularization for the fine-tuning loss to maintain low calibration error. Extensive experiments validate the effectiveness of the proposed methods.
Paperid:3101
Authors:Wenbo Lu · Shaoyi Zheng · Yuxuan Xia · Shengjie Wang
Abstract:
Diffusion models excel at high-fidelity image generation but face scalability limits due to transformers' quadratic attention complexity. Plug-and-play token reduction methods like ToMeSD and ToFU achieve FLOP reduction by merging redundant tokens in generated images but rely on GPU-inefficient operations (e.g., sorting, scattered writes), causing overhead that negates theoretical speedups when paired with optimized attention implementations (e.g., FlashAttention). To bridge this gap, we propose Token Merge with Attention (ToMA), an off-the-shelf method that rethinks token reduction for GPU-aligned efficiency, with three major contributions: 1) a reformulation of token merging as a submodular optimization framework that selects diverse tokens, 2) merge/unmerge implemented as attention-like linear transformations via GPU-friendly matrix operations, and 3) exploitation of latent locality and temporal redundancy (pattern reuse) to minimize overhead. ToMA reduces SDXL/Flux-1 generation latency by 24\%/23\% (DINO $\Delta$ < 0.07), outperforming prior methods. This work bridges the gap between theoretical and practical efficiency for diffusion transformers.
Paperid:3102
Authors:Vincenzo De Paola · Riccardo Zamboni · Mirco Mutti · Marcello Restelli
Abstract:
Parallel data collection has redefined Reinforcement Learning (RL), unlocking unprecedented efficiency and powering breakthroughs in large-scale real-world applications. In this paradigm, $N$ identical agents operate in $N$ replicas of an environment simulator, accelerating data collection by a factor of $N$. A critical question arises: *Does specializing the policies of the parallel agents hold the key to surpassing the $N$-factor acceleration?* In this paper, we introduce a novel learning framework that maximizes the entropy of collected data in a parallel setting. Our approach carefully balances the entropy of individual agents with inter-agent diversity, effectively minimizing redundancies. The latter idea is implemented with a centralized policy gradient method, which shows promise when evaluated empirically against systems of identical agents, as well as synergy with batch RL techniques that can exploit data diversity. Finally, we provide an original concentration analysis that shows faster rates for specialized parallel sampling distributions, which supports our methodology and may be of independent interest.
Paperid:3103
Authors:Nikola Milosevic · Johannes Müller · Nico Scherf
Abstract:
Reinforcement Learning (RL) agents can solve diverse tasks but often exhibit unsafe behavior. Constrained Markov Decision Processes (CMDPs) address this by enforcing safety constraints, yet existing methods either sacrifice reward maximization or allow unsafe training. We introduce Constrained Trust Region Policy Optimization (C-TRPO), which reshapes the policy space geometry to ensure trust regions contain only safe policies, guaranteeing constraint satisfaction throughout training. We analyze its theoretical properties and connections to TRPO, Natural Policy Gradient (NPG), and Constrained Policy Optimization (CPO). Experiments show that C-TRPO reduces constraint violations while maintaining competitive returns.
Paperid:3104
Authors:Ulzee An · Moonseong Jeong · Simon Lee · Aditya Gorla · Yuzhe Yang · Sriram Sankararaman
Abstract:
Current challenges in developing foundation models for volumetric imaging data, such as magnetic resonance imaging (MRI), stem from the computational complexity of state-of-the-art architectures in high dimensions and from curating sufficiently large datasets of volumes. To address these challenges, we introduce Raptor (Random Planar Tensor Reduction), a training-free method for generating semantically rich embeddings for volumetric data. Raptor leverages a frozen 2D foundation model, pretrained on natural images, to extract visual tokens from individual cross-sections of medical volumes. These tokens are then spatially compressed using random projections, significantly reducing computational complexity while retaining rich semantic information. Extensive experiments on 10 diverse medical volume tasks verify the superior performance of Raptor over state-of-the-art methods, including those pretrained exclusively on medical volumes (+3 SuPreM, +6 MISFM, +10 Merlin, +13 VoCo, and +14 SLIViT), while entirely bypassing the need for costly training. Our results highlight Raptor's effectiveness and versatility as a foundation for advancing deep learning-based methods for medical volumes.
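The compression step can be caricatured with a fixed random projection over slice-token positions (a toy sketch; the shapes and names are assumptions, and the frozen 2D encoder is replaced here by pre-extracted tokens):

```python
import numpy as np

def raptor_compress(slice_tokens, out_dim, seed=0):
    """Compress per-slice token features of shape (S, T, D) into a
    fixed-size embedding by randomly projecting the S*T token
    positions down to out_dim, preserving the feature axis D."""
    S, T, D = slice_tokens.shape
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(out_dim, S * T)) / np.sqrt(S * T)
    return (P @ slice_tokens.reshape(S * T, D)).ravel()  # (out_dim * D,)

tokens = np.random.default_rng(1).normal(size=(8, 16, 4))
emb = raptor_compress(tokens, out_dim=5)
assert emb.shape == (20,)
# The projection is fixed by the seed, so embeddings are reproducible
# across volumes without any training.
assert np.allclose(emb, raptor_compress(tokens, out_dim=5))
```

Because the projection matrix is random and fixed, no gradient updates are ever needed, which is what makes the pipeline training-free.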
Paperid:3105
Authors:Haoqi Wang · Mathieu Salzmann · Tong Zhang
Abstract:
Large transformer models are known to produce high-norm tokens. In vision transformers (ViTs), such tokens have been mathematically modeled through the singular vectors of the linear approximations of layers. However, in large language models (LLMs), the underlying causes of high-norm tokens remain largely unexplored, and their properties, which differ from those in ViTs, require a new analysis framework. In this paper, we provide both theoretical insights and empirical validation across a range of recent models, leading to the following observations: i) the layer-wise singular direction predicts the abrupt explosion of token norms in LLMs; ii) the negative eigenvalues of a layer explain its sudden decay; iii) the computational pathways leading to high-norm tokens differ between initial and non-initial tokens; iv) high-norm tokens are triggered by the right leading singular vector of the matrix approximating the corresponding modules. We showcase two practical applications of these findings: the improvement of quantization schemes and the design of LLM signatures. Our findings not only advance the understanding of singular defects in LLMs but also open new avenues for their application. We expect this work to stimulate further research into the internal mechanisms of LLMs, and we will therefore publicly release our code.
Paperid:3106
Authors:Yanbin Wei · Xuehao Wang · Zhan Zhuang · Yang Chen · Shuhao Chen · Yulong Zhang · James Kwok · Yu Zhang
Abstract:
Message-passing graph neural networks (MPNNs) and structural features (SFs) are cornerstones for the link prediction task. However, as a common and intuitive mode of understanding, the potential of visual perception has been overlooked in the MPNN community. For the first time, we equip MPNNs with vision structural awareness by proposing an effective framework called Graph Vision Network (GVN), along with a more efficient variant (E-GVN). Extensive empirical results demonstrate that with the proposed frameworks, GVN consistently benefits from the vision enhancement across seven link prediction datasets, including challenging large-scale graphs. Such improvements are compatible with existing state-of-the-art (SOTA) methods and GVNs achieve new SOTA results, thereby underscoring a promising novel direction for link prediction.
Paperid:3107
Authors:Jue Gong · Jingkai Wang · Zheng Chen · Xing Liu · Hong Gu · Yulun Zhang · Xiaokang Yang
Abstract:
Human body restoration, as a specific application of image restoration, is widely applied in practice and plays a vital role across diverse fields. However, thorough research remains difficult, particularly due to the lack of benchmark datasets. In this study, we propose a high-quality dataset automated cropping and filtering (HQ-ACF) pipeline. This pipeline leverages existing object detection datasets and other unlabeled images to automatically crop and filter high-quality human images. Using this pipeline, we constructed a person-based restoration with sophisticated objects and natural activities (PERSONA) dataset, which includes training, validation, and test sets. The dataset significantly surpasses other human-related datasets in both quality and content richness. Finally, we propose OSDHuman, a novel one-step diffusion model for human body restoration. Specifically, we propose a high-fidelity image embedder (HFIE) as the prompt generator to better guide the model with low-quality human image information, effectively avoiding misleading prompts. Experimental results show that OSDHuman outperforms existing methods in both visual quality and quantitative metrics. The dataset and code will be released soon.
Paperid:3108
Authors:Peiyan Zhang · Haibo Jin · Leyang Hu · Xinnuo Li · Liying Kang · Man Luo · Yangqiu Song · Haohan Wang
Abstract:
Recent advancements in large language models (LLMs) have significantly enhanced the ability of LLM-based systems to perform complex tasks through natural language processing and tool interaction. However, optimizing these LLM-based systems for specific tasks remains challenging, often requiring manual interventions like prompt engineering and hyperparameter tuning. Existing automatic optimization methods, such as textual feedback-based techniques (e.g., TextGrad), tend to focus on immediate feedback, analogous to using immediate derivatives in traditional numerical gradient descent. However, relying solely on such feedback can be limited when the adjustments made in response to this feedback are either too small or fluctuate irregularly, potentially slowing down or even stalling the optimization process. To overcome these challenges, more adaptive methods are needed, especially in situations where the system's response is evolving slowly or unpredictably. In this paper, we introduce REVOLVE, an optimization method that tracks how "R"esponses "EVOLVE" across iterations in LLM systems. By focusing on the evolution of responses over time, REVOLVE enables more stable and effective optimization by making thoughtful, progressive adjustments at each step. Experimental results demonstrate that REVOLVE outperforms competitive baselines, achieving a 7.8% improvement in prompt optimization, a 20.72% gain in solution refinement, and a 29.17% increase in code optimization. Additionally, REVOLVE converges in fewer iterations, resulting in significant computational savings. These advantages highlight its adaptability and efficiency, positioning REVOLVE as a valuable tool for optimizing LLM-based systems and accelerating the development of next-generation AI technologies.
Paperid:3109
Authors:Chang Cao · Han Li · Yulong Wang · Rui Wu · Hong Chen
Abstract:
While Graph Neural Networks (GNNs) have shown outstanding performance in node classification tasks, they are vulnerable to adversarial attacks, which are imperceptible changes to input samples. Adversarial training, as a widely used tool to enhance the adversarial robustness of GNNs, has presented remarkable effectiveness in node classification tasks. However, the generalization properties explaining their behaviors remain poorly understood from a theoretical viewpoint. To fill this gap, we develop a high-probability generalization bound for general GNNs in adversarial learning through covering number analysis. We estimate the covering number of the GNN model class based on the entire perturbed feature matrix by constructing a cover for the perturbation set. Our results are generally applicable to a series of GNNs. We demonstrate their applicability by investigating the generalization performance of several popular GNN models under adversarial attacks, which reveals the architectural factors influencing the generalization gap. Our experimental results on benchmark datasets provide evidence that supports the established theoretical findings.
Paperid:3110
Authors:Siyuan Duan · Wenyuan Wu · Peng Hu · Zhenwen Ren · Dezhong Peng · Yuan Sun
Abstract:
Physics-informed neural networks (PINNs) aim to constrain the outputs and gradients of deep learning models to satisfy specified governing physics equations, and have demonstrated significant potential for solving partial differential equations (PDEs). Although existing PINN methods have achieved pleasing performance, they treat easy and hard sample points indiscriminately, especially those near physical boundaries. This easily causes the PINN model to fall into undesirable local minima and unstable learning, resulting in an Unbalanced Prediction Problem (UPP). To deal with this daunting problem, we propose a novel framework named Cognitive Physical Informed Neural Network (CoPINN) that imitates the human cognitive manner of learning from easy to hard. Specifically, we first employ separable subnetworks to encode independent one-dimensional coordinates and apply an aggregation scheme to generate multi-dimensional predicted physical variables. Then, during the training phase, we dynamically evaluate the difficulty of each sample according to the gradient of the PDE residuals. Finally, we propose a cognitive training scheduler that progressively optimizes the entire sampling region from easy to hard, thereby improving robustness and generalization in predicting physical boundary regions. Extensive experiments demonstrate that CoPINN achieves state-of-the-art performance, significantly reducing prediction errors in stubborn regions.
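The easy-to-hard idea can be caricatured with residual-based sample weights that relax toward uniform as training progresses (a hypothetical scheduler sketched for intuition, not the authors' exact scheme):

```python
import numpy as np

def curriculum_weights(residuals, progress, sharpness=5.0):
    """Down-weight hard (high-residual) collocation points early
    (progress ~ 0) and approach uniform weighting by the end
    (progress ~ 1), mimicking easy-to-hard curriculum training."""
    difficulty = np.abs(residuals) / np.max(np.abs(residuals))
    w = np.exp(-(1.0 - progress) * sharpness * difficulty)
    return w / w.sum()

res = np.array([0.1, 1.0])            # one easy point, one hard point
early = curriculum_weights(res, progress=0.0)
late = curriculum_weights(res, progress=1.0)
assert early[0] > early[1]            # early: favor the easy point
assert np.allclose(late, 0.5)         # late: uniform weighting
```

The weights would multiply the per-point PDE residual loss, so hard boundary-region points only dominate the objective once the easier interior is fit.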
Paperid:3111
Authors:Xiner Li · Limei Wang · Youzhi Luo · Carl Edwards · Shurui Gui · Yuchao Lin · Heng Ji · Shuiwang Ji
Abstract:
We consider molecule generation in 3D space using language models (LMs), which requires discrete tokenization of 3D molecular geometries. Although tokenization of molecular graphs exists, that for 3D geometries is largely unexplored. Here, we attempt to bridge this gap by proposing a novel method which converts molecular geometries into SE(3)-invariant 1D discrete sequences. Our method consists of canonical labeling and invariant spherical representation steps, which together maintain geometric and atomic fidelity in a format conducive to LMs. Our experiments show that, when coupled with our proposed method, various LMs excel in molecular geometry generation, especially in controlled generation tasks.
Paperid:3112
Authors:Jennifer Hsia · Afreen Shaikh · Zhiruo Wang · Graham Neubig
Abstract:
Retrieval-augmented generation (RAG) enhances language models by integrating external knowledge, but its effectiveness is highly dependent on system configuration. Improper retrieval settings can degrade performance, making RAG less reliable than closed-book generation. In this work, we introduce RAGGED, a framework for systematically evaluating RAG systems across diverse retriever-reader configurations, retrieval depths, and datasets. Our analysis reveals that reader robustness to noise is the key determinant of RAG stability and scalability. Some readers benefit from increased retrieval depth, while others degrade due to their sensitivity to distracting content. Through large-scale experiments on open-domain, multi-hop, and specialized-domain datasets, we show that retrievers, rerankers, and prompts influence performance but do not fundamentally alter these reader-driven trends. By providing a principled framework and new metrics to assess RAG stability and scalability, RAGGED enables systematic evaluation of retrieval-augmented generation systems, guiding future research on optimizing retrieval depth and model robustness.
Paperid:3113
Authors:Edward Chang
Abstract:
This paper introduces a three-branch checks-and-balances framework for ethical alignment of Large Language Models (LLMs), inspired by the idea of collaborative intelligence. It implements three independent yet interacting components: LLMs as the executive branch for knowledge generation, DIKE (the goddess of justice) as the legislative branch establishing ethical guardrails, and ERIS (the goddess of discord) as the judicial branch for contextual interpretation. The adversarial DIKE-ERIS duality enables adaptation to diverse cultural contexts while upholding consistent ethical principles. This architecture addresses limitations of reinforcement learning with human feedback (RLHF) by providing interpretable, adaptable, and culturally-aware ethical reasoning. Through self-supervised learning and adversarial testing, our framework demonstrates how emotional modeling can guide linguistic behaviors toward ethical outcomes while preserving independence across knowledge generation, ethical oversight, and contextual interpretation.
Paperid:3114
Authors:Sagnik Mukherjee · Abhinav Chinta · Takyoung Kim · Tarun Anoop Sharma · Dilek Hakkani-Tür
Abstract:
Chain-of-Thought (CoT) prompting enhances mathematical reasoning in large language models (LLMs) by enabling detailed step-by-step solutions. However, due to the verbosity of LLMs, the resulting reasoning chains can be long, making it harder to verify the reasoning steps and trace issues resulting from dependencies between steps that may be far apart in the sequence. Importantly, mathematical reasoning allows each step to be derived from a small set of premises, which are a subset of the preceding steps in the reasoning chain. In this paper, we present a framework that identifies the premises for each step to improve the evaluation of reasoning. We restructure conventional linear reasoning chains into Premise Augmented Reasoning Chains (PARC) by introducing premise links, resulting in a directed acyclic graph where the nodes are the steps and the edges are the premise links. Through experiments with a PARC-based dataset that we built, namely (Premises and ERrors identification in LLMs), we demonstrate that LLMs can reliably identify premises within complex reasoning chains. In particular, even open-source LLMs achieve 90\% recall in premise identification. We also show that PARC helps to identify errors in reasoning chains more reliably: the accuracy of error identification improves by 6\% to 16\% absolute when step-by-step verification is carried out in PARC under the premises. Our findings highlight the utility of premise-centric representations in addressing complex problem-solving tasks and open new avenues for improving the reliability of LLM-based reasoning evaluations.
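The premise links that turn a linear chain into a DAG can be represented with a plain dictionary. The step texts and helper functions below are hypothetical illustrations, not from the paper:

```python
# A linear chain of steps becomes a DAG once each step records its premises
# (indices of the earlier steps it is derived from).
steps = {
    1: "Let x = 3.",
    2: "Let y = 4.",
    3: "x + y = 7.",         # derived from steps 1 and 2
    4: "2 * (x + y) = 14.",  # derived from step 3
}
premises = {1: [], 2: [], 3: [1, 2], 4: [3]}

def is_valid_parc(premises):
    """Premise links must point strictly backwards, so the graph is a DAG."""
    return all(p < step for step, ps in premises.items() for p in ps)

def relevant_context(step, premises):
    """Transitive closure of premises: all steps needed to verify `step`."""
    needed, stack = set(), list(premises[step])
    while stack:
        s = stack.pop()
        if s not in needed:
            needed.add(s)
            stack.extend(premises[s])
    return sorted(needed)

print(is_valid_parc(premises))        # True
print(relevant_context(4, premises))  # [1, 2, 3]
```

Verifying a step against only its `relevant_context`, rather than the whole chain, is what makes step-by-step verification "under the premises" tractable for long chains.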
Paperid:3115
Authors:Wendong Bu · Yang Wu · Qifan Yu · Minghe Gao · Bingchen Miao · Zhenkui Zhang · Kaihang Pan · liyunfei · Mengze Li · Wei Ji · Juncheng Li · Siliang Tang · Yueting Zhuang
Abstract:
As multimodal large language models (MLLMs) advance, MLLM-based virtual agents have demonstrated remarkable performance. However, existing benchmarks face significant limitations, including uncontrollable task complexity, extensive manual annotation, and a lack of multidimensional evaluation. In response to these challenges, we introduce OmniBench, a self-generating, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity through subtask composition. To evaluate the diverse capabilities of virtual agents on the graph, we further present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities. Our synthesized dataset contains 36k graph-structured tasks across 20 scenarios, achieving a 91% human acceptance rate. Training on our graph-structured data shows that it improves generalization across environments. We conduct multidimensional evaluations for virtual agents, revealing their performance across various capabilities and paving the way for future advancements. The code is available in https://anonymous.4open.science/r/OmniBench-E8FD.
Paperid:3116
Authors:Ruonan Yu · Songhua Liu · Zhenxiong Tan · Xinchao Wang
Abstract:
Text-to-image diffusion models have achieved remarkable progress in recent years. However, generating high-resolution images remains challenging, particularly when training data and computational resources are limited. In this paper, we explore this practical problem from two key perspectives: data and parameter efficiency. For data efficiency, we theoretically and empirically demonstrate that synthetic data generated by teacher models can significantly promote training convergence. For parameter efficiency, we find that tuning minor components of the weight matrices outperforms widely-used low-rank adapters when synthetic data are unavailable, offering substantial performance gains while maintaining efficiency. Additionally, for models leveraging guidance distillation, such as FLUX, we show that disabling classifier-free guidance, i.e., setting the guidance scale to 1 during adaptation, is crucial for satisfactory performance. Extensive experiments validate our approach, achieving comparable 2K-generation performance to state-of-the-art closed-source models like FLUX1.1 Pro Ultra with only 3K samples and 2K iterations, while setting new benchmarks for 4K-resolution generation.
Paperid:3117
Authors:He ZHANG · Ming Zhou · shaopeng zhai · Ying Sun · Hui Xiong
Abstract:
Unsupervised skill discovery aims to learn diverse and distinguishable behaviors in open-ended reinforcement learning. Existing methods focus on improving diversity via pure exploration, mutual information optimization, and learning temporal representations. Although they perform well on exploration, they remain limited in efficiency, especially in high-dimensional situations. In this work, we frame skill discovery as a min-max game between skill generation and policy learning, and propose a regret-aware method on top of temporal representation learning that expands the discovered skill space along the direction of upgradable policy strength. The key insight behind the proposed method is that skill discovery is adversarial to policy learning: skills of weak strength should be explored further, while skills whose strength has converged need less exploration. As an implementation, we score the degree of strength convergence with regret and guide skill discovery with a learnable skill generator. To avoid degeneration, skills are generated from an upgradable population of skill generators. We conduct experiments on environments of varying complexity and dimension size. Empirical results show that our method outperforms baselines in both efficiency and diversity. Moreover, our method achieves a 15\% zero-shot improvement on high-dimensional environments compared to existing methods.
Paperid:3118
Authors:Utkarsh Saxena · Sayeh Sharify · Kaushik Roy · Xin Wang
Abstract:
Post-training quantization (PTQ) of large language models (LLMs) holds promise for reducing the prohibitive computational cost at inference time. Quantizing all weight, activation, and key-value (KV) cache tensors to 4-bit without significantly degrading generalizability is challenging, due to the high quantization error caused by extreme outliers in activations. To tackle this problem, we propose ResQ, a PTQ method that pushes the state-of-the-art further. By means of principal component analysis (PCA), it identifies a low-rank subspace (in practice 1/8 of the hidden dimension) in which activation variances are highest, and keeps the coefficients within this subspace in high precision, e.g.~8-bit, while quantizing the rest to 4-bit. Within each subspace, an invariant random rotation is applied to further suppress outliers. We show that this is a provably optimal mixed-precision quantization scheme that minimizes error. With the Llama and Qwen2.5 families of models, we demonstrate that ResQ outperforms recent uniform and mixed-precision PTQ methods on a variety of benchmarks, achieving up to 33\% lower perplexity on Wikitext than the next best method, SpinQuant, and up to 3× speedup over a 16-bit baseline. Anonymous code repository available at https://anonymous.4open.science/r/project-resq-2142.
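A minimal sketch of the core idea, assuming per-tensor symmetric quantization and omitting the method's random rotation and per-channel details: rotate activations into their PCA basis, keep the top-variance 1/8 of coordinates in 8-bit, and quantize the rest to 4-bit:

```python
import numpy as np

def quantize(x, bits):
    """Symmetric uniform quantization, per tensor (an assumption)."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(x / scale) * scale

def mixed_precision(X, keep_frac=1/8):
    """Keep the highest-variance PCA subspace in 8-bit, the rest in 4-bit."""
    X = X - X.mean(0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)  # principal directions
    k = max(1, int(keep_frac * X.shape[1]))
    Z = X @ Vt.T                              # rotate into the PCA basis
    Z[:, :k] = quantize(Z[:, :k], bits=8)     # high-variance coords: 8-bit
    Z[:, k:] = quantize(Z[:, k:], bits=4)     # remainder: 4-bit
    return Z @ Vt                             # rotate back

rng = np.random.default_rng(0)
# toy activations whose coordinates have very unequal variances
X = rng.normal(size=(256, 64)) @ np.diag(np.linspace(5.0, 0.1, 64))
Xc = X - X.mean(0)
err_mixed = np.linalg.norm(Xc - mixed_precision(X))
err_4bit = np.linalg.norm(Xc - quantize(Xc, bits=4))
print(err_mixed < err_4bit)  # keeping the loud subspace in 8-bit helps
```

On this toy data the mixed scheme beats uniform 4-bit because the few high-variance directions, which dominate the error, get the extra bits.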
Paperid:3119
Authors:Hongling Zheng · Li Shen · Yong Luo · Deheng Ye · Bo Du · Jialie SHEN · Dacheng Tao
Abstract:
The Conditional Sequence Modeling (CSM) paradigm, benefiting from the transformer's powerful distribution modeling capabilities, has demonstrated considerable promise in offline Reinforcement Learning (RL) tasks. Depending on the task's nature, it is crucial to carefully balance the interplay between inherent local features and long-term dependencies in Markov decision trajectories to mitigate potential performance degradation and unnecessary computational overhead. In this paper, we propose Decision Mixer (DM), which addresses the conflict between features of different scales in the modeling process from the perspective of dynamic integration. Drawing inspiration from conditional computation, we design a plug-and-play dynamic token selection mechanism to ensure the model can effectively allocate attention to different features based on task characteristics. Additionally, we employ an auxiliary predictor to alleviate the short-sightedness issue in the autoregressive sampling process. DM achieves state-of-the-art performance on various standard RL benchmarks while requiring significantly fewer computational resources, offering a viable solution for building efficient and scalable RL foundation models. Code is available here.
Paperid:3120
Authors:Nick Bishop · Daniel Jarne Ornia · Joel Dyer · Anisoara Calinescu · Michael Wooldridge
Abstract:
Simulation modeling offers a flexible approach to constructing high-fidelity synthetic representations of complex real-world systems. However, the increased complexity of such models introduces additional complications, for example when carrying out statistical inference procedures. This has motivated a large and growing literature on likelihood-free or simulation-based inference methods, which approximate (e.g., Bayesian) inference without assuming access to the simulator's intractable likelihood function. A hitherto neglected problem in the simulation-based Bayesian inference literature is the challenge of constructing uninformative reference priors for complex simulation models. Such priors maximise an expected Kullback-Leibler distance from the prior to the posterior, thereby influencing posterior inferences minimally and enabling an "objective" approach to Bayesian inference that does not necessitate the incorporation of strong subjective prior beliefs. In this paper, we propose and test a selection of likelihood-free methods for learning reference priors for simulation models, using variational approximations to these priors and a variety of mutual information estimators. Our experiments demonstrate that good approximations to reference priors for simulation models are attainable in this way, providing a first step towards the development of likelihood-free objective Bayesian inference procedures.
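The quantity a reference prior maximises, the expected prior-to-posterior KL distance, equals the mutual information I(theta; X). A nested Monte Carlo estimate of it is a few lines; the Binomial toy model below is an assumption for illustration, not the paper's variational estimator:

```python
import numpy as np

def mutual_information(prior_sample, simulate, loglik,
                       n_outer=500, n_inner=500, seed=0):
    """Nested Monte Carlo estimate of I(theta; X) = E[log p(x|theta) - log p(x)].

    prior_sample(rng, n) -> theta samples; simulate(rng, theta) -> data;
    loglik(x, theta) -> log p(x | theta) (up to a constant in theta).
    """
    rng = np.random.default_rng(seed)
    thetas = prior_sample(rng, n_outer)
    inner = prior_sample(rng, n_inner)
    total = 0.0
    for th in thetas:
        x = simulate(rng, th)
        # log p(x) approximated by averaging the likelihood over the prior
        log_marginal = np.log(np.mean(np.exp(loglik(x, inner))) + 1e-300)
        total += loglik(x, th) - log_marginal
    return total / n_outer

# toy model: x ~ Binomial(20, theta), theta ~ Uniform(0, 1)
n_trials = 20
prior_sample = lambda rng, n: rng.uniform(0, 1, size=n)
simulate = lambda rng, th: rng.binomial(n_trials, th)
loglik = lambda x, th: (x * np.log(th + 1e-12)
                        + (n_trials - x) * np.log(1 - th + 1e-12))

mi = mutual_information(prior_sample, simulate, loglik)
print(mi > 0)  # the data are informative about theta
```

Learning a reference prior then amounts to adjusting the prior's parameters to push this estimate up, which is where the paper's variational approximations and MI estimators come in.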
Paperid:3121
Authors:Xiaoqian Jiang · Jing Zhang
Abstract:
Many federated learning scenarios encounter label noise in the client-side datasets. The resulting degradation in global model performance raises the urgent need to address label noise. This paper proposes FedClean, a novel general robust label-noise correction method for federated learning. FedClean first uses local centralized noisy-label learning to select clean samples to train a global model. Then, it employs a two-stage correction scheme that corrects noisy labels from two distinct perspectives: local noisy-label learning and the global model. FedClean also proposes a novel model aggregation method, further reducing the impact of label noise. FedClean assumes neither the existence of clean clients nor specific noise distributions, showing maximum versatility. Extensive experimental results show that FedClean effectively identifies and rectifies label noise even if all clients exhibit label noise, outperforming state-of-the-art noisy-label learning methods for federated learning.
Paperid:3122
Authors:Qinglin Zhu · Runcong Zhao · Hanqi Yan · Yulan He · Yudong Chen · Lin Gui
Abstract:
Large Language Models (LLMs) struggle with complex reasoning due to limited diversity and inefficient search. We propose an embedding-based search framework that optimises the embedding of the first token to guide generation. It combines (1) Embedding perturbation for controlled exploration and (2) Bayesian optimisation to refine embeddings via a verifier-guided objective, balancing exploration and exploitation. This approach improves reasoning accuracy and coherence while avoiding reliance on heuristic search. Experiments demonstrate superior correctness with minimal computation, making it a scalable, model-agnostic solution.
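A stripped-down sketch of the search loop, with plain random perturbation standing in for the paper's Bayesian-optimisation refinement; the verifier, dimensions, and hyperparameters are toy assumptions:

```python
import numpy as np

def embedding_search(embed0, verifier, n_rounds=20, n_candidates=8,
                     sigma=0.5, seed=0):
    """Perturb a first-token embedding and keep the best-scoring candidate.

    `verifier` scores an embedding (higher is better); in the paper this
    role is played by a verifier-guided objective on the generated text.
    """
    rng = np.random.default_rng(seed)
    best, best_score = embed0, verifier(embed0)
    for _ in range(n_rounds):
        noise = rng.normal(scale=sigma, size=(n_candidates, embed0.size))
        for cand in best + noise:      # explore around the incumbent
            score = verifier(cand)
            if score > best_score:     # exploit the best found so far
                best, best_score = cand, score
    return best, best_score

# toy verifier: prefers embeddings close to a hidden "good" direction
target = np.ones(16)
verifier = lambda e: -np.linalg.norm(e - target)
start = np.zeros(16)
best, score = embedding_search(start, verifier)
print(score > verifier(start))  # the search improves the verifier score
```

Replacing the inner loop's random proposals with an acquisition function over a surrogate model is what turns this sketch into Bayesian optimisation proper.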
Paperid:3123
Authors:Steve Hanneke · Amirreza Shaeiri
Abstract:
List learning is an important topic in both theoretical and empirical machine learning research, playing a key role in the recent breakthrough result of (Brukhim et al., 2022) on the characterization of multiclass PAC learnability, as well as addressing label ambiguity in computer vision classification tasks, among others. In this paper, we study the problem of list transductive online learning. In this framework, the learner outputs a list of multiple labels for each instance rather than just one, as in traditional multiclass classification. In the realizable setting, we demonstrate a trichotomy of possible rates of the minimax number of mistakes. In particular, if the learner plays for $\text{T} \in \mathbb{N}$ rounds, its minimax number of mistakes can only be of the orders $\Theta(\text{T})$, $\Theta(\log \text{T})$, or $\Theta(1)$. This resolves an open question raised by (Hanneke et al., 2024). On the other hand, in the agnostic setting, we characterize the learnability by constructively proving the $\widetilde{\mathcal{O}}(\sqrt{\text{T}})$ upper bound on the minimax expected regret. Along the way, we also answer another open question asked by (Moran et al., 2023). To establish these results, we introduce two new combinatorial complexity dimensions, called the Level-constrained $(\mathrm{L+1})$-Littlestone dimension and the Level-constrained $(\mathrm{L+1})$-Branching dimension, where $\mathrm{L} \in \mathbb{N}$ is the list size. Eventually, we conclude our work by raising an open question regarding eliminating the list size factor, which seems to be a crucial step, as it has consistently appeared in previous works on this subject.
Paperid:3124
Authors:Shuai Yi · Yixiong Zou · Yuhua Li · Ruixuan Li
Abstract:
Cross-domain few-shot learning (CDFSL) aims to transfer knowledge from a data-sufficient source domain to data-scarce target domains. Although the Vision Transformer (ViT) has shown superior capability in many vision tasks, its transferability against huge domain gaps in CDFSL is still under-explored. In this paper, we find an intriguing phenomenon: during source-domain training, prompt tuning, a common way to train ViT, can be harmful for the generalization of ViT in target domains, while setting the prompts to random noises (i.e., random registers) consistently improves target-domain performance. We then delve into this phenomenon for an interpretation. We find that learnable prompts capture domain information during training on the source dataset, which views irrelevant visual patterns as vital cues for recognition. This can be viewed as a kind of overfitting and increases the sharpness of the loss landscape. In contrast, random registers are essentially a novel way of perturbing attention for sharpness-aware minimization, which helps the model find a flattened minimum in the loss landscape, increasing transferability. Based on this phenomenon and interpretation, we further propose a simple but effective approach for CDFSL that enhances the perturbation on attention maps by adding random registers on semantic regions of image tokens, improving the effectiveness and efficiency of random registers. Extensive experiments on four benchmarks validate our rationale and state-of-the-art performance. Our code will be released.
Paperid:3125
Authors:Tony Shen · Seonghwan Seo · Ross Irwin · Kieran Didi · Simon Olsson · Woo Youn Kim · Martin Ester
Abstract:
Many generative applications, such as synthesis-based 3D molecular design, involve constructing compositional objects with continuous features. Here, we introduce Compositional Generative Flows (CGFlow), a novel framework that extends flow matching to generate objects in compositional steps while modeling continuous states. Our key insight is that modeling compositional state transitions can be formulated as a straightforward extension of the flow matching interpolation process. We further build upon the theoretical foundations of generative flow networks (GFlowNets), enabling reward-guided sampling of compositional structures. We apply CGFlow to synthesizable drug design by jointly designing the molecule's synthetic pathway with its 3D binding pose. Our approach achieves state-of-the-art binding affinity among synthesis-based baselines on all 15 targets from the LIT-PCBA benchmark. Evaluation on PoseCheck reveals that molecules designed by our method exhibit a greater number of key protein-ligand interactions, demonstrating the value of 3D molecule and synthetic pathway co-design.
Paperid:3126
Authors:Emilia Magnani · Marvin Pförtner · Tobias Weber · Philipp Hennig
Abstract:
Neural operators generalize neural networks to learn mappings between function spaces from data. They are commonly used to learn solution operators of parametric partial differential equations (PDEs), or propagators of time-dependent PDEs. However, to make them useful in high-stakes simulation scenarios, their inherent predictive error must be quantified reliably. We introduce LUNO, a novel framework for approximate Bayesian uncertainty quantification in trained neural operators. Our approach leverages model linearization to push (Gaussian) weight-space uncertainty forward to the neural operator's predictions. We show that this can be interpreted as a probabilistic version of the concept of currying from functional programming, yielding a function-valued (Gaussian) random process belief. Our framework provides a practical yet theoretically sound way to apply existing Bayesian deep learning methods such as the linearized Laplace approximation or SWAG to neural operators. Just like the underlying neural operator, our approach is resolution-agnostic by design. The method adds minimal prediction overhead, can be applied post-hoc without retraining the network, and scales to large models and datasets. We evaluate these aspects in a case study on Fourier neural operators.
Paperid:3127
Authors:Harshvardhan Agarwal · Sunita Sarawagi
Abstract:
Large language models (LLMs) have demonstrated the capability to perform in-context learning (ICL) for completely unseen tasks in classification or language completion. Sequence-to-sequence (seq2seq) is another popular task category with several applications seeking quick adaptation with ICL. We present a systematic analysis of the ICL capability of LLMs on seq2seq tasks using a formal structured language pair. Our study reveals a critical limitation: except for very short input sequences, ICL fails to achieve consistent learning across all output positions. This exposes a fundamental weakness of modern LLMs: their inability to effectively uncover the alignment between input and output sequences. Consequently, this limitation results in incomplete induction heads, which are essential for in-context learning of new discrete mappings. To address these limitations, we propose ICA-Tune, a method for focused fine-tuning of an LLM using in-context examples. We present a mechanistic evaluation with two accuracy probes to show how input-output alignment emerges in middle layers of an LLM without direct supervision. This alignment leads to an abrupt jump in the completeness of the induction heads in higher layers. We show that, compared to standard fine-tuning, ICA-Tune enables more sample-efficient learning and better generalization to OOD instances.
Paperid:3128
Authors:RUNZHAO YANG · Xiaolong Wu · Zhihong Zhang · Fabian Zhang · Tingxiong Xiao · Zongren Li · Kunlun He · Jinli Suo
Abstract:
Recent advancements in computer vision have seen Implicit Neural Representations (INR) becoming a dominant representation form for data due to their compactness and expressive power. To solve various vision tasks with INR data, vision networks can either be purely INR-based, but are thereby limited by simplistic operations and performance constraints, or include raster-based methods, which then tend to lose crucial semantic information of the INR during the conversion process. To address these issues, we propose DVI, a novel Derivative-based Vision network for INR, capable of handling a variety of vision tasks across various data modalities, while achieving the best performance among existing methods by incorporating state-of-the-art raster-based methods into an INR-based architecture. DVI excels by extracting semantic information from the high-order derivative map of the INR, then seamlessly fusing it into a pre-existing raster-based vision network, enhancing its performance with deeper, task-relevant semantic insights. Extensive experiments on five vision tasks across three data modalities demonstrate DVI's superiority over existing methods. Additionally, our study encompasses comprehensive ablation studies to affirm the efficacy of each element of DVI, the influence of different derivative computation techniques, and the impact of derivative orders. Reproducible code is provided in the supplementary materials.
Paperid:3129
Authors:Patrick Leask · Neel Nanda · Noura Al Moubayed
Abstract:
Sparse autoencoders (SAEs) are a popular method for decomposing Large Language Model (LLM) activations into interpretable features. However, due to their substantial training cost, most academic research uses open-source SAEs, which are only available for a restricted set of models of up to 27B parameters. We introduce Inference-Time Decomposition of Activations (ITDA) models, which can be trained in just 0.1-1\% of the time required for SAEs, using only 0.1-1\% of the data. Despite this, ITDAs achieve at least 90\% of the reconstruction performance of SAEs and deliver comparable results on interpretability benchmarks. Unlike SAEs, ITDAs also enable cross-model comparisons. Notably, our ITDA-based representation similarity index outperforms both SVCCA and CKA in an LLM layer matching task.
Paperid:3130
Authors:Ermis Soumalias · Jakob Heiss · Jakob Weissteiner · Sven Seuken
Abstract:
We study the design of iterative combinatorial auctions (ICAs). The main challenge in this domain is that the bundle space grows exponentially in the number of items. To address this, recent work has proposed machine learning (ML)-based preference elicitation algorithms that aim to elicit only the most critical information from bidders to maximize efficiency. However, while the SOTA ML-based algorithms elicit bidders' preferences via value queries, ICAs that are used in practice elicit information via demand queries. In this paper, we introduce a novel ML algorithm that provably makes use of the full information from both value and demand queries, and we show via experiments that combining both query types results in significantly better learning performance in practice. Building on these insights, we present MLHCA, a new ML-powered auction that uses value and demand queries. MLHCA substantially outperforms the previous SOTA, reducing efficiency loss by up to a factor of 10, with up to 58\% fewer queries. Thus, MLHCA achieves large efficiency improvements while also reducing bidders' cognitive load, establishing a new benchmark for both practicability and efficiency.
Paperid:3131
Authors:Mohammad Mozaffari · Amir Yazdanbakhsh · Maryam Mehri Dehnavi
Abstract:
Conventional model compression techniques for LLMs address high memory consumption and slow inference challenges but typically require computationally expensive retraining to preserve accuracy. In contrast, one-shot compression methods eliminate retraining cost, but struggle to achieve accuracy comparable to dense models. This paper presents SLIM, a new one-shot compression framework that holistically integrates hardware-friendly quantization, sparsity, and low-rank approximation into a unified process. First, we formulate the quantization process using a probabilistic approach (SLIM-Quant) that enables us to apply uniform quantization. Then, we use an existing one-shot pruning method to apply semi-structured sparsity on top of the quantized weights. Finally, to compensate for the aggregated quantization and sparsity error, we use a novel saliency function with unique invertible and additive features that enables us to mathematically compute the value of low-rank adapters. SLIM improves model accuracy by up to 5.66% (LLaMA-2-7B) for 2:4 sparsity with 4-bit weight quantization, outperforming prior methods. Models compressed with SLIM achieve up to 3.78× and 3.75× speedups on Nvidia RTX 3060 and A100 GPUs, respectively. We also propose an optional PEFT recipe that further improves accuracy by up to 1.66% (LLaMA-2-13B) compared to SLIM without fine-tuning.
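The "low-rank adapter absorbs the compression error" step has a simple generic form: quantize the weights, then take a truncated SVD of the residual. The sketch below shows that generic construction; SLIM's saliency-weighted, invertible variant is more refined than this:

```python
import numpy as np

def quantize(w, bits=4):
    """Per-tensor symmetric uniform quantization (an assumption)."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def low_rank_compensation(W, bits=4, rank=8):
    """Approximate W by quantized weights plus a low-rank correction L @ R."""
    Q = quantize(W, bits)
    # best rank-`rank` approximation of the compression error, via SVD
    U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
    L = U[:, :rank] @ np.diag(S[:rank])  # adapter factor A
    R = Vt[:rank]                        # adapter factor B
    return Q, L, R

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 128))
Q, L, R = low_rank_compensation(W)
err_plain = np.linalg.norm(W - Q)
err_comp = np.linalg.norm(W - (Q + L @ R))
print(err_comp < err_plain)  # the adapter recovers part of the error
```

At inference the adapter is just an extra thin matmul, which is why this style of compensation keeps the hardware-friendly 4-bit weight path intact.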
Paperid:3132
Authors:Mattia Opper · Siddharth N
Abstract:
We present Banyan, a model that efficiently learns semantic representations by leveraging explicit hierarchical structure. While transformers excel at scale, they struggle in low-resource settings. Conversely, recent structured models have shown promise as efficient learners, but lack performance. Banyan bridges this gap with two key innovations: an entangled hierarchical tree structure and diagonalized message passing, enabling it to outperform larger transformer models with just 14 non-embedding parameters. It excels in low-resource settings, offering a viable alternative for under-represented languages and highlighting its potential for efficient, interpretable NLP in resource-constrained environments.
Paperid:3133
Authors:Kyle Richardson · Vivek Srikumar · Ashish Sabharwal
Abstract:
Recent direct preference alignment algorithms (DPA), such as DPO, have shown great promise in aligning large language models to human preferences. While this has motivated the development of many new variants of the original DPO loss, understanding the differences between these recent proposals, as well as developing new DPA loss functions, remains difficult given the lack of a technical and conceptual framework for reasoning about the underlying semantics of these algorithms. In this paper, we attempt to remedy this by formalizing DPA losses in terms of discrete reasoning problems. Specifically, we ask: Given an existing DPA loss, can we systematically derive a symbolic program that characterizes its semantics? We propose a novel formalism for characterizing preference losses for single model and reference model based approaches, and identify symbolic forms for a number of commonly used DPA variants. Further, we show how this formal view of preference learning sheds new light on both the size and structure of the DPA loss landscape, making it possible to not only rigorously characterize the relationships between recent loss proposals but also to systematically explore the landscape and derive new loss functions from first principles. We hope our framework and findings will help provide useful guidance to those working on human AI alignment.
Paperid:3134
Authors:Zixiang Ai · Zichen Liu · Yuanhang Lei · Zhenyu Cui · Xu Zou · Jiahuan Zhou
Abstract:
Pretrained 3D vision models have gained significant attention for their promising performance on point cloud data. However, fully fine-tuning these models for downstream tasks is computationally expensive and storage-intensive. Existing parameter-efficient fine-tuning (PEFT) approaches, which focus primarily on input token prompting, struggle to achieve competitive performance due to their limited ability to capture the geometric information inherent in point clouds. To address this challenge, we propose a novel Geometry-Aware Point Cloud Prompt (GAPrompt) that leverages geometric cues to enhance the adaptability of 3D vision models. First, we introduce a Point Prompt that serves as an auxiliary input alongside the original point cloud, explicitly guiding the model to capture fine-grained geometric details. Additionally, we present a Point Shift Prompter designed to extract global shape information from the point cloud, enabling instance-specific geometric adjustments at the input level. Moreover, our proposed Prompt Propagation mechanism incorporates the shape information into the model's feature extraction process, further strengthening its ability to capture essential geometric characteristics. Extensive experiments demonstrate that GAPrompt significantly outperforms state-of-the-art PEFT methods and achieves competitive results compared to full fine-tuning on various benchmarks, while utilizing only 2.19\% of trainable parameters.
Paperid:3135
Authors:Qifang Zhao · Weidong Ren · Tianyu Li · Hong Liu · Xingsheng He · Xiaoxiao Xu
Abstract:
We introduce GraphGPT, a novel self-supervised generative pre-trained model for graph learning based on the Graph Eulerian Transformer (GET). First, we propose GET, which combines a standard transformer encoder or decoder architecture with an innovative graph-to-sequence transformation method. This method converts graphs or sampled subgraphs into sequences of tokens representing nodes, edges, and attributes in a reversible manner using Eulerian paths. We pre-train GET using either of two self-supervised tasks: next-token prediction (NTP) and scheduled masked-token prediction (SMTP). The pre-trained model is then fine-tuned for downstream tasks such as graph-, edge-, and node-level prediction. Despite its simplicity, GraphGPT achieves performance comparable to or surpassing state-of-the-art methods on multiple large-scale Open Graph Benchmark (OGB) datasets. It demonstrates exceptional results on the molecular property prediction dataset PCQM4Mv2 and the protein-protein interaction dataset ogbl-ppa. Notably, generative pre-training enables scaling GraphGPT to 2 billion parameters while maintaining performance gains, a breakthrough that overcomes the scalability limitations of traditional Graph Neural Networks (GNNs) and prior graph transformers (GTs). To advance research in graph foundation models and facilitate scientific discovery in chemistry, materials science, and related fields, we will release the source code and pre-trained checkpoints.
Paperid:3136
Authors:Rajiv Movva · Kenny Peng · Nikhil Garg · Jon Kleinberg · Emma Pierson
Abstract:
We describe HypotheSAEs, a general method to hypothesize interpretable relationships between text data (e.g., headlines) and a target variable (e.g., clicks). HypotheSAEs has three steps: (1) train a sparse autoencoder on text embeddings to produce interpretable features describing the data distribution, (2) select features that predict the target variable, and (3) generate a natural language interpretation of each feature (e.g., "mentions being surprised or shocked") using an LLM. Each interpretation serves as a hypothesis about what predicts the target variable. Compared to baselines, our method better identifies reference hypotheses on synthetic datasets (at least +0.06 in F1) and produces more predictive hypotheses on real datasets (~twice as many significant findings), despite requiring 1-2 orders of magnitude less compute than recent LLM-based methods. HypotheSAEs also produces novel discoveries on two well-studied tasks: explaining partisan differences in Congressional speeches and identifying drivers of engagement with online headlines.
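Step (2) of the pipeline, selecting autoencoder features that predict the target, can be sketched with a simple correlation ranking. This is an illustrative stand-in, not the paper's selection procedure; the random activation matrix and the planted signal on feature 3 are assumptions made purely for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: sparse-autoencoder feature activations (n_texts x n_features)
# and a target variable (e.g., click counts). Feature 3 is constructed to
# drive the target; the remaining features are noise.
n, d = 200, 8
acts = rng.random((n, d))
target = 2.0 * acts[:, 3] + 0.1 * rng.standard_normal(n)

def top_predictive_features(acts, target, k=2):
    """Rank features by absolute Pearson correlation with the target."""
    a = (acts - acts.mean(0)) / acts.std(0)
    t = (target - target.mean()) / target.std()
    corr = a.T @ t / len(t)
    return list(np.argsort(-np.abs(corr))[:k])

print(top_predictive_features(acts, target))  # feature 3 ranks first
```

In the real method each selected feature index would then be passed to step (3), where an LLM names the pattern the feature fires on.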
Paperid:3137
Authors:Adrien Cortes · Remi Rehm · Victor Letzelter
Abstract:
We introduce $\texttt{TimeMCL}$, a method leveraging the Multiple Choice Learning (MCL) paradigm to forecast multiple plausible time series futures. Our approach employs a neural network with multiple heads and utilizes the Winner-Takes-All (WTA) loss to promote diversity among predictions. MCL has recently gained attention due to its simplicity and ability to address ill-posed and ambiguous tasks. We propose an adaptation of this framework for time-series forecasting, presenting it as an efficient method to predict diverse futures, which we relate to its implicit *quantization* objective. We provide insights into our approach using synthetic data and evaluate it on real-world time series, demonstrating its promising performance at a light computational cost.
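The Winner-Takes-All loss mentioned here has a very compact core: each head produces a forecast, but only the head closest to the realized future incurs (and would backpropagate) loss, which pushes the heads apart into distinct hypotheses. A minimal NumPy sketch under assumed toy data (not the paper's model or training loop):

```python
import numpy as np

def wta_loss(preds, target):
    """Winner-Takes-All loss for a multi-head forecaster: only the head
    closest to the target contributes loss for this sample."""
    errs = ((preds - target) ** 2).mean(axis=-1)  # per-head MSE
    winner = int(np.argmin(errs))
    return errs[winner], winner

# Three hypothesis heads forecasting a length-4 horizon.
target = np.array([1.0, 2.0, 3.0, 4.0])
preds = np.array([
    [0.0, 0.0, 0.0, 0.0],  # head 0: far off
    [1.1, 2.1, 2.9, 4.0],  # head 1: close
    [4.0, 3.0, 2.0, 1.0],  # head 2: reversed
])
loss, winner = wta_loss(preds, target)
print(winner)  # head 1 wins
```

Because the losing heads receive no gradient for this sample, different heads specialize on different plausible futures, which is the implicit quantization behavior the abstract refers to.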
Paperid:3138
Authors:Jack Cai · Ammar Vora · Randolph Zhang · Mark O'Connor · Mark Jeffrey
Abstract:
As Large Language Models (LLMs) grow in size and supported context length, efficient inference strategies are essential to maintain low-latency token generation. Unfortunately, the traditional approaches of tensor and data parallelism face diminishing returns when scaling across multiple devices. We propose a novel form of parallelism---attention-level speculative parallelism (ALSpec)---that predicts self-attention outputs to execute subsequent operations early on separate devices. Our approach overlaps attention and non-attention computations, reducing the attention latency overhead at 128K context length by up to 5x and improving end-to-end decode latency by up to 1.65x, all without sacrificing quality. We establish the fundamental pillars for speculative execution and provide an execution paradigm that simplifies implementation. We show that existing attention-approximation methods perform well on simple information retrieval tasks, but they fail in advanced reasoning and math. Combined with speculative execution, we can approximate up to 90% of self-attention without harming model correctness. Demonstrated on Tenstorrent's NPU devices, we scale up LLM inference beyond current techniques, paving the way for faster inference in transformer-based models.
Paperid:3139
Authors:Semyon Savkin · Eitan Porat · Or Ordentlich · Yury Polyanskiy
Abstract:
Post-training quantization (PTQ) has emerged as a critical technique for efficient deployment of large language models (LLMs). This work proposes NestQuant, a novel PTQ scheme for weights and activations that is based on self-similar nested lattices. Ordentlich and Polyanskiy have mathematically shown such quantizers to be information-theoretically optimal for low-precision matrix multiplication. We implement a practical low-complexity version of NestQuant based on the Gosset lattice, making it a drop-in quantizer for any matrix multiplication step (e.g., in self-attention, MLPs, etc.). For example, NestQuant quantizes weights, KV-cache, and activations of Llama-3-8B to 4 bits, achieving perplexity of 6.6 on wikitext2. This represents more than a 55% reduction in the perplexity gap with respect to the unquantized model (perplexity of 6.14) compared to the state-of-the-art method SpinQuant (perplexity of 7.3). Comparisons on various LLM evaluation benchmarks also show a reduction in performance degradation induced by quantization.
Paperid:3140
Authors:Megan Tjandrasuwita · Chanakya Ekbote · Liu Ziyin · Paul Pu Liang
Abstract:
Multimodal representation learning is fundamentally about transforming incomparable modalities into comparable representations. While prior research has primarily focused on explicitly aligning these representations through targeted learning objectives and model architectures, a recent line of work has found that independently trained unimodal models of increasing scale and performance can become implicitly aligned with each other. These findings raise fundamental questions regarding the emergence of aligned representations in multimodal learning. Specifically: (1) when and why does alignment emerge implicitly? and (2) is alignment a reliable indicator of performance? Through a comprehensive empirical investigation, we demonstrate that both the emergence of alignment and its relationship with task performance depend on several critical data characteristics. These include, but are not necessarily limited to, the degree of similarity between the modalities and the balance between redundant and unique information they provide for the task. Our findings suggest that alignment may not be universally beneficial; rather, its impact on performance varies depending on the dataset and task. These insights can help practitioners determine whether increasing alignment between modalities is advantageous or, in some cases, detrimental to achieving optimal performance.
Paperid:3141
Authors:Gefan Yang · Frank van der Meulen · Stefan Sommer
Abstract:
We propose a novel method for simulating conditioned diffusion processes (diffusion bridges) in Euclidean spaces. By training a neural network to approximate bridge dynamics, our approach eliminates the need for computationally intensive Markov Chain Monte Carlo (MCMC) methods or reverse-process modeling. Compared to existing methods, it offers greater robustness across various diffusion specifications and conditioning scenarios. This applies in particular to rare events and multimodal distributions, which pose challenges for score-learning- and MCMC-based approaches. We propose a flexible variational family for approximating the diffusion bridge path measure which is partially specified by a neural network. Once trained, it enables efficient independent sampling at a cost comparable to sampling the unconditioned (forward) process.
Paperid:3142
Authors:Junyan Liu · ARNAB MAITI · Artin Tajdini · Kevin Jamieson · Lillian Ratliff
Abstract:
We initiate the study of a repeated principal-agent problem over a finite horizon $T$, where a principal sequentially interacts with $K\geq 2$ types of agents arriving in an *adversarial* order. At each round, the principal strategically chooses one of the $N$ arms to incentivize for an arriving agent of *unknown type*. The agent then chooses an arm based on its own utility and the provided incentive, and the principal receives a corresponding reward. The objective is to minimize regret against the best incentive in hindsight. Without prior knowledge of agent behavior, we show that the problem becomes intractable, leading to linear regret. We analyze two key settings where sublinear regret is achievable. In the first setting, the principal knows the arm each agent type would select greedily for any given incentive. Under this setting, we propose an algorithm that achieves a regret bound of $\mathcal{O}(\min(\sqrt{KT\log N},K\sqrt{T}))$ and provide a matching lower bound up to a $\log K$ factor. In the second setting, an agent's response varies smoothly with the incentive and is governed by a Lipschitz constant $L$. Under this setting, we show that there is an algorithm with a regret bound of $\tilde{\mathcal{O}}((LN)^{1/3}T^{2/3})$ and establish a matching lower bound up to logarithmic factors. Finally, we extend our algorithmic results for both settings by allowing the principal to incentivize multiple arms simultaneously in each round.
Paperid:3143
Authors:Xiao Huang · Xu Liu · Enze Zhang · Tong Yu · Shuai Li
Abstract:
Offline-to-online Reinforcement Learning (O2O RL) aims to perform online fine-tuning on an offline pre-trained policy to minimize costly online interactions. Existing work has used offline datasets to generate data that conform to the online data distribution for data augmentation. However, generated data still exhibit a gap with the online data, limiting overall performance. To address this, we propose a new data augmentation approach, Classifier-Free Diffusion Generation (CFDG). Without introducing additional classifier training overhead, CFDG leverages classifier-free guidance diffusion to significantly enhance the generation quality of offline and online data with different distributions. Additionally, it employs a reweighting method to enable more generated data to align with the online data, enhancing performance while maintaining the agent's stability. Experimental results show that CFDG outperforms replaying the two data types or using a standard diffusion model to generate new data. Our method is versatile and can be integrated with existing offline-to-online RL algorithms. By implementing CFDG on the popular methods IQL, PEX, and APL, we achieve a notable 15% average improvement in empirical performance on D4RL benchmarks such as MuJoCo and AntMaze.
Paperid:3144
Authors:Junjie Yao · zhongwang zhang · Zhi-Qin John Xu
Abstract:
Transformer-based Large Language Models (LLMs) have revolutionized Natural Language Processing by demonstrating exceptional performance across diverse tasks. This study investigates the impact of the parameter initialization scale on the training behavior and task preferences of LLMs. We discover that smaller initialization scales encourage models to favor reasoning tasks, whereas larger initialization scales lead to a preference for memorization tasks. We validate this reasoning bias via real datasets and meticulously designed anchor functions. Further analysis of initial training dynamics suggests that specific model components, particularly the embedding space and self-attention mechanisms, play pivotal roles in shaping these learning biases. We provide a theoretical framework from the perspective of model training dynamics to explain these phenomena. Additionally, experiments on real-world language tasks corroborate our theoretical insights. This work enhances our understanding of how initialization strategies influence LLM performance on reasoning tasks and offers valuable guidelines for training models.
Paperid:3145
Authors:Bonan Li · Yinhan Hu · Songhua Liu · Xinchao Wang
Abstract:
Layout-to-Image generation aims to create complex scenes with precise control over the placement and arrangement of subjects. Existing works have demonstrated that pre-trained Text-to-Image diffusion models can achieve this goal without training on any specific data; however, they often face challenges with imprecise localization and unrealistic artifacts. Focusing on these drawbacks, we propose a novel training-free method, WinWinLay. At its core, WinWinLay presents two key strategies, Non-local Attention Energy Function and Adaptive Update, that collaboratively enhance control precision and realism. On one hand, we theoretically demonstrate that the commonly used attention energy function introduces inherent spatial distribution biases, hindering objects from being uniformly aligned with layout instructions. To overcome this issue, a non-local attention prior is explored to redistribute attention scores, facilitating objects to better conform to the specified spatial conditions. On the other hand, we identify that the vanilla backpropagation update rule can cause deviations from the pre-trained domain, leading to out-of-distribution artifacts. We accordingly introduce a Langevin dynamics-based adaptive update scheme as a remedy that promotes in-domain updating while respecting layout constraints. Extensive experiments demonstrate that WinWinLay excels in controlling element placement and achieving photorealistic visual fidelity, outperforming the current state-of-the-art methods. Codes will be released.
Paperid:3146
Authors:Runtian Zhai · Kai Yang · Burak VARICI · Che-Ping Tsai · Zico Kolter · Pradeep Ravikumar
Abstract:
Despite the empirical success of foundation models, we do not have a systematic characterization of the representations that these models learn. We show that a large class of representation learning methods are characterized by the relationship between the inputs and a context variable. Specifically, we show that many popular representation learning methods aim to extract the top-$d$ singular functions of an operator induced by the context, in which case we say that the representation learns the contexture. We demonstrate the generality of our framework by proving that representation learning within a number of learning paradigms --- supervised, self-supervised, and manifold learning --- can be characterized via contextures. One important consequence of our analytical framework is that once we can approximate the optimal contextures (the top singular functions of the contexture operator), scaling up the model size yields diminishing returns. Consequently, we posit that further improvement requires creating more useful contexts from data in a principled way, and such contexts can lead to better encoders. To this end, we study how to efficiently evaluate the usefulness of a context without knowing the downstream tasks. We propose a metric and show by experiments that it correlates well with the actual performance of the encoder on many real datasets.
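For a finite input space and context space, the "top singular functions of an operator induced by the context" can be made concrete: build the normalized joint-distribution matrix $P(x,c)/\sqrt{p(x)p(c)}$ and take its SVD. The sketch below uses a random toy joint distribution (an assumption for the demo, not data from the paper); the leading singular value of this operator is exactly 1, corresponding to the constant function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution over inputs x (rows) and contexts c (columns).
P = rng.random((6, 5))
P /= P.sum()
px, pc = P.sum(1), P.sum(0)   # marginals p(x), p(c)

# Matrix form of the context-induced operator: P(x,c) / sqrt(p(x) p(c)).
M = P / np.sqrt(np.outer(px, pc))
U, s, Vt = np.linalg.svd(M)

d = 3
# Top-d singular functions of x (dividing out sqrt(p(x)) turns singular
# vectors back into functions); these form the learned representation.
encoder = U[:, :d] / np.sqrt(px)[:, None]
print(np.round(s[:d], 3))  # s[0] == 1.0: the constant function
```

In this picture, "learning the contexture" means recovering the span of the columns of `encoder`; once that span is captured, a larger model adds nothing, which is the diminishing-returns claim in the abstract.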
Paperid:3147
Authors:Tomohiro Shiraishi · Tatsuya Matsukawa · Shuichi Nishino · Ichiro Takeuchi
Abstract:
A data analysis pipeline is a structured sequence of steps that transforms raw data into meaningful insights by integrating various analysis algorithms. In this paper, we propose a novel statistical test to assess the significance of data analysis pipelines. Our approach enables the systematic development of valid statistical tests applicable to any feature selection pipeline composed of predefined components. We develop this framework based on selective inference, a statistical technique that has recently gained attention for data-driven hypotheses. As a proof of concept, we focus on feature selection pipelines for linear models, composed of three missing value imputation algorithms, three outlier detection algorithms, and three feature selection algorithms. We theoretically prove that our statistical test can control the probability of false positive feature selection at any desired level, and demonstrate its validity and effectiveness through experiments on synthetic and real data. Additionally, we present an implementation framework that facilitates testing across any configuration of these feature selection pipelines without extra implementation costs.
Paperid:3148
Authors:Francis Bach · Saeed Saremi
Abstract:
Gaussian smoothing combined with a probabilistic framework for denoising via the empirical Bayes formalism, i.e., the Tweedie-Miyasawa formula (TMF), are the two key ingredients in the success of score-based generative models in Euclidean spaces. Smoothing holds the key for easing the problem of learning and sampling in high dimensions, denoising is needed for recovering the original signal, and TMF ties these together via the score function of noisy data. In this work, we extend this paradigm to the problem of learning and sampling the distribution of binary data on the Boolean hypercube by adopting Bernoulli noise, instead of Gaussian noise, as a smoothing device. We first derive a TMF-like expression for the optimal denoiser for the Hamming loss, where a score function naturally appears. Sampling noisy binary data is then achieved using a Langevin-like sampler which we theoretically analyze for different noise levels. At high Bernoulli noise levels sampling becomes easy, akin to log-concave sampling in Euclidean spaces. In addition, we extend the sequential multi-measurement sampling of Saremi et al. (2024) to the binary setting where we can bring the "effective noise" down by sampling multiple noisy measurements at a fixed noise level, without the need for continuous-time stochastic processes. We validate our formalism and theoretical findings by experiments on synthetic data and binarized images.
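The Bernoulli-smoothing-plus-Hamming-optimal-denoising mechanics can be shown in a few lines for the simplest possible prior. Assumptions here: bits are i.i.d. with a known prior $q = P(x{=}1)$, so Bayes' rule gives the per-bit posterior in closed form and the Hamming-optimal denoiser just thresholds it at 1/2. This degenerate i.i.d. case is a toy, not the paper's method; for coupled bits the posterior is what the learned score function supplies.

```python
import random

random.seed(0)
p, q, n = 0.3, 0.9, 2000  # flip probability, prior P(bit = 1), number of bits

x = [int(random.random() < q) for _ in range(n)]  # clean binary data
y = [b ^ (random.random() < p) for b in x]        # Bernoulli smoothing: flip w.p. p

def posterior_one(yi, p, q):
    """P(x=1 | y) by Bayes' rule for a bit sent through flip-prob-p noise."""
    like1 = (1 - p) if yi else p        # P(y | x = 1)
    like0 = p if yi else (1 - p)        # P(y | x = 0)
    return like1 * q / (like1 * q + like0 * (1 - q))

# Hamming-optimal denoiser: threshold the per-bit posterior at 1/2.
xhat = [int(posterior_one(yi, p, q) > 0.5) for yi in y]

acc_noisy = sum(a == b for a, b in zip(x, y)) / n
acc_denoised = sum(a == b for a, b in zip(x, xhat)) / n
print(f"noisy: {acc_noisy:.2f}  denoised: {acc_denoised:.2f}")
```

With this strong prior the posterior exceeds 1/2 for either observation, so the denoiser outputs all ones, yet that already lifts Hamming accuracy from roughly the channel's $1-p$ up to roughly $q$; the interesting cases, where bits are dependent, require the score function the abstract derives.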
Paperid:3149
Authors:Zeyu Jia · Alexander Rakhlin · Tengyang Xie
Abstract:
Process and outcome supervision represent two fundamental approaches to reinforcement learning, especially for complex reasoning tasks in large language models. While process supervision offers intuitive advantages for long-term credit assignment, the precise relationship between these paradigms has remained an open question. Conventional wisdom suggests that outcome supervision is fundamentally more challenging due to the trajectory-level coverage problem, leading to significant investment in collecting fine-grained process supervision data. In this paper, we provide a possible theoretical resolution to this debate. Perhaps surprisingly, our main theorem shows that: under standard data coverage assumptions, reinforcement learning through outcome supervision is no more statistically difficult than through process supervision. At the core of this result lies the novel Change of Trajectory Measure Lemma --- a powerful technical tool that bridges return-based trajectory measure and step-level distribution shift. Furthermore, for settings with access to a verifier or a rollout capability, we prove that any policy's advantage function can serve as an optimal process reward model, providing a simple yet powerful connection between outcome and process supervision. These findings suggest that the empirically observed performance gap between outcome and process supervision likely stems from algorithmic limitations rather than inherent statistical difficulties, potentially transforming how we approach data and algorithm design for reinforcement learning.
Paperid:3150
Authors:Yutong He · Pengrui Li · Yipeng Hu · Chuyan Chen · Kun Yuan
Abstract:
Subspace optimization algorithms, with GaLore \citep{zhao2024galore} as a representative method, have gained popularity for pre-training or fine-tuning large language models (LLMs) due to their memory efficiency. However, their convergence guarantees remain unclear, particularly in stochastic settings. In this paper, we unexpectedly discover that GaLore does not always converge to the optimal solution and substantiate this finding with an explicit counter-example. We then investigate the conditions under which GaLore can achieve convergence, demonstrating that it does so either in deterministic scenarios or when using a sufficiently large mini-batch size. More significantly, we introduce GoLore (Gradient random Low-rank projection), a novel variant of GaLore that provably converges in stochastic settings, even with standard batch sizes. Our convergence analysis can be readily extended to other sparse subspace optimization algorithms. Finally, we conduct numerical experiments to validate our theoretical results and empirically explore the proposed mechanisms.
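The one mechanical difference between the two methods is how the low-rank projector is chosen: GaLore takes the top-$r$ left singular vectors of the current gradient, while GoLore draws a random $r$-dimensional subspace. The sketch below shows only that projection step on a synthetic gradient (an assumption; this is not either paper's full optimizer). On a single noiseless gradient the SVD projector keeps more energy, but the abstract's point is that under stochastic noise SVD directions can chase the noise, whereas random projections stay unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 32, 4
G = rng.standard_normal((m, n))  # stand-in for a weight matrix's gradient

# GaLore-style projector: top-r left singular vectors of the gradient.
U, _, _ = np.linalg.svd(G, full_matrices=False)
P_svd = U[:, :r]

# GoLore-style projector: a random r-dimensional orthonormal subspace.
P_rand, _ = np.linalg.qr(rng.standard_normal((m, r)))

kept = {}
for name, P in [("svd", P_svd), ("random", P_rand)]:
    g_small = P.T @ G            # r x n: all the optimizer state stores
    G_back = P @ g_small         # lifted back to m x n for the update
    kept[name] = np.linalg.norm(G_back) ** 2 / np.linalg.norm(G) ** 2
print({k: round(v, 2) for k, v in kept.items()})
```

Either way, the optimizer's moment statistics live in the $r \times n$ matrix `g_small` rather than the full $m \times n$ gradient, which is the source of the memory savings.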
Paperid:3151
Authors:Se Jin Park · Julian Salazar · Aren Jansen · Keisuke Kinoshita · Yong Man Ro · RJ Skerry-Ryan
Abstract:
We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, current spoken language models struggle to generate plausible speech past tens of seconds, from high temporal resolution of speech tokens causing loss of coherence, to architectural issues with long-sequence training or extrapolation, to memory costs at inference time. With these considerations we propose SpeechSSM, the first speech language model to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates, based on recent advances in linear-time sequence modeling. Furthermore, to address growing challenges in spoken language evaluation, especially in this new long-form setting, we propose: new embedding-based and LLM-judged metrics; quality measurements over length and time; and a new benchmark for long-form speech processing and generation, LibriSpeech-Long. Speech samples and the dataset are included.
Paperid:3152
Authors:Weihuang Zheng · Jiashuo Liu · Jiaxing Li · Jiayun Wu · Peng Cui · Youyong Kong
Abstract:
Graph Neural Networks (GNNs) are widely used for node classification tasks but often fail to generalize when training and test nodes come from different distributions, limiting their practicality. To address this challenge, recent approaches have adopted invariant learning and sample reweighting techniques from the out-of-distribution (OOD) generalization field. However, invariant learning-based methods face difficulties when applied to graph data, as they rely on the impractical assumption of obtaining real environment labels and strict invariance, which may not hold in real-world graph structures. Moreover, current sample reweighting methods tend to overlook topological information, potentially leading to suboptimal results. In this work, we introduce the Topology-Aware Dynamic Reweighting (TAR) framework to address distribution shifts by leveraging the inherent graph structure. TAR dynamically adjusts sample weights through gradient flow on the graph edges during training. Instead of relying on strict invariance assumptions, we theoretically prove that our method is able to provide distributional robustness, thereby enhancing the out-of-distribution generalization performance on graph data. Our framework's superiority is demonstrated through standard testing on extensive node classification OOD datasets, exhibiting marked improvements over existing methods.
Paperid:3153
Authors:Yucheng Yang · Tianyi Zhou · Meng Fang · Mykola Pechenizkiy
Abstract:
Practical reinforcement learning (RL) usually requires agents to be optimized for multiple potentially conflicting criteria, e.g. speed vs. safety. Although Multi-Objective RL (MORL) algorithms have been studied in previous works, their trained agents often cover limited Pareto optimal solutions and they lack precise controllability of the delicate trade-off among multiple objectives. Hence, the resulting agent is not versatile in aligning with customized requests from different users. To bridge the gap, we develop the "Preference Controllable (PC) RL" framework, which trains a preference-conditioned meta-policy that takes user preference as input, controlling the generated trajectories within the preference region on the Pareto frontier. The PCRL framework is compatible with advanced Multi-Objective Optimization (MOO) algorithms that are rarely seen in previous MORL approaches. We also propose a novel preference-regularized MOO algorithm specifically for PCRL. We provide a comprehensive theoretical analysis to justify its convergence and preference controllability. We evaluate PCRL with different MOO algorithms against state-of-the-art MORL baselines in various challenging environments with up to six objectives. In these experiments, our proposed method exhibits significantly better controllability than existing approaches and can generate Pareto solutions with better diversity and utilities.
Paperid:3154
Authors:Yifan Hu · Guibin Zhang · Peiyuan Liu · Disen Lan · Naiqi Li · Dawei Cheng · Tao Dai · Shutao Xia · Shirui Pan
Abstract:
Time series forecasting methods generally fall into two main categories: Channel Independent (CI) and Channel Dependent (CD) strategies. While CI overlooks important covariate relationships, CD captures all dependencies without distinction, introducing noise and reducing generalization. Recent advances in Channel Clustering (CC) aim to refine dependency modeling by grouping channels with similar characteristics and applying tailored modeling techniques. However, coarse-grained clustering struggles to capture complex, time-varying interactions effectively. To address these challenges, we propose TimeFilter, a GNN-based framework for adaptive and fine-grained dependency modeling. After constructing the graph from the input sequence, TimeFilter refines the learned spatial-temporal dependencies by filtering out irrelevant correlations while preserving the most critical ones in a patch-specific manner. Extensive experiments on 13 real-world datasets from diverse application domains demonstrate the state-of-the-art performance of TimeFilter. The code is available at \url{https://anonymous.4open.science/r/TimeFilter-C08C}.
Paperid:3155
Authors:Max van Spengler · Pascal Mettes
Abstract:
Embedding tree-like data, from hierarchies to ontologies and taxonomies, forms a well-studied problem for representing knowledge across many domains. Hyperbolic geometry provides a natural solution for embedding trees, with vastly superior performance over Euclidean embeddings. Recent literature has shown that hyperbolic tree embeddings can even be placed on top of neural networks for hierarchical knowledge integration in deep learning settings. For all applications, a faithful embedding of trees is needed, with combinatorial constructions emerging as the most effective direction. This paper identifies and solves two key limitations of existing works. First, the combinatorial construction hinges on finding highly separated points on a hypersphere, a notoriously difficult problem. Current approaches achieve poor separation, degrading the quality of the corresponding hyperbolic embedding. We propose highly separated Delaunay tree embeddings (HS-DTE), which integrates angular separation in a generalized formulation of Delaunay embeddings, leading to lower embedding distortion. Second, low distortion requires additional precision. The current approach for increasing precision is to use multiple precision arithmetic, which renders the embeddings useless on GPUs in deep learning settings. We reformulate the combinatorial construction using floating point expansion arithmetic, leading to superior embedding quality while retaining utility on accelerated hardware.
Paperid:3156
Authors:Etienne Gauthier · Francis Bach · Michael Jordan
Abstract:
As platforms increasingly rely on learning algorithms, collectives may form and seek ways to influence these platforms to align with their own interests. This can be achieved by coordinated submission of altered data. To evaluate the potential impact of such behavior, it is essential to understand the computations that collectives must perform to impact platforms in this way. In particular, collectives need to make a priori assessments of the effect of the collective before taking action, as they may face potential risks when modifying their data. Moreover, they need to develop implementable coordination algorithms based on quantities that can be inferred from observed data. We develop a framework that provides a theoretical and algorithmic treatment of these issues and present experimental results in a product evaluation domain.
Paperid:3157
Authors:Wenhao Wang · Yifan Sun · Zongxin Yang · Zhentao Tan · Zhengdong Hu · Yi Yang
Abstract:
Text-guided image-to-image diffusion models excel in translating images based on textual prompts, allowing for precise and creative visual modifications. However, such a powerful technique can be misused for *spreading misinformation*, *infringing on copyrights*, and *evading content tracing*. This motivates us to introduce the task of origin **ID**entification for text-guided **I**mage-to-image **D**iffusion models (**ID$\mathbf{^2}$**), aiming to retrieve the original image of a given translated query. A straightforward solution to ID$^2$ involves training a specialized deep embedding model to extract and compare features from both query and reference images. However, due to *visual discrepancy* across generations produced by different diffusion models, this similarity-based approach fails when training on images from one model and testing on those from another, limiting its effectiveness in real-world applications. To solve this challenge of the proposed ID$^2$ task, we contribute the first dataset and a theoretically guaranteed method, both emphasizing generalizability. The curated dataset, **OriPID**, contains abundant **Ori**gins and guided **P**rompts, which can be used to train and test potential **ID**entification models across various diffusion models. In the method section, we first prove the *existence* of a linear transformation that minimizes the distance between the pre-trained Variational Autoencoder embeddings of generated samples and their origins. Subsequently, it is demonstrated that such a simple linear transformation can be *generalized* across different diffusion models. Experimental results show that the proposed method achieves satisfying generalization performance, significantly surpassing similarity-based methods (+31.6% mAP), even those with generalization designs.
Paperid:3158
Authors:Runquan Gui · Zhihai Wang · Jie Wang · Chi Ma · Huiling Zhen · Mingxuan Yuan · Jianye Hao · Defu Lian · Enhong Chen · Feng Wu
Abstract:
Recent advancements have significantly enhanced the performance of large language models (LLMs) in tackling complex reasoning tasks, achieving notable success in domains like mathematical and logical reasoning. However, these methods encounter challenges with complex planning tasks, primarily due to extended reasoning steps, diverse constraints, and the challenge of handling multiple distinct sub-tasks. To address these challenges, we propose HyperTree Planning (HTP), a novel reasoning paradigm that constructs hypertree-structured planning outlines for effective planning. The hypertree structure enables LLMs to engage in hierarchical thinking by flexibly employing the divide-and-conquer strategy, effectively breaking down intricate reasoning steps, accommodating diverse constraints, and managing multiple distinct sub-tasks in a well-organized manner. We further introduce an autonomous planning framework that completes the planning process by iteratively refining and expanding the hypertree-structured planning outlines. Experiments demonstrate the effectiveness of HTP, achieving state-of-the-art accuracy on the TravelPlanner benchmark with Gemini-1.5-Pro, resulting in a 3.6$\times$ performance improvement over o1-preview.
Paperid:3159
Authors:Xu Wang · Yan Hu · Wenyu Du · Reynold Cheng · Benyou Wang · Difan Zou
Abstract:
Fine-tuning significantly improves the performance of Large Language Models (LLMs), yet its underlying mechanisms remain poorly understood. This paper aims to provide an in-depth interpretation of the fine-tuning process through circuit analysis, a popular tool in Mechanistic Interpretability (MI). Unlike previous studies (Prakash et al. 2024, Chhabra et al. 2024) that focus on tasks where pre-trained models already perform well, we develop a set of mathematical tasks where fine-tuning yields substantial performance gains, bringing the setup closer to real-world scenarios. In our experiments, we identify circuits at various checkpoints during fine-tuning and examine the interplay between circuit analysis, fine-tuning methods, and task complexities. First, we find that while circuits maintain high node similarity before and after fine-tuning, their edges undergo significant changes, contrasting with previous work (Prakash et al. 2024, Chhabra et al. 2024) that reported only small circuit additions after fine-tuning. Based on these observations, we develop a circuit-aware Low-Rank Adaptation (LoRA) method that assigns ranks to layers according to edge changes in the circuits. Experimental results demonstrate that our circuit-based LoRA achieves an average improvement of 2.46% over standard LoRA with comparable parameter sizes. Furthermore, we explore how combining circuits from subtasks can enhance fine-tuning in compositional tasks, offering new insights into task design and deepening our understanding of circuit dynamics and fine-tuning mechanisms.
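The "assign ranks to layers according to edge changes" idea can be sketched as a simple budget allocation: layers whose circuit edges changed more during fine-tuning get a larger LoRA rank. The scores and proportional-rounding rule below are purely hypothetical illustrations, not the paper's actual scoring or allocation procedure.

```python
def allocate_ranks(edge_change, total_rank):
    """Distribute a LoRA rank budget across layers in proportion to how
    much each layer's circuit edges changed (toy heuristic; rounding can
    make the assigned ranks sum to slightly more or less than the budget)."""
    total = sum(edge_change.values())
    return {layer: max(1, round(total_rank * score / total))
            for layer, score in edge_change.items()}

# Hypothetical per-layer edge-change scores (not taken from the paper).
edge_change = {"layer0": 0.1, "layer1": 0.6, "layer2": 0.3}
print(allocate_ranks(edge_change, total_rank=16))
# → {'layer0': 2, 'layer1': 10, 'layer2': 5}
```

Each layer's LoRA adapter would then be instantiated with its assigned rank, concentrating trainable parameters where the circuit analysis says fine-tuning actually rewires the model.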
Paperid:3160
Authors:Biswadeep Chakraborty · Harshit Kumar · Saibal Mukhopadhyay
Abstract:
Graph Neural Networks (GNNs) face a critical limitation known as oversmoothing, where increasing network depth leads to homogenized node representations, severely compromising their expressiveness. We present a novel dynamical systems perspective on this challenge, revealing oversmoothing as an emergent property of GNNs' convergence to low-dimensional attractor states. Based on this insight, we introduce DYNAMO-GAT, which combines noise-driven covariance analysis with Anti-Hebbian learning to dynamically prune attention weights, effectively preserving distinct attractor states. We provide theoretical guarantees for DYNAMO-GAT's effectiveness and demonstrate its superior performance on benchmark datasets, consistently outperforming existing methods while requiring fewer computational resources. This work establishes a fundamental connection between dynamical systems theory and GNN behavior, providing both theoretical insights and practical solutions for deep graph learning.
Paperid:3161
Authors:Shuo Wang · Bokui Wang · Zhixiang Shen · Boyan Deng · zhao kang
Abstract:
Recent advances in CV and NLP have inspired researchers to develop general-purpose graph foundation models through pre-training across diverse domains. However, a fundamental challenge arises from the substantial differences in graph topologies across domains. Additionally, real-world graphs are often sparse and prone to noisy connections and adversarial attacks. To address these issues, we propose the Multi-Domain Graph Foundation Model (MDGFM), a unified framework that aligns and leverages cross-domain topological information to facilitate robust knowledge transfer. MDGFM bridges different domains by adaptively balancing features and topology while refining original graphs to eliminate noise and align topological structures. To further enhance knowledge transfer, we introduce an efficient prompt-tuning approach. By aligning topologies, MDGFM not only improves multi-domain pre-training but also enables robust knowledge transfer to unseen domains. Theoretical analyses provide guarantees of MDGFM's effectiveness and domain generalization capabilities. Extensive experiments on both homophilic and heterophilic graph datasets validate the robustness and efficacy of our method.
Paperid:3162
Authors:Arman Sharifi Kolarijani · Tolga Ok · Peyman Mohajerin Esfahani · Mohamad Amin Sharifi Kolarijani
Abstract:
In this paper, we provide a novel algorithm for solving planning and learning problems of Markov decision processes. The proposed algorithm follows a policy iteration-type update by using a rank-one approximation of the transition probability matrix in the policy evaluation step. This rank-one approximation is closely related to the stationary distribution of the corresponding transition probability matrix, which is approximated using the power method. We provide theoretical guarantees for the convergence of the proposed algorithm to the optimal (action-)value function with the same rate and computational complexity as the value iteration algorithm in the planning problem and as the Q-learning algorithm in the learning problem. Through our extensive numerical simulations, we show that the proposed algorithm consistently outperforms first-order algorithms and their accelerated versions for both planning and learning problems.
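The core idea lends itself to a toy sketch. Assuming (purely for illustration, not the paper's algorithm) that the transition matrix $P$ is replaced by the rank-one matrix $\mathbf{1}\mu^\top$ with $\mu$ its stationary distribution from the power method, the policy evaluation fixed point $V = r + \gamma \hat P V$ has a closed form:

```python
# Illustrative sketch: rank-one policy evaluation for a fixed policy.
# Approximate P by ones * mu^T, where mu is the stationary distribution of P
# found by the power method. Then V = r + gamma * P_hat @ V solves in closed
# form: V_i = r_i + gamma/(1-gamma) * (mu . r). Toy values throughout.

def power_method_stationary(P, iters=200):
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]  # mu <- mu P
        s = sum(mu)
        mu = [m / s for m in mu]
    return mu

def rank_one_policy_evaluation(P, r, gamma):
    mu = power_method_stationary(P)
    avg_r = sum(m * ri for m, ri in zip(mu, r))
    # with P_hat = ones * mu^T, the Bellman fixed point is available directly
    return [ri + gamma / (1.0 - gamma) * avg_r for ri in r]

P = [[0.9, 0.1], [0.2, 0.8]]  # toy 2-state transition matrix for a fixed policy
r = [1.0, 0.0]
V = rank_one_policy_evaluation(P, r, gamma=0.9)
```

For this chain the stationary distribution is $(2/3, 1/3)$, so the sketch returns $V = (7, 6)$; the per-iteration cost is dominated by the power-method matrix-vector products rather than a full linear solve.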
Paperid:3163
Authors:Mingde Zhao · Tristan Sylvain · Romain Laroche · Doina Precup · Yoshua Bengio
Abstract:
Generative models can be used in planning to propose targets corresponding to states or observations that agents deem either likely or advantageous to experience. However, agents can struggle with hallucinated, infeasible targets proposed by the models, leading to delusional planning behaviors, which raises safety concerns. Drawing inspiration from the human brain, we propose to reject these hallucinated targets with an add-on target evaluator. Without proper training, however, the evaluator can produce delusional estimates, rendering it futile. We propose to address this via a combination of learning rule, architecture, and two novel hindsight relabeling strategies, which leads to correct evaluations of infeasible targets. Our experiments confirm that our approach significantly reduces delusional behaviors and enhances the performance of planning agents.
Paperid:3164
Authors:Jiachang Liu · Soroosh Shafiee · Andrea Lodi
Abstract:
This paper investigates the problem of certifying optimality for sparse generalized linear models (GLMs), where sparsity is enforced through an $\ell_0$ cardinality constraint. While branch-and-bound (BnB) frameworks can certify optimality by pruning nodes using dual bounds, existing methods for computing these bounds are either computationally intensive or exhibit slow convergence, limiting their scalability to large-scale problems. To address this challenge, we propose a first-order proximal gradient algorithm designed to solve the perspective relaxation of the problem within a BnB framework. Specifically, we formulate the relaxed problem as a composite optimization problem and demonstrate that the proximal operator of the non-smooth component can be computed in log-linear time complexity, eliminating the need to solve a computationally expensive second-order cone program. Furthermore, we introduce a simple restart strategy that enhances convergence speed while maintaining low per-iteration complexity. Extensive experiments on synthetic and real-world datasets show that our approach significantly accelerates dual bound computations and is highly effective in providing optimality certificates for large-scale problems.
Paperid:3165
Authors:Koen Minartz · Tim d'Hondt · Leon Hillmann · Jörn Starruß · Lutz Brusch · Vlado Menkovski
Abstract:
The cellular Potts model (CPM) is a powerful computational method for simulating collective spatiotemporal dynamics of biological cells. To drive the dynamics, CPMs rely on physics-inspired Hamiltonians. However, as first principles remain elusive in biology, these Hamiltonians only approximate the full complexity of real multicellular systems. To address this limitation, we propose NeuralCPM, a more expressive cellular Potts model that can be trained directly on observational data. At the core of NeuralCPM lies the Neural Hamiltonian, a neural network architecture that respects universal symmetries in collective cellular dynamics. Moreover, this approach enables seamless integration of domain knowledge by combining known biological mechanisms and the expressive Neural Hamiltonian into a hybrid model. Our evaluation with synthetic and real-world multicellular systems demonstrates that NeuralCPM is able to model cellular dynamics that cannot be accounted for by traditional analytical Hamiltonians.
Paperid:3166
Authors:Sida Li · Nikolaos Ignatiadis
Abstract:
Prediction-Powered Inference (PPI) is a powerful framework for enhancing statistical estimates by combining limited gold-standard data with machine learning (ML) predictions. While prior work has demonstrated PPI’s benefits for individual statistical tasks, modern applications require answering numerous parallel statistical questions. We introduce Prediction-Powered Adaptive Shrinkage ($\texttt{PAS}$), a method that bridges PPI with empirical Bayes shrinkage to improve estimation of multiple means. $\texttt{PAS}$ debiases noisy ML predictions $\textit{within}$ each task and then borrows strength $\textit{across}$ tasks by using those same predictions as a reference point for shrinkage. The amount of shrinkage is determined by minimizing an unbiased estimate of risk, and we prove that this tuning strategy is asymptotically optimal. Experiments on both synthetic and real-world datasets show that $\texttt{PAS}$ adapts to the reliability of the ML predictions and outperforms traditional and modern baselines in large-scale applications.
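The shrink-toward-predictions idea with risk-based tuning can be illustrated by a simplified Gaussian sketch. This assumes known, equal variances and uses a plain SURE-tuned linear shrinker; the paper's actual estimator and tuning are more general:

```python
# Toy sketch (not the PAS estimator itself): shrink noisy per-task estimates y
# toward ML predictions f, with the shrinkage amount chosen by minimizing
# Stein's unbiased risk estimate. Assumes Gaussian noise with known common
# variance sigma2 -- a deliberate simplification.

def shrink_toward_predictions(y, f, sigma2):
    n = len(y)
    resid2 = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    # theta_c = f + c*(y - f); SURE(c) = (1-c)^2 ||y-f||^2 + 2 n sigma2 c - n sigma2
    # is minimized at c = 1 - n*sigma2 / ||y-f||^2, clipped to [0, 1]
    c = max(0.0, min(1.0, 1.0 - n * sigma2 / resid2))
    return [fi + c * (yi - fi) for yi, fi in zip(y, f)]

y = [1.0, 2.0, 3.0, 4.0]   # noisy gold-standard estimates, one per task
f = [0.0, 0.0, 0.0, 0.0]   # ML predictions used as the shrinkage target
theta = shrink_toward_predictions(y, f, sigma2=1.0)
```

When the predictions track the true means well, the residuals shrink, $c$ drops toward 0, and the estimates lean on the ML predictions; when the predictions are unreliable, $c$ approaches 1 and the gold-standard estimates dominate.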
Paperid:3167
Authors:Peter Eckmann · Dongxia Wu · Germano Heinzelmann · Michael Gilson · Rose Yu
Abstract:
Current generative models for drug discovery primarily use molecular docking as an oracle to guide the generation of active compounds. However, such models are often not useful in practice because even compounds with high docking scores do not consistently show experimental activity. More accurate methods for activity prediction exist, such as molecular dynamics-based binding free energy calculations, but they are too computationally expensive to use in a generative model. To address this challenge, we propose Multi-Fidelity Latent space Active Learning (MF-LAL), a generative modeling framework that integrates a set of oracles with varying cost-accuracy tradeoffs. We train a surrogate model for each oracle and use these surrogates to guide generation of compounds with high predicted activity. Unlike previous approaches that separately learn the surrogate model and generative model, MF-LAL combines the generative and multi-fidelity surrogate models into a single framework, allowing for more accurate activity prediction and higher quality samples. We train MF-LAL with a novel active learning algorithm to further reduce computational cost. Our experiments on two disease-relevant proteins show that MF-LAL produces compounds with significantly better binding free energy scores than other single and multi-fidelity approaches (~50% improvement in mean binding free energy score).
Paperid:3168
Authors:Kosuke Nakanishi · Akihiro Kubo · Yuji Yasui · Shin Ishii
Abstract:
Recently, robust reinforcement learning (RL) methods designed to handle adversarial input observations have received significant attention, motivated by RL's inherent vulnerabilities. While existing approaches have demonstrated reasonable success, addressing worst-case scenarios over long time horizons requires both minimizing the agent's cumulative rewards for adversaries and training agents to counteract them through alternating learning. However, this process introduces mutual dependencies between the agent and the adversary, making interactions with the environment inefficient and hindering the development of off-policy methods. In this work, we propose a novel off-policy method that eliminates the need for additional environmental interactions by reformulating adversarial learning as a soft-constrained optimization problem. Our approach is theoretically supported by the symmetric property of policy evaluation between the agent and adversary.
Paperid:3169
Authors:Zhoufan Zhu · Ke Zhu
Abstract:
For researchers and practitioners in finance, finding synergistic formulaic alphas is very important but challenging. In this paper, we reconsider the discovery of synergistic formulaic alphas from the viewpoint of sequential decision-making, and conceptualize the entire alpha discovery process as a non-stationary and reward-sparse Markov decision process. To overcome the challenges of non-stationarity and reward-sparsity, we propose the AlphaQCM method, a novel distributional reinforcement learning method designed to search for synergistic formulaic alphas efficiently. The AlphaQCM method first learns the Q function and quantiles via a Q network and a quantile network, respectively. Then, the AlphaQCM method applies the quantiled conditional moment method to learn unbiased variance from the potentially biased quantiles. Guided by the learned Q function and variance, the AlphaQCM method navigates the non-stationarity and reward-sparsity to explore the vast search space of formulaic alphas with high efficacy. Empirical applications to real-world datasets demonstrate that our AlphaQCM method significantly outperforms its competitors, particularly when dealing with large datasets comprising numerous stocks.
Paperid:3170
Authors:Anastasiia Koloskova · Youssef Allouah · Animesh Jha · Rachid Guerraoui · Sanmi Koyejo
Abstract:
We address the problem of machine unlearning, where the goal is to remove the influence of specific training data from a model upon request, motivated by privacy concerns and regulatory requirements such as the “right to be forgotten.” Unfortunately, existing methods rely on restrictive assumptions or lack formal guarantees. To this end, we propose a novel method for certified machine unlearning, leveraging the connection between unlearning and privacy amplification by stochastic post-processing. Our method uses noisy fine-tuning on the retain data, i.e., data that does not need to be removed, to ensure provable unlearning guarantees. This approach requires no assumptions about the underlying loss function, making it broadly applicable across diverse settings. We analyze the theoretical trade-offs in efficiency and accuracy and demonstrate empirically that our method not only achieves formal unlearning guarantees but also performs effectively in practice, outperforming existing baselines.
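The noisy fine-tuning recipe is easy to sketch on a toy model. The example below is purely illustrative: the model is a one-dimensional least-squares fit, and the noise scale is not calibrated to any formal unlearning guarantee, which in the paper requires specific noise calibration.

```python
import random

# Toy sketch of unlearning by noisy fine-tuning on retain data only.
# Model: a scalar w minimizing mean squared distance to the data points.
# The per-step Gaussian noise is what enables privacy-amplification-style
# arguments; its scale here is illustrative, not certified.

random.seed(0)

full_data   = [1.0, 2.0, 3.0, 10.0]   # 10.0 is the point to be forgotten
retain_data = [1.0, 2.0, 3.0]

w = sum(full_data) / len(full_data)   # model "trained" on all data (its minimizer)

lr, sigma, steps = 0.1, 0.01, 500
for _ in range(steps):
    grad = 2.0 * sum(w - x for x in retain_data) / len(retain_data)
    w = w - lr * grad + random.gauss(0.0, sigma)   # noisy gradient step

retain_minimizer = sum(retain_data) / len(retain_data)  # 2.0
```

After noisy fine-tuning, the parameter sits near the retain-only minimizer, and the injected noise ensures the final model does not deterministically encode the forgotten point.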
Paperid:3171
Authors:Hjalmar Wijk · Tao Lin · Joel Becker · Sami Jawhar · Neev Parikh · Lawrence Chan · Thomas Broadley · Michael Chen · Joshua Clymer · Jai Dhyani · Elena Ericheva · Katharyn Garcia · Brian Goodrich · Nikola Jurkovic · Megan Kinniment · Aron Lajko · Seraphina Nix · Lucas Jun Koba Sato · William Saunders · Maksym Taran · Ben West · Elizabeth Barnes
Abstract:
Frontier AI safety policies highlight automation of AI research and development (R&D) by AI agents as an important capability to anticipate. However, there exist few evaluations for AI R&D capabilities, and none that are highly realistic and have a direct comparison to human performance. We introduce RE-Bench (Research Engineering Benchmark, V1), which consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 distinct human experts. We confirm that our experts make progress in the environments given 8 hours, with 82% of expert attempts achieving a non-zero score and 24% matching or exceeding our strong reference solutions. We compare humans to several public frontier models through best-of-$k$ with varying time budgets and agent designs, and find that the best AI agents achieve a score 4× higher than human experts when both are given a total time budget of 2 hours per environment. However, humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent scores given an 8-hour budget, and achieving 2× the score of the top AI agent when both are given 32 total hours (across different attempts).
Paperid:3172
Authors:Quan Xiao · Hui Yuan · A F M Saif · Gaowen Liu · Ramana Kompella · Mengdi Wang · Tianyi Chen
Abstract:
Diffusion models, which iteratively denoise data samples to synthesize high-quality outputs, have achieved empirical success across domains. However, optimizing these models for downstream tasks often involves nested bilevel structures—such as tuning hyperparameters for fine-tuning tasks or noise schedules in training dynamics—where traditional bilevel methods fail due to the infinite-dimensional probability space and prohibitive sampling costs. We formalize this challenge as a generative bilevel optimization problem and address two key scenarios: (1) fine-tuning pre-trained models via an inference-only lower-level solver paired with a sample-efficient gradient estimator for the upper level, and (2) training a diffusion model from scratch with noise schedule optimization by reparameterizing the lower-level problem and designing a computationally tractable gradient estimator. Our first-order bilevel framework overcomes the incompatibility of conventional bilevel methods with diffusion processes, offering theoretical grounding and computational practicality. Experiments demonstrate that our method outperforms existing fine-tuning and hyperparameter search baselines.
Paperid:3173
Authors:Thomas Paniagua · Chinmay Savadikar · Tianfu Wu
Abstract:
White-box targeted adversarial attacks expose fundamental vulnerabilities of Deep Neural Networks (DNNs) across computer vision and beyond. However, two critical challenges remain unresolved: (i) How many target classes can be attacked simultaneously in a specified order, known as the \textit{ordered top-$K$ attack} problem ($K \geq 1$)? (ii) What are the resulting ordered top-$K$ adversarial perturbations for a given benign image? In this paper, we provide a simple yet elegant solution to both challenges by establishing that **ordered top-$K$ adversarial perturbations can be expressed as linear combinations of the right singular vectors (corresponding to non-zero singular values) of the attack-targets-ranking constrained logit-to-image Jacobian matrix**. These right singular vectors form an orthogonal basis in the image space, spanning the most informative subspace for learning adversarial perturbations. We propose \textbf{RisingAttacK}, a novel method within the Sequential Quadratic Programming framework, which efficiently exploits this structure to optimize adversarial perturbations. We propose a holistic figure-of-merit (FoM) evaluation metric based on attack success rates (ASRs) and $\ell_p$-norms of perturbations. Our RisingAttacK significantly outperforms the previous state-of-the-art approach, QuadAttacK, consistently across all top-$K$ (1, 5, 10, 20, 25) and four models (ResNet-50, DenseNet-121, ViT-B and DEiT-S) on ImageNet-1k in experiments.
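The singular-vector characterization can be seen concretely on a toy linear "model", where the Jacobian is just the weight matrix. This is an illustrative sketch of the subspace structure only, not the RisingAttacK optimization itself; the model, dimensions, and combination weights are all assumed.

```python
import numpy as np

# Illustrative sketch (not RisingAttacK): for a tiny linear model f(x) = W x,
# the logit-to-input Jacobian is W itself. Perturbations built from the right
# singular vectors of W (non-zero singular values) span the only directions
# that change the logits to first order.

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 8))        # 3 logits, 8 input dims; Jacobian of f
U, s, Vt = np.linalg.svd(W, full_matrices=False)
V = Vt.T                               # columns: orthonormal basis of the row space

coeffs = np.array([0.5, -0.2, 0.1])    # hypothetical combination weights
delta = V @ coeffs                     # perturbation in the informative subspace

# a direction from the null space leaves the logits unchanged to first order
Vt_full = np.linalg.svd(W, full_matrices=True)[2]
null_dir = Vt_full[3:].T @ rng.standard_normal(5)
```

The contrast between `delta` (which moves the logits) and `null_dir` (which does not) is exactly why restricting the search to the right-singular-vector subspace loses nothing while shrinking the optimization problem.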
Paperid:3174
Authors:Yuxuan Zhu · Dylan Bowman · Akul Gupta · Adarsh Danda · Antony Kellermann · Avi Dhir · Conner Jensen · Eric Ihli · Jason Benn · Jet Geronimo · Kaicheng Yu · Philip Li · Richard Fang · Sudhit Rao · Twm Stone · Daniel Kang
Abstract:
Large language model (LLM) agents are increasingly capable of autonomously conducting cyberattacks, posing significant threats to existing applications. This growing risk highlights the urgent need for a real-world benchmark to evaluate the ability of LLM agents to exploit web application vulnerabilities. However, existing benchmarks fall short as they are limited to abstracted Capture the Flag competitions or lack comprehensive coverage. Building a benchmark for real-world vulnerabilities involves both specialized expertise to reproduce exploits and a systematic approach to evaluating unpredictable threats. To address this challenge, we introduce CVE-Bench, a real-world cybersecurity benchmark based on critical-severity Common Vulnerabilities and Exposures. In CVE-Bench, we design a sandbox framework that enables LLM agents to exploit vulnerable web applications in scenarios that mimic real-world conditions, while also providing effective evaluation of their exploits. Our evaluation shows that the state-of-the-art agent framework can resolve up to 13% or 25% of vulnerabilities, with or without vulnerability descriptions.
Paperid:3175
Authors:Rui Yang · Lin Song · Yicheng Xiao · Runhui Huang · Yixiao Ge · Ying Shan · Hengshuang Zhao
Abstract:
Recent advancements in large language models (LLMs) have significantly propelled the development of large multimodal models (LMMs), highlighting the potential for general and intelligent assistants. However, most LMMs model visual and textual modalities separately, leading to recent efforts to develop native LMMs using a single transformer. Despite the promise, these native models are resource-intensive and often exhibit performance gaps compared to their compositional counterparts. To alleviate this issue, we propose a simple yet efficient method to construct a baseline for the native and end-to-end large multi-modal model in a single transformer. First, we propose a new early-fusion LMM that can fuse multi-modal inputs in the early stage and respond to visual instructions in an auto-regressive manner. Second, we devise an efficient training recipe for the proposed model, which harnesses the prior knowledge of the pre-trained models, addressing both the performance limitations and the challenge of resource consumption. The proposed model demonstrates superior performance compared to other LMMs using one transformer and significantly narrows the performance gap with compositional LMMs.
Paperid:3176
Authors:Saman Forouzandeh · Parham Moradi · Mahdi Jalili
Abstract:
This paper proposes SHARP-Distill (\textbf{S}peedy \textbf{H}ypergraph \textbf{A}nd \textbf{R}eview-based \textbf{P}ersonalised \textbf{Distill}ation), a novel knowledge distillation approach based on the teacher-student framework that combines Hypergraph Neural Networks (HGNNs) with language models to enhance recommendation quality while significantly improving inference time. The teacher model leverages HGNNs to generate user and item embeddings from interaction data, capturing high-order and group relationships, and employs a pre-trained language model to extract rich semantic features from textual reviews. We utilize a contrastive learning mechanism to ensure structural consistency between various representations. The student model includes a shallow and lightweight GCN called CompactGCN, designed to inherit high-order relationships while reducing computational complexity. Extensive experiments on real-world datasets demonstrate that SHARP-Distill achieves up to 68× faster inference time compared to HGNN and 40× faster than LightGCN while maintaining competitive recommendation accuracy.
Paperid:3177
Authors:Anjiang Wei · Allen Nie · Thiago Teixeira · Rohan Yadav · Wonchan Lee · Ke Wang · Alex Aiken
Abstract:
Modern scientific discovery increasingly relies on high-performance computing for complex modeling and simulation. A key challenge in improving parallel program performance is efficiently mapping tasks to processors and data to memory, a process dictated by intricate, low-level system code known as mappers. Developing high-performance mappers demands days of manual tuning, posing a significant barrier for domain scientists without systems expertise. We introduce a framework that automates mapper development with generative optimization, leveraging richer feedback beyond scalar performance metrics. Our approach features the Agent-System Interface, which includes a Domain-Specific Language (DSL) to abstract away low-level complexity of system code and define a structured search space, as well as AutoGuide, a mechanism that interprets raw execution output into actionable feedback. Unlike traditional reinforcement learning methods such as OpenTuner, which rely solely on scalar feedback, our method finds superior mappers in far fewer iterations. With just 10 iterations, it outperforms OpenTuner even after 1000 iterations, achieving $3.8\times$ faster performance. Our approach finds mappers that surpass expert-written mappers by up to $1.34\times$ speedup across nine benchmarks while reducing tuning time from days to minutes.
Paperid:3178
Authors:Loek van Rossem · Andrew Saxe
Abstract:
Even when massively overparameterized, deep neural networks show a remarkable ability to generalize. Research on this phenomenon has focused on generalization within distribution, via smooth interpolation. Yet in some settings neural networks also learn to extrapolate to data far beyond the bounds of the original training set, sometimes even allowing for infinite generalization, implying that an algorithm capable of solving the task has been learned. Here we undertake a case study of the learning dynamics of recurrent neural networks trained on the streaming parity task in order to develop an effective theory of algorithm development. The streaming parity task is a simple but nonlinear task defined on sequences up to arbitrary length. We show that, with sufficient finite training experience, RNNs exhibit a phase transition to perfect infinite generalization. Using an effective theory for the representational dynamics, we find an implicit representational merger effect which can be interpreted as the construction of a finite automaton that reproduces the task. Overall, our results disclose one mechanism by which neural networks can generalize infinitely from finite training experience.
Paperid:3179
Authors:Shiqing Gao · Yihang Zhou · Shuai Shao · Haoyu Luo · Yiheng Bing · Jiaxin Ding · Luoyi Fu · Xinbing Wang
Abstract:
Ensuring safety is a critical challenge that limits the application of Reinforcement Learning (RL) in real-world scenarios. Constrained Reinforcement Learning (CRL) addresses this by maximizing expected returns while satisfying predefined constraints, typically formulated as the expected cumulative cost. However, expectation-based constraints overlook rare but high-impact extreme value events in the tail distribution, such as black swan incidents, which can lead to severe constraint violations. To address this issue, we propose the Extreme Value policy Optimization (EVO) algorithm, leveraging Extreme Value Theory (EVT) to model and exploit extreme reward and cost samples, reducing constraint violations. EVO introduces an extreme quantile optimization objective to explicitly capture extreme samples in the cost tail distribution. Additionally, we propose an extreme prioritization mechanism during replay, amplifying the learning signal from rare but high-impact extreme samples. Theoretically, we establish upper bounds on expected constraint violations during policy updates, guaranteeing strict constraint satisfaction at a zero-violation quantile level. Further, we demonstrate that EVO achieves a lower probability of constraint violations than expectation-based methods and exhibits lower variance than quantile regression methods. Extensive experiments show that EVO significantly reduces constraint violations during training while maintaining competitive policy performance compared to baselines.
Paperid:3180
Authors:Shashwat Goel · Joschka Strüber · Ilze Amanda Auzina · Karuna Chandra · Ponnurangam Kumaraguru · Douwe Kiela · Ameya Pandurang Prabhu · Matthias Bethge · Jonas Geiping
Abstract:
As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both these tasks, which we refer to as AI Oversight. We study how model similarity affects both aspects of AI oversight by proposing a probabilistic metric for LM similarity based on overlap in model mistakes. Using this metric, we first show that LLM-as-a-judge scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations, and find complementary knowledge between the weak supervisor and strong student model plays a crucial role in gains from "weak-to-strong generalization". As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend -- model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.
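One simple way to operationalize "overlap in model mistakes" is a chance-adjusted joint-error rate. The score below is an illustrative simplification, not necessarily the paper's exact probabilistic metric:

```python
# Illustrative chance-adjusted error-overlap score between two models, given
# per-example correctness flags (True = correct). It compares the observed
# joint-error rate to the rate expected if the two models erred independently,
# normalized by the maximum possible overlap. A simplification for exposition,
# not necessarily the paper's exact metric.

def error_overlap_score(correct_a, correct_b):
    n = len(correct_a)
    err_a = sum(not c for c in correct_a) / n
    err_b = sum(not c for c in correct_b) / n
    both = sum((not a) and (not b) for a, b in zip(correct_a, correct_b)) / n
    expected = err_a * err_b          # joint-error rate under independence
    max_joint = min(err_a, err_b)     # ceiling on the joint-error rate
    if max_joint == expected:         # degenerate case: no room above chance
        return 0.0
    return (both - expected) / (max_joint - expected)

a = [True, True, False, False, True, False]
b = [True, False, False, False, True, True]
score = error_overlap_score(a, b)
```

Two models with identical mistakes score 1, while models whose errors co-occur no more often than chance score 0, so rising scores across model generations would signal the correlated-failure risk the abstract warns about.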
Paperid:3181
Authors:Su Jia · Peter Frazier · Nathan Kallus
Abstract:
Experimentation with interference poses a significant challenge in contemporary online platforms. Prior research on experimentation with interference has concentrated on the final output of a policy. Cumulative performance, while equally important, is less well understood. To address this gap, we introduce the problem of Multi-armed Bandits with Interference (MABI), where the learner assigns an arm to each of $N$ experimental units over $T$ rounds. The reward of each unit in each round depends on the treatments of all units, where the interference between two units decays in their distance. The reward functions are chosen by an adversary and may vary arbitrarily over time and across different units. We first show that the optimal expected regret (against the best fixed-arm policy) is $\tilde O(\sqrt T)$, and can be achieved by a switchback policy. However, the regret (as a random variable) for any switchback policy suffers a high variance, since it does not account for $N$. We propose a policy based on a novel clustered randomization scheme, whose regret (i) is optimal in expectation and (ii) admits a high probability bound that vanishes in $N$.
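The switchback design referenced above is simple to sketch: time is cut into blocks, and within each block every unit receives the same randomly drawn arm, holding cross-unit interference constant within a block. Block length and the arm set below are illustrative choices, not values from the paper.

```python
import random

# Sketch of a switchback policy for N units over T rounds: a new arm is drawn
# at the start of each block and shared by all units for the block's duration.
# Block length and number of arms are illustrative.

def switchback_assignments(n_units, n_rounds, n_arms, block_len, seed=0):
    rng = random.Random(seed)
    schedule = []
    for t in range(n_rounds):
        if t % block_len == 0:               # new block: redraw the shared arm
            arm = rng.randrange(n_arms)
        schedule.append([arm] * n_units)     # every unit gets the block's arm
    return schedule

sched = switchback_assignments(n_units=5, n_rounds=12, n_arms=3, block_len=4)
```

Because all units always share one arm, the design never exploits the population size $N$, which is the variance drawback that motivates the clustered randomization scheme.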
Paperid:3182
Authors:Rickard Gabrielsson · Jiacheng Zhu · Onkar Bhardwaj · Leshem Choshen · Kristjan Greenewald · Mikhail Yurochkin · Justin Solomon
Abstract:
Fine-tuning large language models (LLMs) with low-rank adaptations (LoRAs) has become common practice, often yielding numerous copies of the same LLM differing only in their LoRA updates. This paradigm presents challenges for systems that serve real-time responses to queries that each involve a different LoRA. Prior works optimize the design of such systems but still require continuous loading and offloading of LoRAs, as it is infeasible to store thousands of LoRAs in GPU memory. To mitigate this issue, we investigate the efficacy of compression when serving LoRAs. We propose a method for the joint compression of LoRAs into a shared basis paired with LoRA-specific scaling matrices. We extend our algorithm to learn clusters of LoRAs that are amenable to joint compression, allowing it to scale gracefully to large LoRA collections. Our experiments with up to 1000 LoRAs demonstrate that compressed LoRAs preserve performance while offering major throughput gains in realistic serving scenarios with over a thousand LoRAs, maintaining 80\% of the throughput of serving a single LoRA.
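The shared-basis idea can be sketched with a tiny numpy example. Dimensions and the factorization below are illustrative simplifications; the paper's method additionally learns which LoRAs to cluster together:

```python
import numpy as np

# Sketch of jointly compressing several LoRA updates into one shared basis
# plus per-LoRA coefficient matrices. Here the updates share a common column
# space by construction, so a rank-r shared basis reconstructs them exactly;
# real LoRA collections only approximately share structure.

rng = np.random.default_rng(0)
d, k, r, n_loras = 32, 16, 4, 10

U_true = rng.standard_normal((d, r))
deltas = [U_true @ rng.standard_normal((r, k)) for _ in range(n_loras)]

stacked = np.hstack(deltas)                 # d x (k * n_loras)
U_shared = np.linalg.svd(stacked, full_matrices=False)[0][:, :r]  # shared basis

# per-LoRA coefficients: project each update onto the shared basis
coeffs = [U_shared.T @ dW for dW in deltas]
recons = [U_shared @ C for C in coeffs]
```

Serving then only needs the single $d \times r$ basis resident in memory plus small per-LoRA coefficient matrices, instead of one full update matrix per LoRA.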
Paperid:3183
Authors:Magzhan Gabidolla · Miguel Carreira-Perpinan
Abstract:
We explore ensembles of axis-aligned decision stumps, which can be viewed as a generalized additive model (GAM). In this model, stumps utilizing the same feature are grouped to form a shape function for that feature. Instead of relying on boosting or bagging, we employ alternating optimization to learn a fixed-size stump forest. We optimize the parameters of each stump exactly through enumeration, given the other stumps are fixed. For fixed stump splits, the leaf values are optimized jointly by solving a convex problem. To address the overfitting issue inherent in naive optimization of stump forests, we propose effective regularization techniques. Our regularized stump forests achieve accuracy comparable to state-of-the-art GAM methods while using fewer parameters. This work is the first to successfully learn stump forests without employing traditional ensembling techniques like bagging or boosting.
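The "optimize each stump exactly through enumeration" step reduces, for squared error, to scanning candidate thresholds, since the optimal leaf values at a fixed split are the target means on each side. A minimal single-feature sketch (data and names illustrative, not the paper's full alternating procedure):

```python
# Minimal sketch of fitting one regression stump exactly by enumeration:
# for each candidate threshold on a feature, the optimal leaf values are the
# target means on each side, so the best stump is found by a full scan.

def fit_stump(x, y):
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs, ys = [x[i] for i in order], [y[i] for i in order]
    best = None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                         # no valid threshold between ties
        thr = 0.5 * (xs[i] + xs[i - 1])
        left, right = ys[:i], ys[i:]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((v - lmean) ** 2 for v in left)
               + sum((v - rmean) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    return best  # (squared error, threshold, left leaf value, right leaf value)

sse, thr, left_val, right_val = fit_stump([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 1.0, 1.0])
```

In the alternating scheme described above, this exact per-stump solve would be applied to residual targets with the other stumps held fixed.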
Paperid:3184
Authors:Jiahao Zhu · Kang You · Dandan Ding · Zhan Ma
Abstract:
Reflectance attributes in LiDAR point clouds provide essential information for downstream tasks but remain underexplored in neural compression methods. To address this, we introduce SerLiC, a serialization-based neural compression framework to fully exploit the intrinsic characteristics of LiDAR reflectance. SerLiC first transforms 3D LiDAR point clouds into 1D sequences via scan-order serialization, offering a device-centric perspective for reflectance analysis. Each point is then tokenized into a contextual representation comprising its sensor scanning index, radial distance, and prior reflectance, for effective dependency exploration. For efficient sequential modeling, Mamba is incorporated with a dual parallelization scheme, enabling simultaneous autoregressive dependency capture and fast processing. Extensive experiments demonstrate that SerLiC attains over 2$\times$ volume reduction against the original reflectance data, outperforming the state-of-the-art method by up to 22\% reduction of compressed bits while using only 2\% of its parameters. Moreover, a lightweight version of SerLiC achieves $\geq 10$ fps (frames per second) with just 111K parameters, which is attractive for real applications.
Paperid:3185
Authors:Jianliang He · Xintian Pan · Siyu Chen · Zhuoran Yang
Abstract:
We study how multi-head softmax attention networks are trained to perform in-context learning with linear data. Through a blend of theoretical insights and empirical investigations, we reveal the emergence of elegant attention patterns, such as diagonal query-key structures and last-entry value-output formations, which collectively emulate preconditioned gradient descent. By coupling positive and negative attention heads, these models achieve remarkable expressiveness, debiasing capabilities, and optimization efficiency. Furthermore, we analyze the training dynamics, showcasing how transformers converge toward near-optimal solutions in high-dimensional regimes, outperforming single-head models. Extensions to non-isotropic covariates and multi-task regression demonstrate the flexibility and robustness of this approach. Our findings illuminate the mathematical elegance and expressive potential of transformers, paving the way for deeper understanding and broader applications of in-context learning.
Paperid:3186
Authors:Tao Li · Zhengbao He · Yujun Li · Yasheng Wang · Lifeng Shang · Xiaolin Huang
Abstract:
Fine-tuning large-scale pre-trained models is prohibitively expensive in terms of computation and memory costs. Low-Rank Adaptation (LoRA), a popular Parameter-Efficient Fine-Tuning (PEFT) method, offers an efficient solution by optimizing only low-rank matrices. Despite recent progress in improving LoRA's performance, the relationship between the LoRA optimization space and the full parameter space is often overlooked. A solution that appears flat in the loss landscape of the LoRA space may still exhibit sharp directions in the full parameter space, potentially undermining generalization performance. In this paper, we introduce Flat-LoRA, which aims to identify a low-rank adaptation situated in a flat region of the full parameter space. Instead of adopting the well-established sharpness-aware minimization approach, which incurs significant computation and memory overheads, we employ a Bayesian expectation loss objective to preserve training efficiency. Further, we design a refined random perturbation generation strategy for improved performance and carefully manage memory overhead using random seeds. Experiments across diverse tasks—including mathematical reasoning, coding abilities, dialogue generation, instruction following, and text-to-image generation—demonstrate that Flat-LoRA improves both in-domain and out-of-domain generalization.
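For readers unfamiliar with the LoRA parameterization this abstract builds on, a minimal sketch of the low-rank update is below; the layer shapes, rank `r`, and scaling `alpha` are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Minimal sketch of the standard LoRA update (background for the abstract above).
# Shapes, rank r, and scaling alpha are illustrative, not taken from the paper.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 32, 4, 8

W = rng.standard_normal((d_out, d_in))      # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                    # zero-initialized so training starts at W

def lora_forward(x):
    """Forward pass with the merged weight W + (alpha / r) * B @ A."""
    return (W + (alpha / r) * B @ A) @ x

x = rng.standard_normal(d_in)
# With B = 0 the adapted model reproduces the frozen model exactly.
assert np.allclose(lora_forward(x), W @ x)
```

Only the small factors `A` and `B` (here `r * (d_in + d_out)` entries) are trained, which is the source of LoRA's memory savings.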
Paperid:3187
Authors:Yuhui Zhang · Yuchang Su · Chenyu Wang · Tianhong Li · Zoe Wefers · Jeffrey J. Nirschl · James Burgess · Daisy Yi Ding · Alejandro Lozano · Emma Lundberg · Serena Yeung
Abstract:
Building a virtual cell capable of accurately simulating cellular behaviors in silico has long been a dream in computational biology. We introduce CellFlow, an image-generative model that simulates cellular morphology changes induced by chemical and genetic perturbations using flow matching. Unlike prior methods, CellFlow models distribution-wise transformations from unperturbed to perturbed cell states, effectively distinguishing actual perturbation effects from experimental artifacts such as batch effects—a major challenge in biological data. Evaluated on chemical (BBBC021), genetic (RxRx1), and combined perturbation (JUMP) datasets, CellFlow generates biologically meaningful cell images that faithfully capture perturbation-specific morphological changes, achieving a 35% improvement in FID scores and a 12% increase in mode-of-action prediction accuracy over existing methods. Additionally, CellFlow enables continuous interpolation between cellular states, providing a potential tool for studying perturbation dynamics. These capabilities mark a significant step toward realizing virtual cell modeling for biomedical research.
Paperid:3188
Authors:Zhikang Chen · Abudukelimu Wuerkaixi · Sen Cui · Haoxuan Li · Ding Li · Jingfeng ZHANG · Bo Han · Gang Niu · Houfang Liu · Yi Yang · Sifan YANG · Changshui Zhang · Tianling Ren
Abstract:
Deep networks are prone to catastrophic forgetting during sequential task learning, i.e., losing the knowledge about old tasks upon learning new tasks. To this end, continual learning (CL) has emerged, whose existing methods focus mostly on regulating or protecting the parameters associated with the previous tasks. However, parameter protection is often impractical: the number of parameters needed to store old-task knowledge grows linearly with the number of tasks, and without such growth it is hard to preserve the parameters related to the old-task knowledge. In this work, we bring a neuroscience opinion to CL: in neural networks, the pathways matter more than the parameters when concerning the knowledge acquired from the old tasks. Following this opinion, we propose a novel CL framework, learning without isolation (LwI), where model fusion is formulated as graph matching and the pathways occupied by the old tasks are protected without being isolated. Thanks to the sparsity of activation channels in a deep network, LwI can adaptively allocate available pathways for a new task, realizing pathway protection and addressing catastrophic forgetting in a parameter-efficient manner. Experiments on popular benchmark datasets demonstrate the superiority of the proposed LwI.
Paperid:3189
Authors:Andy Dong · Wei-Ning Chen · Ayfer Ozgur
Abstract:
We study how inherent randomness in the training process—where each sample (or client in federated learning) contributes only to a randomly selected portion of training—can be leveraged for privacy amplification. This includes (1) data partitioning, where a sample participates in only a subset of training iterations, and (2) model partitioning, where a sample updates only a subset of the model parameters. We apply our framework to model parallelism in federated learning, where each client updates a randomly selected subnetwork to reduce memory and computational overhead, and show that existing methods, e.g. model splitting or dropout, provide a significant privacy amplification gain not captured by previous privacy analysis techniques. Additionally, we introduce balanced iteration subsampling, a new data partitioning method where each sample (or client) participates in a fixed number of training iterations. We show that in certain regimes, this method yields stronger privacy amplification than Poisson (i.i.d.) sampling of data (or clients). Our results demonstrate that randomness in the training process, which is structured rather than i.i.d. and interacts with data in complex ways, can be systematically leveraged for nontrivial privacy amplification.
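As background for the amplification results discussed above, the textbook amplification-by-subsampling bound for pure $\epsilon$-DP can be computed directly; this is the classical i.i.d.-sampling baseline, not the paper's analysis of structured participation.

```python
import math

def amplified_epsilon(eps, q):
    """Classical privacy amplification by subsampling for pure eps-DP:
    running an eps-DP mechanism on a q-subsampled dataset satisfies
    eps'-DP with eps' = log(1 + q * (exp(eps) - 1)). This is the
    textbook i.i.d. baseline; the paper's structured (non-i.i.d.)
    participation analysis goes beyond it."""
    return math.log(1.0 + q * math.expm1(eps))

# Smaller sampling rates q yield stronger amplification (smaller eps').
for q in (1.0, 0.5, 0.1):
    print(q, amplified_epsilon(1.0, q))
```

At `q = 1` the bound reduces to the original `eps`, and as `q` shrinks the effective privacy loss scales roughly linearly in `q`.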
Paperid:3190
Authors:Yash Kheshwani · Avishek Ghosh · Nikhil Karamchandani
Abstract:
This work investigates the problem of best arm identification for multi-agent multi-armed bandits. We consider $N$ agents grouped into $M$ clusters, where each cluster solves a stochastic bandit problem. The mapping between agents and bandits is \textit{a priori} unknown. Each bandit is associated with $K$ arms, and the goal is to identify the best arm for each agent under a $\delta$-probably correct ($\delta$-PC) framework, while minimizing sample complexity and communication overhead. We propose two novel algorithms: \emph{Clustering then Best Arm Identification} (\texttt{Cl-BAI}) and \emph{Best Arm Identification then Clustering} (\texttt{BAI-Cl}). \texttt{Cl-BAI} employs a two-phase approach that first clusters agents based on the bandit problems they are learning, followed by identifying the best arm for each cluster. \texttt{BAI-Cl} reverses the sequence by identifying the best arms first and then clustering agents accordingly. Both algorithms exploit the successive elimination framework to ensure computational efficiency and high accuracy. Theoretical analysis establishes $\delta$-PC guarantees for both methods, derives bounds on their sample complexity, and provides a lower bound for the problem class. Moreover, when $M$ is small (a constant), we show that the sample complexity of (a variant of) \texttt{BAI-Cl} is (order-wise) minimax optimal. Experiments on synthetic and real-world (MovieLens, Yelp) data demonstrate the superior performance of the proposed algorithms in terms of sample and communication efficiency, particularly in settings where $M \ll N$.
Paperid:3191
Authors:Alekh Agarwal · Christoph Dann · Teodor Vanislavov Marinov
Abstract:
Offline algorithms for Reinforcement Learning from Human Preferences (RLHF), which use only a fixed dataset of sampled responses given an input, and preference feedback among these responses, have gained increasing prominence in the literature on aligning language models. In this paper, we study how the different design choices made in methods such as DPO, IPO, SLiC and many variants influence the quality of the learned policy, from a theoretical perspective. Our treatment yields insights into the choices of loss function, the policy which is used to normalize log-likelihoods, and also the role of the data sampling policy. Notably, our results do not rely on the standard reparameterization-style arguments used to motivate some of the algorithms in this family, which allows us to give a unified treatment to a broad class of methods. We also conduct a small empirical study to verify some of the theoretical findings on a standard summarization benchmark.
Paperid:3192
Authors:Yixuan Li · Changli Tang · Jimin Zhuang · Yudong Yang · Guangzhi Sun · Wei Li · Zejun MA · Chao Zhang
Abstract:
Human vision is dynamic and continuous. However, in video understanding with multimodal large language models (LLMs), existing methods primarily rely on static features extracted from images sampled at a fixed low frame rate of frames-per-second (FPS) $\leqslant$2, leading to critical visual information loss. In this paper, we introduce F-16, the first multimodal LLM designed for high-frame-rate video understanding. By increasing the frame rate to 16 FPS and compressing visual tokens within each 1-second clip, F-16 efficiently captures dynamic visual features while preserving key semantic information. Experimental results demonstrate that higher frame rates considerably enhance video understanding across multiple benchmarks, providing a new approach to improving video LLMs beyond scaling model size or training data. F-16 achieves state-of-the-art performance among 7-billion-parameter video LLMs on both general and fine-grained video understanding benchmarks, such as Video-MME and TemporalBench. Furthermore, F-16 excels in complex spatiotemporal tasks, including high-speed sports analysis (*e.g.*, basketball, football, gymnastics, and diving), outperforming SOTA proprietary visual models like GPT-4o and Gemini-1.5-pro. Additionally, we introduce a novel decoding method for F-16 that enables highly efficient low-frame-rate inference without requiring model retraining. Upon acceptance, we will release the source code, model checkpoints, and data.
Paperid:3193
Authors:Zhengzheng Lou · Hang Xue · Chaoyang Zhang · Shizhe Hu
Abstract:
Despite the superior capability in complementary information exploration and consistent clustering structure learning, most current weight-based multi-modal clustering methods still have three limitations: 1) lack of trustworthiness in learned weights; 2) isolated view weight learning; 3) extra weight parameters. Motivated by the peer-review mechanism in academia, we take a new peer-review perspective on the multi-modal clustering problem and propose to iteratively treat one modality as ``author'' and the remaining modalities as ``reviewers'' so as to reach a peer-review score for each modality. This essentially explores the underlying relationships among modalities. To improve trustworthiness, we further design a new trustworthy score with a self-supervised working mechanism. Following that, we propose a novel Peer-review Trustworthy Information Bottleneck (PTIB) method for weighted multi-modal clustering, where both of the above scores are simultaneously taken into account for accurate and parameter-free modality weight learning. Extensive experiments on eight multi-modal datasets suggest that PTIB outperforms state-of-the-art non-deep and deep multi-modal clustering methods.
Paperid:3194
Authors:Yu-Wei Chen · Raghu Pasupathy · Jordan A Awan
Abstract:
This work identifies the first privacy-aware stratified sampling scheme that minimizes the variance for general private mean estimation under the Laplace, Discrete Laplace (DLap) and Truncated-Uniform-Laplace (TuLap) mechanisms within the framework of differential privacy (DP). We view stratified sampling as a subsampling operation, which amplifies the privacy guarantee; however, to have the same final privacy guarantee for each group, different nominal privacy budgets need to be used depending on the subsampling rate. Ignoring the effect of DP, traditional stratified sampling strategies risk significant variance inflation. We phrase our optimal survey design as an optimization problem, where we determine the optimal subsampling sizes for each group with the goal of minimizing the variance of the resulting estimator. We establish strong convexity of the variance objective, propose an efficient algorithm to identify the integer-optimal design, and offer insights on the structure of the optimal design.
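For context, the classical non-private baseline that a DP-aware stratified design generalizes is Neyman allocation, where the variance-minimizing sample size for each stratum is proportional to the stratum's size times its standard deviation; the stratum sizes and standard deviations below are illustrative assumptions, not from the paper.

```python
import numpy as np

# Classical (non-private) Neyman allocation for stratified mean estimation:
# n_h proportional to N_h * sigma_h minimizes the estimator's variance.
# The numbers below are illustrative, not taken from the paper.
N = np.array([1000, 500, 200])      # stratum sizes
sigma = np.array([2.0, 5.0, 1.0])   # within-stratum standard deviations
n_total = 100                       # total sampling budget

n_h = n_total * (N * sigma) / np.sum(N * sigma)
print(np.round(n_h, 1))  # high-variance strata get proportionally more samples
```

The paper's point is that once each group's noise mechanism and subsampling-dependent privacy budget enter the variance, this proportional rule no longer applies and the optimal integer design must be found by convex optimization.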
Paperid:3195
Authors:Rui Ye · shuo tang · Rui Ge · Yaxin Du · Zhenfei Yin · Jing Shao · Siheng Chen
Abstract:
LLM-based multi-agent systems (MAS) have shown significant potential in tackling diverse tasks. However, to design effective MAS, existing approaches heavily rely on manual configurations or multiple calls of advanced LLMs, resulting in inadaptability and high inference costs. In this paper, we simplify the process of building an MAS by reframing it as a generative language task, where the input is a user query and the output is a corresponding MAS. To address this novel task, we unify the representation of MAS as executable code and propose a consistency-oriented data construction pipeline to create a high-quality dataset comprising coherent and consistent query-MAS pairs. Using this dataset, we train MAS-GPT, an open-source medium-sized LLM that is capable of generating query-adaptive MAS within a single LLM inference. The generated MAS can be seamlessly applied to process user queries and deliver high-quality responses. Extensive experiments on 9 benchmarks and 4 LLMs show that the proposed MAS-GPT consistently outperforms 10+ baseline MAS methods in diverse settings, indicating MAS-GPT's high effectiveness, efficiency and strong generalization ability. All code, models, and data will be open-sourced.
Paperid:3196
Authors:Yang Yu · Kai Han · Hang Zhou · Yehui Tang · Kaiqi Huang · Yunhe Wang · Dacheng Tao
Abstract:
While large-scale training data is fundamental for developing capable large language models (LLMs), strategically selecting high-quality data has emerged as a critical approach to enhance training efficiency and reduce computational costs. Current data selection methodologies predominantly rely on static, training-agnostic criteria, failing to account for dynamic data interactions and model learning trajectories. In this paper, we propose a new Data Weighting Model (DWM) that adjusts the weight of selected data within each batch to achieve dynamic data utilization during LLM training. Specifically, to better capture the dynamic data preferences of the trained model, a bi-level optimization framework is implemented to update the weighting model. Our experiments demonstrate that DWM enhances the performance of models trained with randomly selected data, and the learned weighting model can be transferred to enhance other data selection methods and models of different sizes. Moreover, we further analyze how a model's data preferences evolve throughout training, providing new insights into data preference during training.
Paperid:3197
Authors:Cyrille Kone · Emilie Kaufmann · Laura Richert
Abstract:
In this paper, we address the problem of identifying the Pareto Set under feasibility constraints in a multivariate bandit setting. Specifically, given a $K$-armed bandit with unknown means $\mu_1, \dots, \mu_K \in \mathbb{R}^d$, the goal is to identify the set of arms whose mean is not uniformly worse than that of another arm (i.e., not smaller for all objectives), while satisfying some known set of linear constraints, expressing, for example, some minimal performance on each objective. Our focus lies in fixed-confidence identification, for which we introduce an algorithm that significantly outperforms racing-like algorithms and the intuitive two-stage approach that first identifies feasible arms and then their Pareto Set. We further prove an information-theoretic lower bound on the sample complexity of any algorithm for constrained Pareto Set identification, showing that the sample complexity of our approach is near-optimal. Our theoretical results are supported by an extensive empirical evaluation on a series of benchmarks.
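Several abstracts in this section (this one and the multi-agent BAI paper above) build on elimination-style fixed-confidence identification. A generic single-bandit successive-elimination sketch is below; the confidence radius and Bernoulli rewards are illustrative choices, not any paper's exact procedure.

```python
import math
import random

def successive_elimination(means, delta=0.05, seed=0):
    """Generic successive-elimination sketch for single-objective best-arm
    identification. Each round samples every active arm once and drops arms
    whose empirical mean falls below the best by more than twice a
    confidence radius. The radius and Bernoulli rewards are illustrative."""
    rng = random.Random(seed)
    K = len(means)
    active = list(range(K))
    counts = [0] * K
    sums = [0.0] * K
    t = 0
    while len(active) > 1:
        t += 1
        for a in active:
            sums[a] += 1.0 if rng.random() < means[a] else 0.0
            counts[a] += 1
        rad = math.sqrt(math.log(4 * K * t * t / delta) / (2 * t))
        best = max(sums[a] / counts[a] for a in active)
        active = [a for a in active if sums[a] / counts[a] >= best - 2 * rad]
    return active[0]

print(successive_elimination([0.2, 0.5, 0.9]))
```

The constrained Pareto-set setting generalizes this: elimination must account for vector-valued means and feasibility with respect to the linear constraints, which is where the two-stage baseline and the proposed algorithm differ.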
Paperid:3198
Authors:Benjamin Holzschuh · Qiang Liu · Georg Kohl · Nils Thuerey
Abstract:
We introduce PDE-Transformer, an improved transformer-based architecture for surrogate modeling of physics simulations on regular grids. We combine recent architectural improvements of diffusion transformers with adjustments specific to large-scale simulations to yield a more scalable and versatile general-purpose transformer architecture, which can be used as the backbone for building large-scale foundation models in the physical sciences. We demonstrate that our proposed architecture outperforms state-of-the-art transformer architectures for computer vision on a large dataset of 16 different types of PDEs. We identify the expansion rate of token embeddings as a key indicator of performance. To keep the expansion rate constant across different tasks and PDEs, we propose to embed different physical channels individually as spatio-temporal tokens, which interact via channel-wise self-attention. We demonstrate that our pretrained models achieve improved performance on several challenging downstream tasks compared to training from scratch and also beat other foundation model architectures for physics simulations. We will publicly release our code and model weights upon acceptance.
Paperid:3199
Authors:Simin Chen · Pranav Pusarla · Baishakhi Ray
Abstract:
The rapid evolution of code large language models (LLMs) underscores the need for effective and transparent benchmarking of their reasoning capabilities. However, the current benchmarking approach heavily depends on publicly available, human-created datasets. The widespread use of these fixed benchmark datasets makes the benchmarking process \textit{static} and thus particularly susceptible to data contamination—an unavoidable consequence of the extensive data collection processes used to train Code LLMs. Existing approaches that address data contamination often suffer from human effort limitations and imbalanced problem complexity. To tackle these challenges, we propose DyCodeEval, a novel benchmarking suite for evaluating Code LLMs under potential data contamination. Given a seed programming problem, DyCodeEval employs multiple agents to extract and modify the context without altering the core logic, generating semantically equivalent variations. We introduce a new evaluation metric and conduct empirical studies on two seed datasets across eight Code LLMs. Results show that DyCodeEval effectively benchmarks reasoning capabilities under contamination risks while generating diverse problem sets to ensure consistent and reliable evaluations.
Paperid:3200
Authors:Tushar Aggarwal · Swayam Singh · Abhijeet Awasthi · Aditya Kanade · Nagarajan Natarajan
Abstract:
Software engineering activities frequently involve edits to existing code. However, contemporary code language models (LMs) lack the ability to handle diverse types of code-edit requirements. In this work, we attempt to overcome this shortcoming through (1) a novel synthetic data generation pipeline and (2) a robust model adaptation algorithm. Starting with seed code examples and diverse editing criteria, our pipeline generates high-quality samples comprising original and modified code, along with natural language instructions in different styles and verbosity. Today's code LMs come bundled with strong abilities, such as code generation and instruction following, which should not be lost due to fine-tuning. To ensure this, we propose a novel adaptation algorithm, SeleKT, that (a) leverages a dense gradient-based step to identify the weights that are most important for code editing, and (b) does a sparse projection onto the base model to avoid overfitting. Using our approach, we obtain a new model NextCoder (adapted from Qwen2.5-Coder-7B) that achieves strong results on four code-editing benchmarks, outperforming comparable size models and even several larger ones. We show the generality of our approach by improving DeepSeekCoder-6.7B and Qwen2.5-Coder-7B, compare against other fine-tuning approaches, and demonstrate robustness by showing retention of code generation abilities post adaptation.
Paperid:3201
Authors:Daniele Tramontano · Yaroslav Kivva · Saber Salehkaleybar · Negar Kiyavash · Mathias Drton
Abstract:
This paper investigates causal effect identification in latent variable Linear Non-Gaussian Acyclic Models (lvLiNGAM) using higher-order cumulants, addressing two prominent setups that are challenging in the presence of latent confounding: (1) a single proxy variable that may causally influence the treatment and (2) underspecified instrumental variable cases where fewer instruments exist than treatments. We prove that causal effects are identifiable with a single proxy or instrument and provide corresponding estimation methods. Experimental results demonstrate the accuracy and robustness of our approaches compared to existing methods, advancing the theoretical and practical understanding of causal inference in linear systems with latent confounders.
Paperid:3202
Authors:Jingxin Liu · Renda Han · Wenxuan Tu · Haotian Wang · Junlong Wu · Jieren Cheng
Abstract:
Subgraphs of a complete graph are usually distributed across multiple devices and only locally accessible due to privacy restrictions. However, existing node-level federated graph learning suffers from at least one of the following issues: 1) heavy reliance on labeled graph samples that are difficult to obtain in real-world applications, and 2) partitioning a complete graph into several subgraphs inevitably causes missing links, leading to sub-optimal sample representations. To solve these issues, we propose a novel $\underline{\text{Fed}}$erated $\underline{\text{N}}$ode-level $\underline{\text{C}}$lustering $\underline{\text{N}}$etwork (FedNCN), which mends the destroyed cross-subgraph links using clustering prior knowledge without sharing private data. Specifically, within each client, to ensure data privacy, we first design an MLP-based projector to implicitly preserve key clustering properties of a subgraph in a denoising learning-like manner, and then upload the resultant clustering signals, which are hard to reconstruct, for subsequent cross-subgraph link restoration. On the server, we maximize the potential affinity between subgraphs stemming from clustering signals by graph similarity estimation and minimize redundant links via the N-Cut criterion. Moreover, we employ a GNN-based generator to learn consensus prototypes from this mended graph, enabling the MLP-GNN joint-optimized learner to enhance data privacy protection and promote the local model for better clustering. Extensive experiments demonstrate the superiority of FedNCN.
Paperid:3203
Authors:Jingwen Fu · Nanning Zheng
Abstract:
Distributional Compositional Generalization (DCG) refers to the ability to tackle tasks from new distributions by leveraging the knowledge of concepts learned from supporting distributions. In this work, we aim to explore the statistical mechanisms of DCG, which have been largely overlooked in previous studies. By statistically formulating the problem, this paper seeks to address two key research questions: 1) Can a method for one DCG problem be applied to another? 2) What statistical properties can indicate a learning algorithm's capacity for knowledge composition in DCG tasks? \textbf{To address the first question}, an invariant measure is proposed to provide a dimension where all different methods converge. This measure underscores the critical role of data in enabling improvements without trade-offs. \textbf{As for the second question}, we reveal that by decoupling the impacts of insufficient data and knowledge composition, the ability of the learning algorithm to compose knowledge relies on the compatibility and sensitivity between the learning algorithm and the composition rule. In summary, the statistical analysis of the generalization mechanisms provided in this paper deepens our understanding of compositional generalization, offering a complementary perspective to prior studies.
Paperid:3204
Authors:Doron Cohen · Aryeh Kontorovich · Roi Weiss
Abstract:
We revisit the recently introduced Local Glivenko-Cantelli setting, which studies distribution-dependent uniform convergence rates of the Empirical Mean Estimator (EME). In this work, we investigate generalizations of this setting where arbitrary estimators are allowed rather than just the EME. Can a strictly larger class of measures be learned? Can better risk decay rates be obtained? We provide exhaustive answers to these questions—which are both negative, provided the learner is barred from exploiting some infinite-dimensional pathologies. On the other hand, allowing such exploits does lead to a strictly larger class of learnable measures.
Paperid:3205
Authors:Neil Mallinar · Daniel Beaglehole · Libin Zhu · Adityanarayanan Radhakrishnan · Parthe Pandit · Misha Belkin
Abstract:
Neural networks trained to solve modular arithmetic tasks exhibit grokking, a phenomenon where the test accuracy starts improving long after the model achieves 100% training accuracy in the training process. It is often taken as an example of "emergence", where model ability manifests sharply through a phase transition. In this work, we show that the phenomenon of grokking is not specific to neural networks nor to gradient descent-based optimization. Specifically, we show that this phenomenon occurs when learning modular arithmetic with Recursive Feature Machines (RFM), an iterative algorithm that uses the Average Gradient Outer Product (AGOP) to enable task-specific feature learning with general machine learning models. When used in conjunction with kernel machines, iterating RFM results in a fast transition from random, near zero, test accuracy to perfect test accuracy. This transition cannot be predicted from the training loss, which is identically zero, nor from the test loss, which remains constant in initial iterations. Instead, as we show, the transition is completely determined by feature learning: RFM gradually learns block-circulant features to solve modular arithmetic. Paralleling the results for RFM, we show that neural networks that solve modular arithmetic also learn block-circulant features. Furthermore, we present theoretical evidence that RFM uses such block-circulant features to implement the Fourier Multiplication Algorithm, which prior work posited as the generalizing solution neural networks learn on these tasks. Our results demonstrate that emergence can result purely from learning task-relevant features and is not specific to neural architectures nor gradient descent-based optimization methods. Furthermore, our work provides more evidence for AGOP as a key mechanism for feature learning in neural networks.
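As background for the Fourier Multiplication Algorithm the abstract refers to, the key identity is that the character map $a \mapsto e^{2\pi i k a / p}$ turns addition mod $p$ into ordinary multiplication; the modulus `p` and frequency `k` below are illustrative choices.

```python
import numpy as np

# The Fourier view of modular addition: mapping residues to p-th roots of
# unity turns (a + b) mod p into a product of complex numbers. This identity
# underlies the "Fourier Multiplication Algorithm" named in the abstract.
# The modulus p and frequency k are illustrative.
p, k = 7, 3
omega = np.exp(2j * np.pi * k / p)  # primitive character at frequency k

def char(a):
    """Character map: residue a -> omega**a, a point on the unit circle."""
    return omega ** (a % p)

a, b = 5, 6
# Addition mod p becomes multiplication of characters, exactly.
assert np.isclose(char(a) * char(b), char((a + b) % p))
```

A model whose features encode such characters (block-circulant structure in weight space) can therefore compute modular sums via elementwise products, which is the generalizing solution the paper studies.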
Paperid:3206
Authors:Nicolás Astorga · Tennison Liu · Yuanzhang Xiao · M van der Schaar
Abstract:
Mathematical optimization is fundamental to decision-making across diverse domains, from operations research to healthcare. Yet, translating real-world problems into optimization models remains a difficult task, often demanding specialized expertise. This paper approaches the problem of autoformulation---the automated creation of solver-ready optimization models from natural language problem descriptions. We identify three core challenges of autoformulation: (1) the vast, problem-dependent hypothesis space, (2) efficient and diverse exploration of this space under uncertainty, and (3) evaluation of formulation correctness against the problem description. To address these challenges, we present a novel method leveraging Large Language Models (LLMs) with Monte-Carlo Tree Search, exploiting the hierarchical nature of optimization modeling to generate and systematically explore possible formulations. To enhance search efficiency, we introduce symbolic pruning to eliminate trivially equivalent search paths (branches), and employ LLM-based evaluation of partial formulations to guide the search. Empirical analysis on linear and mixed-integer programming benchmarks demonstrates our method's effectiveness, with significant performance gains from both LLM-based value estimation and symbolic pruning techniques.
Paperid:3207
Authors:Alessio Russo · Aldo Pacchiano
Abstract:
We study the policy evaluation problem in an online multi-reward multi-policy discounted setting, where multiple reward functions must be evaluated simultaneously for different policies. We adopt an $(\epsilon,\delta)$-PAC perspective to achieve $\epsilon$-accurate estimates with high confidence across finite or convex sets of rewards, a setting that has not been investigated in the literature. Building on prior work on Multi-Reward Best Policy Identification, we adapt the MR-NaS exploration scheme to jointly minimize the sample complexity for evaluating different policies across different reward sets. Our approach leverages an instance-specific lower bound revealing how the sample complexity scales with a measure of value deviation, guiding the design of an efficient exploration policy. Although computing this bound entails a hard non-convex optimization, we propose an efficient convex approximation that holds for both finite and convex reward sets. Experiments in tabular domains demonstrate the effectiveness of this adaptive exploration scheme.
Paperid:3208
Authors:Whiyoung Jung · Sunghoon Hong · Deunsol Yoon · Kanghoon Lee · Woohyung Lim
Abstract:
Multi-Agent Reinforcement Learning (MARL) faces challenges in sparse reward environments, where agents struggle to learn effective strategies. Macro-actions—sequences of actions executed as single decisions—facilitate long-term planning but introduce asynchrony due to varying durations across agents, complicating Centralized Training with Decentralized Execution (CTDE). Existing methods handle this with padding, which misaligns asynchronous experiences, leading to spurious correlations. We propose an Agent-Centric Actor-Critic (ACAC) algorithm that addresses asynchrony without relying on padding. ACAC employs agent-centric history encoders for independent trajectory processing and an attention-based centralized critic to integrate agent information. Experiments on macro-action benchmarks show that ACAC accelerates convergence and enhances overall performance compared to baseline methods, demonstrating its effectiveness in complex MARL tasks.
Paperid:3209
Authors:Peter Pao-Huang · Mitchell Black · Xiaojie Qiu
Abstract:
Generative modeling of graphs with spatial structure is essential across many applications, from computer graphics to spatial genomics. Recent flow-based generative models have achieved impressive results by gradually adding and then learning to remove noise from these graphs. Existing models, however, use graph neural network architectures that are independent of the noise level, limiting their expressiveness. To address this issue, we introduce *Noise Conditioned Graph Networks* (NCGNs), a class of graph neural networks that dynamically modify their architecture according to the noise level during generation. Our theoretical and empirical analysis reveals that as noise increases, (1) graphs require information from increasingly distant neighbors and (2) graphs can be effectively represented at lower resolutions. Based on these insights, we develop Dynamic Message Passing (DMP), a specific instantiation of NCGNs that adapts both the range and resolution of message passing to the noise level. DMP consistently outperforms noise-independent architectures on a variety of domains including $3$D point clouds, spatiotemporal transcriptomics, and images.
Paperid:3210
Authors:Ryotaro Kawata · Kohsei Matsutani · Yuri Kinoshita · Naoki Nishikawa · Taiji Suzuki
Abstract:
Mixture of Experts (MoE), an ensemble of specialized models equipped with a router that dynamically distributes each input to appropriate experts, has achieved successful results in the field of machine learning. However, theoretical understanding of this architecture lags behind due to its inherent complexity. In this paper, we theoretically study the sample and runtime complexity of MoE trained with stochastic gradient descent when learning a regression task with an underlying cluster structure of single index models. On the one hand, we show that a vanilla neural network fails to detect such a latent organization, as it can only process the problem as a whole. This is intrinsically related to the concept of information exponent, which is low for each cluster but increases when we consider the entire task. On the other hand, with a MoE, we show that it succeeds in dividing the problem into easier subproblems by leveraging the ability of each expert to weakly recover the simpler function corresponding to an individual cluster. To the best of our knowledge, this work is among the first to explore the benefits of the MoE framework by examining its SGD dynamics in the context of nonlinear regression.
Paperid:3211
Authors:Jialong Guo · Xinghao Chen · Yehui Tang · Yunhe Wang
Abstract:
Large language models (LLMs) have garnered significant attention and demonstrated impressive capabilities in a wide range of applications. However, due to their enormous computational costs, the deployment and application of LLMs are often severely limited. To address this issue, structured pruning is an effective solution to compress the parameters of LLMs. Determining the importance of each sub-module in LLMs and minimizing performance loss are critical issues that need to be carefully addressed in structured pruning. In this paper, we propose an effective and fast structured pruning method named SlimLLM for large language models. For channel and attention head pruning, we evaluate importance based on the entire channel or head, rather than merely aggregating the importance of individual elements within a sub-module. This approach enables a more holistic consideration of the interdependencies among elements within the sub-module. In addition, we design a simple linear regression strategy for the output matrix to quickly recover performance. We also propose a layer-based importance ratio to determine the pruning ratio for each layer. Based on the LLaMA benchmark results, our SlimLLM outperforms other methods and achieves state-of-the-art performance.
Paperid:3212
Authors:Jingtan Wang · Xiaoqiang Lin · Rui Qiao · Pang Wei Koh · Chuan-Sheng Foo · Bryan Kian Hsiang Low
Abstract:
Curating data for instruction tuning is crucial for enhancing the performance of large language models (LLMs). This work aims to select training data for instruction tuning to improve LLM performance on specific tasks. Existing methods often rely on next-token prediction (NTP) loss as a proxy for target task performance due to the non-differentiable nature of performance evaluation metrics. They select training data points that are most helpful in reducing validation loss. However, there is a discrepancy between minimizing NTP loss and maximizing performance (e.g., code pass rate in code generation). To remedy this, we introduce a novel Non-differentiable evaluation metric-based InfluenCe Estimation (NICE), which leverages the policy gradient to select the training data that improves performance. Moreover, NICE can perform data selection in the absence of labels (ground-truth responses) when the evaluation metrics do not require them (e.g., a reward model can output reward scores without supervision from labels). Experimental results show that our approach outperforms existing data selection baselines that use NTP loss in diverse and realistic scenarios. Notably, subsets selected by NICE often produce models that outperform those trained on the full dataset.
Paperid:3213
Authors:Fanmeng Wang · Minjie Cheng · Hongteng Xu
Abstract:
Predicting molecular ground-state conformation (i.e., energy-minimized conformation) is crucial for many chemical applications such as molecular docking and property prediction. Classic energy-based simulation is time-consuming when solving this problem, while existing learning-based methods have advantages in computational efficiency but sacrifice accuracy and interpretability. In this work, we propose a novel and effective method to bridge the energy-based simulation and the learning-based strategy, which designs and learns a Wasserstein gradient flow-driven SE(3)-Transformer, called WGFormer, for molecular ground-state conformation prediction. Specifically, our method tackles this task within an auto-encoding framework, which encodes low-quality conformations by the proposed WGFormer and decodes corresponding ground-state conformations by an MLP. The architecture of WGFormer corresponds to Wasserstein gradient flows --- it optimizes molecular conformations by minimizing an energy function defined on the latent mixture models of atoms, thereby significantly improving performance and interpretability. Extensive experiments show that our method consistently outperforms state-of-the-art competitors, providing a new and insightful paradigm to predict molecular ground-state conformation.
Paperid:3214
Authors:Weihan Li · Linyun Zhou · YangJian · Shengxuming Zhang · Xiangtong Du · Xiuming Zhang · Jing Zhang · Chaoqing Xu · Mingli Song · Zunlei Feng
Abstract:
Pathology image segmentation plays a pivotal role in artificial digital pathology diagnosis and treatment. Existing approaches to pathology image segmentation are hindered by labor-intensive annotation processes and limited accuracy in tail-class identification, primarily due to the long-tail distribution inherent in gigapixel pathology images. In this work, we introduce the Laplace Diffusion Model, referred to as L-Diffusion, an innovative framework tailored for efficient pathology image segmentation. L-Diffusion utilizes multiple Laplace distributions, as opposed to Gaussian distributions, to model distinct components --- a methodology supported by theoretical analysis that significantly enhances the decomposition of features within the feature space. A sequence of feature maps is initially generated through a series of diffusion steps. Following this, contrastive learning is employed to refine the pixel-wise vectors derived from the feature map sequence. By utilizing these highly discriminative pixel-wise vectors, the segmentation module achieves a harmonious balance of precision and robustness with remarkable efficiency. Extensive experimental evaluations demonstrate that L-Diffusion attains improvements of up to 7.16\%, 26.74\%, 16.52\%, and 3.55\% on tissue segmentation datasets, and 20.09\%, 10.67\%, 14.42\%, and 10.41\% on cell segmentation datasets, as quantified by DICE, MPA, mIoU, and FwIoU metrics. The source code is available at https://anonymous.4open.science/r/LDiffusion-F027/README.md
Paperid:3215
Authors:Zhenxing Mi · Kuan-Chieh Wang · Guocheng Qian · Hanrong Ye · Runtao Liu · Sergey Tulyakov · Kfir Aberman · Dan Xu
Abstract:
This paper presents ThinkDiff, a novel alignment paradigm that empowers text-to-image diffusion models with multimodal in-context understanding and reasoning capabilities by integrating the strengths of vision-language models (VLMs). Existing multimodal diffusion finetuning methods largely focus on pixel-level reconstruction rather than in-context reasoning, and are constrained by the complexity and limited availability of reasoning-based datasets. ThinkDiff addresses these challenges by leveraging vision-language training as a proxy task, aligning VLMs with the decoder of an encoder-decoder large language model (LLM) instead of a diffusion decoder. This proxy task builds on the observation that the LLM decoder shares the same input feature space with diffusion decoders that use the corresponding LLM encoder for prompt embedding. As a result, aligning VLMs with diffusion decoders can be simplified through alignment with the LLM decoder. Without complex training and datasets, ThinkDiff effectively unleashes understanding, reasoning, and composing capabilities in diffusion models. Experiments demonstrate that ThinkDiff significantly improves accuracy from 19.2% to 46.3% on the challenging CoBSAT benchmark for multimodal in-context reasoning generation, with only 5 hours of training on 4 A100 GPUs. Additionally, ThinkDiff demonstrates exceptional performance in composing multiple images and texts into logically coherent images.
Paperid:3216
Authors:Shuangfei Zhai · Ruixiang Zhang · Preetum Nakkiran · David Berthelot · Jiatao Gu · Huangjie Zheng · Tianrong Chen · Miguel Angel Bautista Martin · Navdeep Jaitly · Joshua M Susskind
Abstract:
Normalizing Flows (NFs) are likelihood-based models for continuous inputs. They have demonstrated promising results on both density estimation and generative modeling tasks, but have received relatively little attention in recent years. In this work, we demonstrate that NFs are more powerful than previously believed. We present TarFlow: a simple and scalable architecture that enables highly performant NF models. TarFlow can be thought of as a Transformer-based variant of Masked Autoregressive Flows (MAFs): it consists of a stack of autoregressive Transformer blocks on image patches, alternating the autoregression direction between layers. TarFlow is straightforward to train end-to-end, and capable of directly modeling and generating pixels. We also propose three key techniques to improve sample quality: Gaussian noise augmentation during training, a post-training denoising procedure, and an effective guidance method for both class-conditional and unconditional settings. Putting these together, TarFlow sets new state-of-the-art results on likelihood estimation for images, beating the previous best methods by a large margin, and generates samples with quality and diversity comparable to diffusion models, for the first time with a stand-alone NF model.
Paperid:3217
Authors:Jaejun Lee · Joyce Whang
Abstract:
Hyper-relational knowledge graphs (HKGs) enrich knowledge graphs by extending a triplet to a hyper-relational fact, where a set of qualifiers adds auxiliary information to a triplet. While many HKG representation learning methods have been proposed, they often fail to effectively utilize the HKG's structure. This paper demonstrates that thoroughly leveraging the structure of an HKG is crucial for reasoning on HKGs, and that a purely structure-based representation learning method can achieve state-of-the-art performance on various link prediction tasks. We propose MAYPL, which learns to initialize representation vectors based on the structure of an HKG and employs attentive neural message passing consisting of fact-level message computation and entity-centric and relation-centric aggregations, thereby computing the representations based solely on the structure. Due to its structure-driven learning, MAYPL can conduct inductive inferences on new entities and relations. MAYPL outperforms 40 knowledge graph completion methods across 10 datasets, with different baseline methods compared on different datasets so that it is tested from diverse perspectives.
Paperid:3218
Authors:Raphael Meyer · William Swartworth · David Woodruff
Abstract:
We study the computational model where we can access a matrix $\mathbf{A}$ only by computing matrix-vector products $\mathbf{A}\mathrm{x}$ for vectors of the form $\mathrm{x} = \mathrm{x}_1 \otimes \cdots \otimes \mathrm{x}_q$. We prove exponential lower bounds on the number of queries needed to estimate various properties, including the trace and the top eigenvalue of $\mathbf{A}$. Our proofs hold for all adaptive algorithms, modulo a mild conditioning assumption on the algorithm's queries. We further prove that algorithms whose queries come from a small alphabet (e.g., $\mathrm{x}_i \in \\{\pm1\\}^n$) cannot test if $\mathbf{A}$ is identically zero with polynomial complexity, despite the fact that a single query using Gaussian vectors solves the problem with probability 1. In sharp contrast to the non-Kronecker case, this shows that sketching $\mathbf{A}$ with different distributions of the same subgaussian norm can yield exponentially different query complexities. Our proofs follow from the observation that random vectors with Kronecker structure have exponentially smaller inner products than their non-Kronecker counterparts.
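As a minimal illustration of this access model (dimensions, the specific matrix, and the helper name `kron_query` are hypothetical, chosen only for the demo), the sketch below forms a Kronecker-structured query $\mathbf{A}(\mathrm{x}_1 \otimes \mathrm{x}_2)$ and shows the Gaussian single-query zero test: for a nonzero $\mathbf{A}$, a query with Gaussian factor vectors returns a nonzero vector almost surely.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 4, 2  # per-mode dimension and Kronecker order (hypothetical sizes)

# A nonzero matrix that we may only observe through queries.
A = np.zeros((n**q, n**q))
A[0, 0] = 1.0

def kron_query(A, factors):
    """One query in the model: A (x_1 ⊗ ... ⊗ x_q)."""
    x = factors[0]
    for f in factors[1:]:
        x = np.kron(x, f)
    return A @ x

# A single query with Gaussian factor vectors detects A != 0
# with probability 1: the response is almost surely nonzero.
g = [rng.standard_normal(n) for _ in range(q)]
assert np.linalg.norm(kron_query(A, g)) > 0
```

The lower bound in the abstract concerns the reverse situation: restricting the factors $\mathrm{x}_i$ to a small alphabet such as $\{\pm 1\}^n$ makes this zero-testing problem exponentially hard, even though the Gaussian version above needs one query.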
Paperid:3219
Authors:Mikhail Mironov · Liudmila Prokhorenkova
Abstract:
The concept of diversity is widely used in various applications: from image or molecule generation to recommender systems. Thus, being able to properly measure diversity is important. This paper addresses the problem of evaluating diversity for a set of objects. First, we make a systematic review of existing diversity measures and explore their undesirable behavior in some cases. Based on this review, we formulate three desirable properties (axioms) of a reliable diversity measure: monotonicity, uniqueness, and continuity. We show that none of the existing diversity measures has all three properties, and thus these measures are not suitable for quantifying diversity. Then, we construct two examples of measures that have all the desirable properties, thus proving that the list of axioms is not self-contradictory. Unfortunately, the constructed examples are too computationally complex for practical use, thus we pose an open problem of constructing a diversity measure that has all the listed properties and can be computed in practice.
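To make the "undesirable behavior" concrete, here is a small sketch (not taken from the paper; the measure and the example set are illustrative assumptions) showing that a common ad-hoc measure, average pairwise distance, can *decrease* when a genuinely new, distinct object is added, which is the kind of counterintuitive behavior an axiomatic treatment aims to rule out.

```python
import itertools
import numpy as np

def avg_pairwise_distance(points):
    """A common ad-hoc diversity measure: mean distance over all pairs."""
    pairs = list(itertools.combinations(points, 2))
    return float(np.mean([abs(a - b) for a, b in pairs]))

base = [0.0, 1.0]
extended = [0.0, 0.5, 1.0]  # add a distinct new object between the others

# The measure drops from 1.0 to 2/3 even though a new object was added,
# an arguably undesirable response for a diversity measure.
assert avg_pairwise_distance(extended) < avg_pairwise_distance(base)
```

Whether this particular behavior violates the paper's exact monotonicity axiom depends on its formal statement; the snippet is only meant to show why naive measures warrant scrutiny.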
Paperid:3220
Authors:Mateusz Olko · Mateusz Gajewski · Joanna Wojciechowska · Mikołaj Morzy · Piotr Sankowski · Piotr Milos
Abstract:
Neural causal discovery methods have recently improved in terms of scalability and computational efficiency. However, our systematic evaluation highlights significant room for improvement in their accuracy when uncovering causal structures. We identify a fundamental limitation: \textit{neural networks cannot reliably distinguish between existing and non-existent causal relationships in the finite sample regime}. Our experiments reveal that neural networks, as used in contemporary causal discovery approaches, lack the precision needed to recover ground-truth graphs, even for small graphs and relatively large sample sizes. Furthermore, we identify the faithfulness property as a critical bottleneck: (i) it is likely to be violated across any reasonable dataset size range, and (ii) its violation directly undermines the performance of neural discovery methods. These findings lead us to conclude that progress within the current paradigm is fundamentally constrained, necessitating a paradigm shift in this domain.
Paperid:3221
Authors:Zhen Xiang · Linzhi Zheng · Yanjie Li · Junyuan Hong · Qinbin Li · Han Xie · Jiawei Zhang · Zidi Xiong · Chulin Xie · Carl Yang · Dawn Song · Bo Li
Abstract:
The rapid advancement of large language model (LLM) agents has raised new concerns regarding their safety and security, which cannot be addressed by traditional LLM guardrails focused on textual harms. We propose GuardAgent, the first guardrail agent to protect other agents by checking whether the agent actions satisfy safety guard requests. Specifically, GuardAgent first analyzes the safety guard requests to generate a task plan, and then converts this plan into guardrail code for execution. In both steps, an LLM is utilized as the reasoning component, supplemented by in-context demonstrations retrieved from a memory module storing information from previous tasks. GuardAgent can understand different safety guard requests and provide reliable code-based guardrails with high flexibility and low operational overhead. In addition, we propose two novel benchmarks: the EICU-AC benchmark to assess access control for healthcare agents and the Mind2Web-SC benchmark to evaluate safety regulations for web agents. We show that GuardAgent effectively moderates the violation actions for two types of agents on these two benchmarks, with over 98% and 83% guardrail accuracies, respectively.
Paperid:3222
Authors:Zhangyi Hu · Jiemin Wu · Hua XU · Mingqian Liao · Ninghui Feng · Bo Gao · Songning Lai · Yutao Yue
Abstract:
Irregular Multivariate Time Series (IMTS) prediction, which suffers from unaligned multi-channel signals and massive missing values, is crucial in various areas such as healthcare and biomechanics. Current IMTS-specific methods model temporal and channel patterns in isolation, which, combined with significant missing values, leads to challenging pattern capturing and poor few-shot capability. While foundation models can leverage similar patterns learned from pre-training to enhance representation learning, they are currently limited to regular Univariate Time Series (UTS). Motivated by the visual Masked AutoEncoder's (MAE) success in RGB image modeling and UTS prediction, we propose \textbf{VIMTS}, a framework that adapts \textbf{V}isual MAE for \textbf{IMTS} prediction. To jointly model temporal-channel patterns, VIMTS first divides the timeline into equal intervals and unaligned sparse signals into image-like patches, enriching channel representations through cross-channel dependencies. The model then leverages MAE's pre-trained capability in handling sparse multichannel data for patch reconstruction, followed by a coarse-to-fine technique for precise predictions. Additionally, we combine supervised fine-tuning with self-supervised learning to enhance IMTS modeling. Extensive experiments demonstrate VIMTS's superior performance and strong few-shot capability, paving the way for applying visual foundation models to more general time series prediction tasks.
Paperid:3223
Authors:Xiaoling Wu · Junpeng Zhu · Zeng Li
Abstract:
Weight matrix compression has been demonstrated to effectively reduce overfitting and improve the generalization performance of deep neural networks. Compression is primarily achieved by filtering out noisy eigenvalues of the weight matrix. In this work, a novel Population Double Bulk (PDB) model is proposed to characterize the eigenvalue behavior of the weight matrix, which is more general than the existing Population Unit Bulk (PUB) model. Based on the PDB model and Random Matrix Theory (RMT), we have discovered a new PDBLS algorithm for determining the boundary between noisy eigenvalues and information. A PDB Noise-Filtering algorithm is further introduced to reduce the rank of the weight matrix for compression. Experiments show that our PDB model fits the empirical distribution of eigenvalues of the weight matrix better than the PUB model, and our compressed weight matrices have lower rank at the same level of test accuracy. In some cases, our compression method can even improve generalization performance when labels contain noise. The code is available at https://anonymous.4open.science/r/PDBLS.
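The rank-reduction step described here can be sketched generically: the snippet below truncates a weight matrix's spectrum at a threshold. The threshold used (the median singular value) is a placeholder assumption, not the paper's RMT-derived PDBLS boundary; only the overall keep-above-the-boundary structure is illustrated.

```python
import numpy as np

def filter_spectrum(W, threshold):
    """Keep only spectral components above a noise/information boundary.
    Generic SVD truncation; the paper derives the boundary from RMT."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    keep = s > threshold
    return (U[:, keep] * s[keep]) @ Vt[keep]

rng = np.random.default_rng(2)
W = rng.standard_normal((64, 32))  # stand-in for a trained weight matrix

# Placeholder boundary: the median singular value (NOT the PDBLS rule).
boundary = np.median(np.linalg.svd(W, compute_uv=False))
W_low = filter_spectrum(W, boundary)

# The filtered matrix has strictly lower rank than the original.
assert np.linalg.matrix_rank(W_low) < np.linalg.matrix_rank(W)
```

In the paper, the interesting part is precisely how to place `boundary` between the noise bulk and the informative outliers; this sketch only shows what happens once that boundary is known.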
Paperid:3224
Authors:Ivan Skorokhodov · Sharath Girish · Benran Hu · Willi Menapace · Yanyu Li · Rameen Abdal · Sergey Tulyakov · Aliaksandr Siarohin
Abstract:
Latent Diffusion Models (LDMs) have emerged as the leading approach for generating high-quality images and videos, utilizing compressed latent representations to reduce the computational cost of diffusion processes. While recent advancements have primarily focused on scaling diffusion backbones and improving autoencoder reconstruction quality, the interaction between these components has received comparatively less attention. In this work, we conduct a spectral analysis of latent spaces in widely used autoencoders and identify a prominent high-frequency component that deviates from natural RGB signals --- an effect that becomes even more pronounced in recent autoencoders with a large number of channels. We hypothesize that this high-frequency component reduces the efficiency of the diffusion process and increases generation complexity. To address this issue, we propose a simple yet effective spectral regularization strategy that aligns latent space representations with RGB signals across frequencies by enforcing scale equivariance in the decoder. Our method requires minimal modifications yet significantly improves generation quality, reducing FID by 19\% for image generation on ImageNet-1K $256^2$ and FVD by at least $44$\% for video generation on Kinetics-700 $17 \times 256^2$ after up to $20,000$ autoencoder fine-tuning steps.
Paperid:3225
Authors:Krzysztof Kacprzyk · Julianna Piskorz · M van der Schaar
Abstract:
While black-box approaches are commonly used for data-driven modeling of dynamical systems, they often obscure a system's underlying behavior and properties, limiting adoption in areas such as medicine and pharmacology. A two-step process of discovering ordinary differential equations (ODEs) and their subsequent mathematical analysis can yield insights into the system's dynamics. However, this analysis may be infeasible for complex equations, and refining the ODE to meet certain behavioral requirements can be challenging. Direct semantic modeling has recently been proposed to address these issues by predicting the system's behavior, such as the trajectory's shape, directly from data, bypassing post-hoc mathematical analysis. In this work, we extend the original instantiation, limited to one-dimensional trajectories and inputs, to accommodate multi-dimensional trajectories with additional personalization, allowing evolution to depend on auxiliary static features (e.g., patient covariates). In a series of experiments, we show how our approach enables practitioners to integrate prior knowledge, understand the dynamics, ensure desired behaviors, and revise the model when necessary.
Paperid:3226
Authors:Alexander Nikulin · Ilya Zisman · Denis Tarasov · Nikita Lyubaykin · Andrei Polubarov · Igor Kiselev · Vladislav Kurenkov
Abstract:
Recently, latent action learning, pioneered by Latent Action Policies (LAPO), has shown remarkable pre-training efficiency on observation-only data, offering potential for leveraging the vast amounts of video available on the web for embodied AI. However, prior work has focused on distractor-free data, where changes between observations are primarily explained by ground-truth actions. Unfortunately, real-world videos contain action-correlated distractors that may hinder latent action learning. Using the Distracting Control Suite (DCS), we empirically investigate the effect of distractors on latent action learning and demonstrate that LAPO struggles in such scenarios. We propose LAOM, a simple LAPO modification that improves the quality of latent actions by 8x, as measured by linear probing. Importantly, we show that providing supervision with ground-truth actions, on as few as 2.5% of the full dataset, during latent action learning improves downstream performance by 4.2x on average. Our findings suggest that integrating supervision during Latent Action Model (LAM) training is critical in the presence of distractors, challenging the conventional pipeline of first learning a LAM and only then decoding from latent to ground-truth actions.
Paperid:3227
Authors:Jacob Bamberger · Benjamin Gutteridge · Scott le Roux · Michael Bronstein · Xiaowen Dong
Abstract:
Long-range graph tasks --- those dependent on interactions between 'distant nodes' --- are an open problem in graph neural network research. Real-world benchmark tasks, especially the Long Range Graph Benchmark (LRGB), have become popular for validating the long-range capability of proposed architectures. However, this is an empirical approach that lacks both robustness and theoretical underpinning; a more principled characterization of the long-range problem is required. To bridge this gap, we introduce a formal range measure for operators on graphs, and validate it with synthetic experiments dependent on both long and short ranges. We then leverage our measure to examine commonly used tasks and architectures, and discuss to what extent they are, in fact, long-range. We believe that our work will aid ongoing efforts at defining and tackling the long-range problem on graphs, and that our range measure will prove a useful tool for assessing new datasets and architectures.
Paperid:3228
Authors:Chunhui Zhang · Sean Dae Houlihan · Kwonjoon Lee · Nakul Agarwal · Zhongyu Ouyang · Soroush Vosoughi · Shao-Yuan Lo
Abstract:
Theory-of-mind (ToM) enables humans to infer mental states --- such as beliefs, desires, and intentions --- forming the foundation of social cognition. Existing computational ToM methods rely on structured workflows with ToM-specific priors or deep model fine-tuning, but struggle with scalability in multimodal environments. They remain trapped within the gravitational pull of multi-step planning complexity, failing to generalize as task demands increase. To overcome these limitations, we propose a scalable Bayesian ToM planner. It breaks down ToM complexity into stepwise Bayesian updates. Meanwhile, weak-to-strong control specializes smaller LMs to refine ToM-specific likelihood estimation, transferring their ToM reasoning behavior to larger LMs (7B to 405B) for social and world knowledge integration. This synergistic approach enables scalability, aligning large-model inference of human mental states with Bayesian principles. Extensive experiments demonstrate a 4.6% improvement in accuracy over state-of-the-art methods on multimodal ToM benchmarks, including unseen scenarios, establishing a new standard for modeling human mental states in complex environments. Our code and datasets are available at: https://anonymous.4open.science/r/scale-bayesian-tom-248B
Paperid:3229
Authors:Mingyang Sun · Pengxiang Ding · Weinan Zhang · Donglin Wang
Abstract:
Diffusion policies have shown promise in learning complex behaviors from demonstrations, particularly for tasks requiring precise control and long-term planning. However, they face challenges in robustness when encountering distribution shifts. This paper explores improving diffusion-based imitation learning models through online interactions with the environment. We propose OTPR (Optimal Transport-guided score-based diffusion Policy for Reinforcement learning fine-tuning), a novel method that integrates diffusion policies with RL using optimal transport theory. OTPR leverages the Q-function as a transport cost and views the policy as an optimal transport map, enabling efficient and stable fine-tuning. Moreover, we introduce masked optimal transport to guide state-action matching using expert keypoints and a compatibility-based resampling strategy to enhance training stability. Experiments on three simulation tasks demonstrate OTPR's superior performance and robustness compared to existing methods, especially in complex and sparse-reward environments. In sum, OTPR provides an effective framework for combining IL and RL, achieving versatile and reliable policy learning.
Paperid:3230
Authors:Urszula Julia Komorowska · Francisco Vargas · Alessandro Rondina · Pietro Lió · Mateja Jamnik
Abstract:
A protein's backbone flexibility is a crucial property that heavily influences its functionality. Recent work in the field of protein diffusion probabilistic modelling has leveraged Normal Mode Analysis (NMA) and, for the first time, introduced information about large-scale protein motion into the generative process. However, obtaining molecules with both the desired dynamics and designable quality has proven challenging. In this work, we present NMA-tune, a new method that introduces the dynamics information at the protein design stage. NMA-tune uses a trainable component to condition the backbone generation on the lowest normal mode of oscillation. We implement NMA-tune as a plug-and-play extension to RFdiffusion, show that the proportion of samples with high-quality structure and the desired dynamics is improved as compared to other methods without the trainable component, and show the presence of the targeted modes in Molecular Dynamics simulations.
Paperid:3231
Authors:Longzhu He · Chaozhuo Li · Peng Tang · Sen Su
Abstract:
Graph Neural Networks (GNNs) have demonstrated superior performance in a variety of graph mining and learning tasks. However, when node representations involve sensitive personal information or variables related to individuals, learning from graph data can raise significant privacy concerns. Although recent studies have explored local differential privacy (LDP) to address these concerns, they often introduce significant distortions to graph data, severely degrading learning utility (e.g., classification accuracy). In this paper, we present UPGNet, an LDP-based privacy-preserving graph learning framework that enhances utility while safeguarding user privacy. Specifically, we propose a three-stage pipeline that generalizes the LDP protocols for node features, targeting privacy-sensitive scenarios. Our analysis identifies two key factors that affect the utility of privacy-preserving graph learning: feature dimension and neighborhood size. Based on this analysis, UPGNet enhances utility by introducing two core layers: the High-Order Aggregator (HOA) layer and the Node Feature Regularization (NFR) layer. Extensive experiments on real-world datasets indicate that UPGNet significantly outperforms existing methods in terms of both privacy protection and learning utility.
Paperid:3232
Authors:Paul Mangold · Alain Oliviero Durmus · Aymeric Dieuleveut · Eric Moulines
Abstract:
This paper proposes a novel analysis for the Scaffold algorithm, a popular method for dealing with data heterogeneity in federated learning. While its convergence in deterministic settings --- where local control variates mitigate client drift --- is well established, the impact of stochastic gradient updates on its performance is less understood. To address this problem, we first show that its global parameters and control variates define a Markov chain that converges to a stationary distribution in the Wasserstein distance. Leveraging this result, we prove that Scaffold achieves linear speed-up in the number of clients up to higher-order terms in the step size. Nevertheless, our analysis reveals that Scaffold retains a higher-order bias, similar to FedAvg, that does not decrease as the number of clients increases. This highlights opportunities for developing improved stochastic federated learning algorithms.
Paperid:3233
Authors:Hui Zeng · Wenke Huang · Tongqing Zhou · Xinyi Wu · Guancheng Wan · Yingwen Chen · CAI ZHIPING
Abstract:
Turning multi-round vanilla Federated Learning into one-shot FL (OFL) significantly reduces the communication burden and makes a big leap toward practical deployment. However, as this work empirically and theoretically unravels, existing OFL falls into a garbage-in (inconsistent one-shot local models), garbage-out (degraded global model) pitfall. The inconsistency manifests as divergent feature representations and sample predictions. This work presents a novel OFL framework, FAFI, that enhances one-shot training on the client side to essentially overcome inferior local uploading. Specifically, unsupervised feature alignment and category-wise prototype learning are adopted for clients' local training to be consistent in representing local samples. On this basis, FAFI uses informativeness-aware feature fusion and prototype aggregation for global inference. Extensive experiments on three datasets demonstrate the effectiveness of FAFI, which achieves superior performance compared with 11 OFL baselines (+10.86% accuracy).
Paperid:3234
Authors:Jakob Burkhardt · Hannah Keller · Claudio Orlandi · Chris Schwiegelshohn
Abstract:
We introduce the linear-transformation model, a distributed model of differentially private data analysis. Clients have access to a trusted platform capable of applying a public matrix to their inputs. Such computations can be securely distributed across multiple servers using simple and efficient secure multiparty computation techniques. The linear-transformation model serves as an intermediate model between the highly expressive central model and the minimal local model. In the central model, clients have access to a trusted platform capable of applying any function to their inputs. However, this expressiveness comes at a cost, as it is often expensive to distribute such computations, leading to the central model typically being implemented by a single trusted server. In contrast, the local model assumes no trusted platform, which forces clients to add significant noise to their data. The linear-transformation model avoids the single point of failure for privacy present in the central model, while also mitigating the high noise required in the local model. We demonstrate that linear transformations are very useful for differential privacy, allowing for the computation of linear sketches of input data. These sketches largely preserve utility for tasks such as private low-rank approximation and private ridge regression, while introducing only minimal error, critically independent of the number of clients.
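A minimal sketch of the kind of computation this model enables (all dimensions, norm bounds, and the use of the standard Gaussian mechanism here are illustrative assumptions, not the paper's constructions): the platform applies a public matrix $S$ to the sum of client inputs and adds noise calibrated to how much one client can shift the sketch, so the noise scale does not grow with the number of clients.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: n clients each hold a d-dimensional vector with
# L2 norm at most 1; the trusted platform may apply a *public* matrix S.
n_clients, d, k = 1000, 50, 10
X = rng.standard_normal((n_clients, d))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))

S = rng.standard_normal((k, d)) / np.sqrt(k)  # public sketching matrix

# Gaussian mechanism on the linear sketch S @ sum_i x_i.  Replacing one
# client's vector moves the sketch by at most ||S||_2 * 2 in L2, so the
# noise scale depends on S and (eps, delta), but NOT on n_clients.
eps, delta = 1.0, 1e-6
sensitivity = 2.0 * np.linalg.norm(S, 2)  # spectral-norm bound
sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / eps

sketch = S @ X.sum(axis=0)
private_sketch = sketch + rng.normal(scale=sigma, size=k)
```

Because the added error is fixed while the un-noised sketch typically grows with the aggregate signal, utility improves as more clients participate, which is the intuition behind "error independent of the number of clients".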
Paperid:3235
Authors:Ziyan Wang · Zhicheng Zhang · Fei Fang · Yali Du
Abstract:
Designing effective reward functions in multi-agent reinforcement learning (MARL) is a significant challenge, often leading to suboptimal or misaligned behaviors in complex, coordinated environments. We introduce Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality ($\text{M}^3\text{HF}$), a novel framework that integrates multi-phase human feedback of mixed quality into the MARL training process. By involving humans with diverse expertise levels to provide iterative guidance, $\text{M}^3\text{HF}$ leverages both expert and non-expert feedback to continuously refine agents' policies. During training, we strategically pause agent learning for human evaluation, parse feedback using large language models to assign it appropriately, and update reward functions through predefined templates and adaptive weights using weight decay and performance-based adjustments. Our approach enables the integration of nuanced human insights across various levels of quality, enhancing the interpretability and robustness of multi-agent cooperation. Empirical results in challenging environments demonstrate that $\text{M}^3\text{HF}$ significantly outperforms state-of-the-art methods, effectively addressing the complexities of reward design in MARL and enabling broader human participation in the training process.
Paperid:3236
Authors:Narayan Srinivasan · Matthew Sutton · Christopher Drovandi · Leah South
Abstract:
We propose a novel method for measuring the discrepancy between a set of samples and a desired posterior distribution for Bayesian inference. Classical methods for assessing sample quality, like the effective sample size, are not appropriate for scalable Bayesian sampling algorithms, such as stochastic gradient Langevin dynamics, that are asymptotically biased. Instead, the gold standard is to use the kernel Stein discrepancy (KSD), which is itself not scalable given its quadratic cost in the number of samples. The KSD and its faster extensions also typically suffer from the curse of dimensionality and can require extensive tuning. To address these limitations, we develop the polynomial Stein discrepancy (PSD) and an associated goodness-of-fit test. While the new test is not fully convergence-determining, we prove that it detects differences in the first $r$ moments in the Bernstein-von Mises limit. We empirically show that the test has higher power than its competitors in several examples, and at a lower computational cost. Finally, we demonstrate that the PSD can assist practitioners in selecting hyper-parameters of Bayesian sampling algorithms more efficiently than competitors.
Paperid:3237
Authors:Matteo Sesia · Vladimir Svetnik
Abstract:
We present a conformal inference method for constructing lower prediction bounds for survival times from right-censored data, extending recent approaches designed for more restrictive type-I censoring scenarios. The proposed method imputes unobserved censoring times using a machine learning model, and then analyzes the imputed data using a survival model calibrated via weighted conformal inference. This approach is theoretically supported by an asymptotic double robustness property. Empirical studies on simulated and real data demonstrate that our method leads to relatively informative predictive inferences and is especially robust in challenging settings where the survival model may be inaccurate.
Paperid:3238
Authors:Thomas Pethick · Wanyun Xie · Kimon Antonakopoulos · Zhenyu Zhu · Antonio Silveti-Falls · Volkan Cevher
Abstract:
In this work, we study optimization methods that leverage the linear minimization oracle (LMO) over a norm ball. We propose a new stochastic family of algorithms that uses the LMO to adapt to the geometry of the problem and, perhaps surprisingly, show that they can be applied to unconstrained problems. The resulting update rule unifies several existing optimization methods under a single framework. Furthermore, we propose an explicit choice of norm for deep architectures, which, as a side benefit, leads to the transferability of hyperparameters across model sizes. Experimentally, we demonstrate significant speedups on nanoGPT training without any reliance on Adam. The proposed method is memory-efficient, requiring only one set of model weights and one set of gradients, which can be stored in half-precision.
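To make the LMO idea concrete, here is a minimal hedged sketch (not the paper's algorithm): over the $\ell_\infty$ ball, the linear minimization oracle applied to the gradient reduces to a sign-descent step, which can be used directly on an unconstrained problem:

```python
import numpy as np

def lmo_linf(g, radius=1.0):
    # LMO over the l-inf ball: argmin_{||s||_inf <= radius} <g, s> = -radius * sign(g)
    return -radius * np.sign(g)

def lmo_step(x, grad, lr=0.05, radius=1.0):
    # Unconstrained update that replaces the raw gradient with the LMO direction.
    return x + lr * lmo_linf(grad, radius)

# Minimize f(x) = ||x||^2 / 2, whose gradient is x itself.
x = np.array([3.0, -2.0, 0.5])
for _ in range(100):
    x = lmo_step(x, x)
```

Choosing a different norm ball changes the geometry of the step; the $\ell_2$ ball, for instance, would recover normalized gradient descent.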
Paperid:3239
Authors:Ricardo Buitrago Ruiz · Albert Gu
Abstract:
Recently, recurrent models such as state space models and linear attention have become popular due to their linear complexity in the sequence length. Thanks to their recurrent nature, in principle they can process arbitrarily long sequences, but their performance sometimes drops considerably beyond their training context lengths, i.e., they fail to length generalize. In this work, we provide comprehensive empirical and theoretical analysis to support the _unexplored states hypothesis_, which posits that models fail to length generalize when during training they are only exposed to a limited subset of the distribution of all _attainable_ states (i.e., states that would be attained if the recurrence was applied to long sequences). Furthermore, we investigate simple training interventions that aim to increase the coverage of the states that the model is trained on, e.g., by initializing the state with Gaussian noise or with the final state of a different input sequence. With only 500 post-training steps, these interventions enable length generalization for sequences that are orders of magnitude longer than the training context (e.g., $2k\longrightarrow 128k$), thus presenting a simple and efficient way to enable robust length generalization in general recurrent models.
Paperid:3240
Authors:İlker Işık · Ramazan Gokberk Cinbis · Ebru Gol
Abstract:
Language models lack the notion of interchangeable tokens—symbols that are semantically equivalent yet distinct, such as bound variables in formal logic. This limitation prevents generalization to larger vocabularies and hinders the model's ability to recognize alpha-equivalence, where renaming bound variables preserves meaning. In this work, we formalize the corresponding machine learning problem, and introduce alpha-covariance, a metric for evaluating robustness to such transformations. To tackle this task, we propose a dual-part token embedding strategy: a shared component ensures semantic consistency, while a randomized component maintains token distinguishability. Compared to a baseline that relies on alpha-renaming for data augmentation, our approach demonstrates improved generalization to unseen tokens in linear temporal logic solving, propositional logic assignment prediction, and copying with an extendable vocabulary, while introducing a favorable inductive bias for alpha-equivalence. Our findings establish a foundation for designing language models that can learn interchangeable token representations, a crucial step toward more flexible and systematic reasoning in formal domains.
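A minimal sketch of the dual-part embedding idea (dimensions and construction are illustrative assumptions, not the paper's exact design): interchangeable tokens share one semantic component, while a freshly sampled random component keeps them distinguishable:

```python
import numpy as np

rng = np.random.default_rng(0)

d_shared, d_rand = 8, 8
# One shared component per token class (here: "bound variable").
shared_var_component = rng.standard_normal(d_shared)

def embed_variable():
    # Shared part encodes the class semantics; random part distinguishes tokens.
    return np.concatenate([shared_var_component, rng.standard_normal(d_rand)])

x, y = embed_variable(), embed_variable()
```

Two fresh variables then agree on their semantic half but differ on their random half, so renaming one variable to another leaves the shared semantics intact.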
Paperid:3241
Authors:Anqi Mao · Mehryar Mohri · Yutao Zhong
Abstract:
The problem of learning to defer with multiple experts consists of optimally assigning input instances to experts, balancing the tradeoff between their accuracy and computational cost. This is a critical challenge in natural language generation, but also in other fields such as image processing and medical diagnostics. Recent studies have proposed surrogate loss functions to optimize deferral, but challenges remain in ensuring their consistency properties. This paper introduces novel surrogate loss functions and efficient algorithms with strong theoretical learning guarantees. We address open questions regarding realizable $H$-consistency, $H$-consistency bounds, and Bayes-consistency for both single-stage (jointly learning predictor and deferral function) and two-stage (learning only the deferral function with a fixed expert) learning scenarios. For single-stage deferral, we introduce a family of new realizable $H$-consistent surrogate losses and further prove $H$-consistency for a selected member. For two-stage deferral, we derive new surrogate losses that achieve realizable $H$-consistency, $H$-consistency bounds, and Bayes-consistency for the two-expert scenario and, under natural assumptions, the multiple-expert scenario. Additionally, we provide enhanced theoretical guarantees under low-noise assumptions for both scenarios. Finally, we report the results of experiments using our proposed surrogate losses, comparing their performance against existing baselines.
Paperid:3242
Authors:Shang Liu · Yu Pan · Guanting Chen · Xiaocheng Li
Abstract:
The canonical setup of learning a reward model (RM) from human preferences with binary feedback discards potentially useful samples (such as "tied" between the two responses) and loses fine-grained information (such as "slightly better"). This paper proposes a framework for learning RMs under ordinal feedback, generalizing binary feedback to arbitrary granularity. We first identify a marginal unbiasedness condition, which generalizes the existing assumption of the binary feedback. The condition is validated via the sociological concept called the "wisdom of the crowd". Under this condition, we develop a natural probability model and prove the benefits of fine-grained feedback in terms of reducing the Rademacher complexity, which may be of independent interest to another problem: the bias-variance trade-off in knowledge distillation. The framework also sheds light on designing guidelines for human annotators. Our numerical experiments validate that: (1) fine-grained feedback leads to better RM learning for both in- and out-of-distribution settings; (2) incorporating a certain proportion of tied samples boosts RM learning.
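One simple way to picture ordinal feedback is a Bradley-Terry-style loss with soft targets; the sketch below is a hedged illustration (the label-to-target mapping and the loss form are assumptions for exposition, not the paper's probability model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Binary feedback uses targets y in {0, 1}; ordinal feedback maps finer
# labels to soft targets in [0, 1] (illustrative values).
ORDINAL_TARGETS = {"much worse": 0.0, "worse": 0.25, "tie": 0.5,
                   "better": 0.75, "much better": 1.0}

def ordinal_rm_loss(r_a, r_b, label):
    # r_a, r_b: scalar reward-model scores for responses A and B.
    y = ORDINAL_TARGETS[label]
    p = sigmoid(r_a - r_b)   # model's probability that A is preferred
    eps = 1e-12
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```

Under this sketch a "tie" pulls the two scores together, so tied samples carry training signal instead of being discarded.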
Paperid:3243
Authors:Juno Kim · Denny Wu · Jason Lee · Taiji Suzuki
Abstract:
A key paradigm to improve the reasoning capabilities of large language models (LLMs) is to allocate more inference-time compute to search against a verifier or reward model. This process can then be utilized to refine the pretrained model or distill its reasoning patterns into more efficient models. In this paper, we study inference-time computation by viewing chain-of-thought (CoT) generation as a metastable Markov process: easy reasoning steps (e.g., algebraic manipulations) form densely connected clusters, while hard reasoning steps (e.g., applying a relevant theorem) create sparse, low-probability edges between clusters, leading to phase transitions at longer timescales. Under this framework, we prove that implementing a search protocol that rewards sparse edges improves CoT by decreasing the expected number of steps to reach different clusters. In contrast, we establish a limit on reasoning capability when the model is restricted to local information of the pretrained graph. We also show that the information gained by search can be utilized to obtain a better reasoning model: (1) the pretrained model can be directly finetuned to favor sparse edges via policy gradient methods, and moreover (2) a compressed \emph{metastable representation} of the reasoning dynamics can be distilled into a smaller, more efficient model.
Paperid:3244
Authors:Santhosh Karnik · Anna Veselovska · Mark Iwen · Felix Krahmer
Abstract:
We provide a rigorous analysis of implicit regularization in an overparametrized tensor factorization problem beyond the lazy training regime. For matrix factorization problems, this phenomenon has been studied in a number of works. A particular challenge has been to design universal initialization strategies which provably lead to implicit regularization in gradient-descent methods. At the same time, it has been argued by Cohen et al. (2016) that more general classes of neural networks can be captured by considering tensor factorizations. However, in the tensor case, implicit regularization has only been rigorously established for gradient flow or in the lazy training regime. In this paper, we prove the first tensor result of its kind for gradient descent rather than gradient flow. We focus on the tubal tensor product and the associated notion of low tubal rank, encouraged by the relevance of this model for image data. We establish that gradient descent in an overparametrized tensor factorization model with a small random initialization exhibits an implicit bias towards solutions of low tubal rank. Our theoretical findings are illustrated in an extensive set of numerical simulations showcasing the dynamics predicted by our theory as well as the crucial role of using a small random initialization.
Paperid:3245
Authors:Huanzhi Mao · Fanjia Yan · Charlie Ji · Vishnu Suresh · Shishir G. Patil · Ion Stoica · Joseph E Gonzalez
Abstract:
Function calling, also known as tool use, refers to an LLM's ability to invoke external functions, APIs, or predefined tools in response to user queries—an essential capability for agentic LLM applications. Despite the importance of function calling, there isn't a standard benchmark to evaluate function calling abilities in a wide range of real-world settings. We present the Function Calling Leaderboard (FCL), a comprehensive benchmark designed to evaluate function calling capabilities in a wide range of settings. The FCL benchmark evaluates serial and parallel function calls, across various programming languages, using a novel Abstract Syntax Tree (AST) evaluation method that can easily scale to thousands of functions. We construct the benchmark using a combination of manually curated and user-contributed tasks. Finally, the FCL benchmark evaluates the ability of models to abstain and reason in stateful multi-step agentic settings. Evaluating a large range of models, we find interesting trends in model function calling abilities. With FCL, we are able to assess the function-calling capabilities of different models across various categories, compare their performance in both prompting and native function-calling modes, and quantify potential training data contamination.
Paperid:3246
Authors:Joan Serrà · Recep Oguz Araz · Dmitry Bogdanov · Yuki Mitsufuji
Abstract:
Detecting musical versions (different renditions of the same piece) is a challenging task with important applications. Because of the nature of the available ground truth, existing approaches match musical versions at the track level (e.g., whole song). However, most applications require matching them at the segment level (e.g., 20s chunks). In addition, existing approaches resort to classification and triplet losses, disregarding more recent losses that could bring meaningful improvements. In this paper, we propose a method to learn from weakly annotated segments, together with a contrastive loss variant that outperforms well-studied alternatives. The former is based on pairwise segment distance reductions, while the latter modifies an existing loss following decoupling, hyper-parameter, and geometric considerations. With these two elements, we not only achieve state-of-the-art results in the standard track-level evaluation, but also obtain a breakthrough performance in a segment-level evaluation. We believe that, due to the generality of the challenges addressed here, the proposed methods may find utility in domains beyond audio or musical version matching.
Paperid:3247
Authors:Zhilong Zhang · Tian Xu · Xinghao Du · Xingchen Cao · Yihao Sun · Yang Yu
Abstract:
In sequential decision-making, the reward function serves as the primary supervision signal, guiding agents to acquire the desired behaviors. Traditional reward modeling methods rely heavily on human expertise, limiting their scalability. Automated preference generation from suboptimal demonstrations has emerged as a promising alternative to address this limitation. This approach first generates preference data from suboptimal demonstrations and then trains reward models based on these preferences. Despite its potential, existing methods often struggle to generate preference data with sufficient coverage, limiting the accuracy and generalizability of the resulting reward models. To overcome this limitation, we propose APEC (Automated Preference generation with Enhanced Coverage), a novel method that improves the coverage of preference data. APEC achieves this by selecting policy pairs with significantly different iteration indices from the whole adversarial imitation learning process. We provide a theoretical analysis to validate that the selected policy pairs provably hold preference relationships. Experimental results demonstrate that APEC consistently outperforms baseline methods in generating preferences with broader coverage across both vector-based and pixel-based control tasks. Consequently, the reward models trained with APEC align more closely with ground-truth rewards, yielding improved policy performance.
Paperid:3248
Authors:Dantong Niu · Yuvan Sharma · Haoru Xue · Giscard Biamby · Junyi Zhang · Ziteng Ji · Trevor Darrell · Roi Herzig
Abstract:
Foundation models pretrained on massive unlabeled datasets have revolutionized natural language and computer vision, exhibiting remarkable generalization capabilities, thus highlighting the importance of pre-training. Yet, efforts in robotics have struggled to achieve similar success, limited by either the need for costly robotic annotations or the lack of representations that effectively model the physical world. In this paper, we introduce ARM4R, an Auto-regressive Robotic Model that leverages low-level 4D Representations learned from human video data to yield a better pre-trained robotic model. Specifically, we focus on utilizing 3D point tracking representations from videos derived by lifting 2D representations into 3D space via monocular depth estimation across time. These 4D representations maintain a shared geometric structure between the points and robot state representations up to a linear transformation, enabling efficient transfer learning from human video data to low-level robotic control. Our experiments show that ARM4R can transfer efficiently from human video data to robotics and consistently improves performance on tasks across various robot environments and configurations.
Paperid:3249
Authors:Daria Lioubashevski · Tomer Schlank · Gabriel Stanovsky · Ariel Goldstein
Abstract:
Uncovering the inner mechanisms of Transformer models offers insights into how they process and represent information. In this work, we analyze the computation performed by Transformers in the layers after the top-1 prediction remains fixed, known as the “saturation event”. We expand this concept to top-k tokens, demonstrating that similar saturation events occur across language, vision, and speech models. We find that these events occur in order of the corresponding tokens’ ranking, i.e., the model first decides on the top-ranking token, then the second-highest-ranking token, and so on. This phenomenon seems intrinsic to the Transformer architecture, occurring across different variants, and even in untrained Transformers. We propose that these events reflect task transitions, where determining each token corresponds to a discrete task. We show that it is possible to predict the current task from hidden-layer embeddings, and demonstrate that we can cause the model to switch to the next task via intervention. Leveraging our findings, we introduce a token-level early-exit strategy, surpassing existing methods in balancing performance and efficiency, and show how to exploit saturation events for better language modeling.
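The early-exit idea can be sketched as follows (illustrative code, not the paper's implementation): given each layer's top-1 token id, the saturation event is the first layer after which the prediction never changes, and later layers could in principle be skipped:

```python
def saturation_layer(top1_per_layer):
    # First index i such that the top-1 prediction is already fixed at
    # the final answer for all layers >= i.
    final = top1_per_layer[-1]
    for i, tok in enumerate(top1_per_layer):
        if all(t == final for t in top1_per_layer[i:]):
            return i
    return len(top1_per_layer) - 1

# Toy per-layer top-1 token ids: the prediction saturates at layer 3.
exit_at = saturation_layer([7, 7, 3, 5, 5, 5, 5, 5])
```

In practice an exit rule cannot peek at later layers, so a deployed strategy would instead predict saturation from the current hidden state, as the abstract describes.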
Paperid:3250
Authors:Shimin Zhang · Ziyuan Ye · Yinsong Yan · Zeyang Song · Yujie Wu · Jibin Wu
Abstract:
Determining the similarity between dynamical systems remains a long-standing challenge in both machine learning and neuroscience. Recent works based on Koopman operator theory have proven effective in analyzing dynamical similarity by examining discrepancies in the Koopman spectrum. However, they are limited in analyzing the similarity between highly nonlinear systems with cross-scale coupled dynamics, primarily due to the absence of an effective mechanism for capturing multi-scale dynamics when approximating the Koopman spectrum. To overcome this limitation, we propose a novel framework named KoopSTD that approximates the Koopman Spectrum of dynamical systems with Timescale Decoupling. To achieve a precise description of the underlying dynamics, KoopSTD further filters out spurious modes, thereby ensuring a faithful metric of dynamical similarity. Our extensive experiments on physical and neural systems validate the effectiveness, scalability, and robustness of the proposed KoopSTD, demonstrating its superiority over existing similarity metrics. Finally, we apply KoopSTD to explore two open-ended research questions in neuroscience and large language models, highlighting its potential to facilitate future scientific and engineering discoveries.
Paperid:3251
Authors:Evripidis Bampis · Bruno Escoffier · Dimitris Fotakis · Panagiotis Patsilinakos · Michalis Xefteris
Abstract:
We consider a learning-augmented framework for NP-hard permutation problems. The algorithm has access to predictions telling, given a pair $u,v$ of elements, whether $u$ is before $v$ or not in an optimal solution. Building on the work of Braverman and Mossel (SODA 2008), we show that for a class of optimization problems including scheduling, network design and other graph permutation problems, these predictions allow solving them in polynomial time with high probability, provided that predictions are true with probability at least $1/2+\epsilon$. Moreover, this can be achieved with a parsimonious access to the predictions.
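To give a feel for such pairwise predictions (this is a naive win-counting heuristic for exposition, not the algorithm of the paper or of Braverman and Mossel): query the noisy pairwise oracle on every pair and rank elements by the number of comparisons they win:

```python
def rank_by_wins(items, oracle):
    # oracle(u, v) is a (possibly noisy) prediction of "u is before v".
    wins = {u: 0 for u in items}
    for i, u in enumerate(items):
        for v in items[i + 1:]:
            if oracle(u, v):
                wins[u] += 1
            else:
                wins[v] += 1
    # More wins = earlier in the recovered permutation.
    return sorted(items, key=lambda u: -wins[u])

# With a perfect oracle the true order is recovered exactly.
order = rank_by_wins([3, 1, 2], lambda u, v: u < v)
```

With an oracle that is correct only with probability $1/2+\epsilon$, aggregating many independent comparisons per element is what lets the ranking concentrate around the true order.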
Paperid:3252
Authors:Wei Zhuo · Han Yu · Guang Tan · Xiaoxiao Li
Abstract:
Graph Neural Networks (GNNs) have shown remarkable success in learning from graph-structured data. However, their application to directed graphs (digraphs) presents unique challenges, primarily due to the inherent asymmetry in node relationships. Traditional GNNs are adept at capturing unidirectional relations but fall short in encoding the mutual path dependencies between nodes, such as asymmetrical shortest paths typically found in digraphs. Recognizing this gap, we introduce Commute Graph Neural Networks (CGNN), an approach that seamlessly integrates node-wise commute time into the message passing scheme. The cornerstone of CGNN is an efficient method for computing commute time using a newly formulated digraph Laplacian. Commute time is then integrated into the neighborhood aggregation process, with neighbor contributions weighted according to their respective commute time to the central node in each layer. This enables CGNN to directly capture the mutual, asymmetric relationships in digraphs. Extensive experiments confirm the superior performance of CGNN. Source code of CGNN is anonymously available here.
Paperid:3253
Authors:Ismail Alkhouri · Shijun Liang · Cheng-Han Huang · Jimmy Dai · Qing Qu · Saiprasad Ravishankar · Rongrong Wang
Abstract:
Diffusion models (DMs) are a class of generative models that allow sampling from a distribution learned over a training set. When applied to solving inverse imaging problems (IPs), the reverse sampling steps of DMs are typically modified to approximately sample from a measurement-conditioned distribution in the image space. However, these modifications may be unsuitable for certain settings (such as in the presence of measurement noise) and non-linear tasks, as they often struggle to correct errors from earlier sampling steps and generally require a large number of optimization and/or sampling steps. To address these challenges, we state three conditions for achieving measurement-consistent diffusion trajectories. Building on these conditions, we propose a new optimization-based sampling method that not only enforces the standard data manifold measurement consistency and forward diffusion consistency, as seen in previous studies, but also incorporates step-wise and network-regularized backward diffusion consistency that maintains a diffusion trajectory by optimizing over the input of the pre-trained model at every sampling step. By enforcing these conditions, either implicitly or explicitly, our sampler requires significantly fewer reverse steps. Therefore, we refer to our accelerated method as \textbf{S}tep-w\textbf{i}se \textbf{T}riple-\textbf{Co}nsistent Sa\textbf{m}pling (\textbf{SITCOM}). Compared to existing state-of-the-art baseline methods, under different levels of measurement noise, our extensive experiments across six linear and three non-linear image restoration and reconstruction tasks demonstrate that SITCOM achieves competitive or superior results in terms of standard image similarity metrics and required run-time.
Paperid:3254
Authors:Yilin Ye · Junchao Huang · Xingchen ZENG · Jiazhi Xia · Wei Zeng
Abstract:
Cross-modal embeddings form the foundation for multi-modal models. However, visualization methods for interpreting cross-modal embeddings have been primarily confined to traditional dimensionality reduction (DR) techniques like PCA and t-SNE. These DR methods primarily focus on feature distributions within a single modality, whilst failing to incorporate metrics (e.g., CLIPScore) across multiple modalities. This paper introduces AKRMap, a new DR technique designed to visualize cross-modal embedding metrics with enhanced accuracy by learning kernel regression of the metric landscape in the projection space. Specifically, AKRMap constructs a supervised projection network guided by a post-projection kernel regression loss, and employs adaptive generalized kernels that can be jointly optimized with the projection. This approach enables AKRMap to efficiently generate visualizations that capture complex metric distributions, while also supporting interactive features such as zoom and overlay for deeper exploration. Quantitative experiments demonstrate that AKRMap outperforms existing DR methods in generating more accurate and trustworthy visualizations. We further showcase the effectiveness of AKRMap in visualizing and comparing cross-modal embeddings for text-to-image models.
Paperid:3255
Authors:Shib S Dasgupta · Michael Boratko · Andrew McCallum
Abstract:
Personalized item recommendation typically suffers from data sparsity, which is most often addressed by learning vector representations of users and items via low-rank matrix factorization. While this effectively densifies the matrix by assuming users and movies can be represented by linearly dependent latent features, it does not capture more complicated interactions. For example, vector representations struggle with set-theoretic relationships, such as negation and intersection, e.g. recommending a movie that is “comedy and action, but not romance”. In this work, we formulate the problem of personalized item recommendation as matrix completion where rows are set-theoretically dependent. To capture this set-theoretic dependence we represent each user and attribute by a hyperrectangle or box (i.e. a Cartesian product of intervals). Box embeddings can intuitively be understood as trainable Venn diagrams, and thus not only inherently represent similarity (via the Jaccard index), but also naturally and faithfully support arbitrary set-theoretic relationships. Queries involving set-theoretic constraints can be efficiently computed directly on the embedding space by performing geometric operations on the representations. We empirically demonstrate the superiority of box embeddings over vector-based neural methods on both simple and complex item recommendation queries by up to 30% overall.
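The geometric intuition can be sketched directly (a hedged toy example; real box embeddings are trained, and these coordinates are made up): each entity is an axis-aligned box, and similarity is the Jaccard index computed from intersection and union volumes:

```python
import numpy as np

def volume(lo, hi):
    # Volume of the box [lo, hi]; empty if any side is non-positive.
    return float(np.prod(np.clip(hi - lo, 0.0, None)))

def jaccard(box_a, box_b):
    (lo_a, hi_a), (lo_b, hi_b) = box_a, box_b
    # Intersection of axis-aligned boxes is itself an axis-aligned box.
    inter = volume(np.maximum(lo_a, lo_b), np.minimum(hi_a, hi_b))
    union = volume(lo_a, hi_a) + volume(lo_b, hi_b) - inter
    return inter / union if union > 0 else 0.0

# Illustrative 2D attribute boxes (coordinates are invented for the demo).
comedy = (np.array([0.0, 0.0]), np.array([2.0, 1.0]))
action = (np.array([1.0, 0.0]), np.array([3.0, 1.0]))
```

A query like "comedy and action" then corresponds to intersecting the two boxes, and "not romance" to excluding a third box, all computed with the same min/max operations.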
Paperid:3256
Authors:Enze Xie · Junsong Chen · Yuyang Zhao · Jincheng YU · Ligeng Zhu · Yujun Lin · Zhekai Zhang · Muyang Li · Junyu Chen · Han Cai · Bingchen Liu · Zhou Daquan · Song Han
Abstract:
This paper presents SANA-1.5, a linear Diffusion Transformer for efficient scaling in text-to-image generation. Building upon SANA-1.0, we introduce three key innovations: (1) Efficient Training Scaling: a depth-growth paradigm that enables scaling from 1.6B to 4.8B parameters with significantly reduced computational resources, combined with a memory-efficient 8-bit optimizer. (2) Model Depth Pruning: a block importance analysis technique for efficient model compression to arbitrary sizes with minimal quality loss. (3) Inference-time Scaling: a repeated sampling strategy that trades computation for model capacity, enabling smaller models to match larger model quality at inference time. Through these strategies, SANA-1.5 achieves a text-image alignment score of 0.72 on GenEval, which can be further improved to 0.80 through inference scaling, establishing a new SoTA on the GenEval benchmark. These innovations enable efficient model scaling across different compute budgets while maintaining high quality, making high-quality image generation more accessible.
Paperid:3257
Authors:Unai Fischer Abaigar · Christoph Kern · Juan Perdomo
Abstract:
Machine learning is increasingly used in government programs to identify and support the most vulnerable individuals, prioritizing assistance for those at greatest risk over optimizing aggregate outcomes. This paper examines the welfare impacts of prediction in equity-driven contexts, and how they compare to other policy levers, such as expanding bureaucratic capacity. Through mathematical models and a real-world case study on long-term unemployment amongst German residents, we develop a comprehensive understanding of the relative effectiveness of prediction in surfacing the worst-off. Our findings provide clear analytical frameworks and practical, data-driven tools that empower policymakers to make principled decisions when designing these systems.
Paperid:3258
Authors:Han-Byul Kim · Duc Hoang · Arnav Kundu · Mohammad Samragh · Minsik Cho
Abstract:
With the rapid expansion in the scale of large language models (LLMs), enabling efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed inference techniques such as Tensor Parallelism pose a significant challenge to achieving scalability and low latency. Therefore, we introduce a novel optimization technique, SyncPoint Drop (SPD), to reduce communication overheads in tensor parallelism by selectively dropping synchronization on attention outputs. In detail, we first propose a block design that allows execution to proceed without communication through SPD. Second, we apply different SPD strategies to attention blocks based on their sensitivity to the model accuracy. The proposed methods effectively alleviate communication bottlenecks while minimizing accuracy degradation during LLM inference, offering a scalable solution for diverse distributed environments: SPD offered about 20\% overall inference latency reduction with < 1\% accuracy regression for LLaMA2-70B inference over 8 GPUs.
Paperid:3259
Authors:Jihwan Oh · Minchan Jeong · Jongwoo Ko · Se-Young Yun
Abstract:
Large Language Models (LLMs) solve complex problems using training-free methods like prompt engineering and in-context learning, yet ensuring reasoning correctness remains challenging. While self-correction methods such as self-consistency and self-refinement aim to improve reliability, they often reinforce biases due to the lack of effective feedback mechanisms. Multi-Agent Debate (MAD) has emerged as an alternative, but we identify two key limitations: bias reinforcement, where debate amplifies model biases instead of correcting them, and lack of perspective diversity, as all agents share the same model and reasoning patterns, limiting true debate effectiveness. To systematically evaluate these issues, we introduce $\textit{MetaNIM Arena}$, a benchmark designed to assess LLMs in adversarial strategic decision-making, where dynamic interactions influence optimal decisions. To overcome MAD’s limitations, we propose $\texttt{\textbf{DReaMAD}}$ ($\textbf{D}$iverse $\textbf{Rea}$soning via $\textbf{M}$ulti-$\textbf{A}$gent $\textbf{D}$ebate with Refined Prompt), a novel framework that (1) refines LLMs’ strategic prior knowledge to improve reasoning quality and (2) promotes diverse viewpoints within a single model by systematically modifying prompts, reducing bias. Empirical results show that $\texttt{\textbf{DReaMAD}}$ significantly improves decision accuracy, reasoning diversity, and bias mitigation across multiple strategic tasks, establishing it as a more effective approach for LLM-based decision-making.
Paperid:3260
Authors:Ke Zhu · Jianing Chu · Ilya Lipkovich · Wenyu Ye · Shu Yang
Abstract:
Individualized treatment rules/recommendations (ITRs) aim to improve patient outcomes by tailoring treatments to the characteristics of each individual. However, in high-dimensional treatment settings, existing methods face significant challenges due to data sparsity within treatment groups and highly unbalanced covariate distributions across groups. To address these challenges, we propose a novel calibration-weighted treatment fusion procedure that robustly balances covariates across treatment groups and fuses similar treatments using a penalized working model. The fusion procedure ensures the recovery of latent treatment group structures when either the calibration model or the outcome model is correctly specified. In the fused treatment space, practitioners can seamlessly apply state-of-the-art ITR learning methods with the flexibility to utilize a subset of covariates, thereby achieving robustness while addressing practical concerns such as fairness. We establish theoretical guarantees, including consistency, the oracle property of treatment fusion, and regret bounds when integrated with multi-armed ITR learning methods such as policy trees. Simulation studies show superior group recovery and policy value compared to existing approaches. We illustrate the practical utility of our method using EHR-derived data from patients with Chronic Lymphocytic Leukemia and Small Lymphocytic Lymphoma.
Paperid:3261
Authors:Louis Serrano · Armand Kassaï Koupaï · Thomas Wang · Pierre ERBACHER · patrick gallinari
Abstract:
Solving time-dependent parametric partial differential equations (PDEs) is challenging for data-driven methods, as these models must adapt to variations in parameters such as coefficients, forcing terms, and initial conditions. State-of-the-art neural surrogates perform adaptation through gradient-based optimization and meta-learning to implicitly encode the variety of dynamics from observations. This often comes with increased inference complexity. Inspired by the in-context learning capabilities of large language models (LLMs), we introduce Zebra, a novel generative auto-regressive transformer designed to solve parametric PDEs without requiring gradient adaptation at inference. By leveraging in-context information during both pre-training and inference, Zebra dynamically adapts to new tasks by conditioning on input sequences that incorporate context example trajectories. As a generative model, Zebra can be used to generate new trajectories and allows quantifying the uncertainty of the predictions. We evaluate Zebra across a variety of challenging PDE scenarios, demonstrating its adaptability, robustness, and superior performance compared to existing approaches.
Paperid:3262
Authors:Tingchen Fu · Mrinank Sharma · Phil Torr · Shay Cohen · David Krueger · Fazl Barez
Abstract:
Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 22 widely used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not always enhance resilience against poisoning attacks, and the influence on model resilience varies among different model suites. (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can generalize to extrapolated triggers that are not included in the poisoned data. These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious models and data manipulation.
Paperid:3263
Authors:Yifan HAO · xingyuan pan · Hanning Zhang · Chenlu Ye · Rui Pan · Tong Zhang
Abstract:
Supervised fine-tuning (SFT) on domain-specific data is the dominant approach for adapting foundation models to specialized tasks. However, it has been observed that SFT models tend to forget knowledge acquired during pretraining. In vision models, ensembling a pretrained model with its fine-tuned counterpart has been shown to mitigate this issue. In this work, we demonstrate that the same holds for language models, and, more strikingly, we observe an overadaptation phenomenon: the ensemble model not only retains general knowledge from the foundation model but also outperforms the fine-tuned model even on the fine-tuning domain itself. Despite the empirical success of ensembling, a theoretical understanding of its benefits remains underexplored. We develop a formal theoretical analysis of the overadaptation phenomenon. Ensembling mitigates this by balancing two primary sources of error: bias, caused by insufficient fine-tuning, and variance, introduced by overfitting to fine-tuning data. While regularization techniques aim to address this trade-off, we show that ensembling provides a more effective solution. We analyze this phenomenon in over-parameterized linear settings and demonstrate that interpolating between pretrained and fine-tuned weights significantly improves performance. These findings offer theoretical justification for the observed advantages of model ensembling, supported by empirical experiments consistent with our analysis.
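The weight interpolation this abstract analyzes can be sketched in a few lines. The function name and toy parameter values below are illustrative, not the authors' implementation; real use would operate on model state dicts:

```python
def interpolate_weights(pretrained, finetuned, alpha=0.5):
    """Element-wise convex combination of two weight dictionaries:
    (1 - alpha) * pretrained + alpha * fine-tuned."""
    assert pretrained.keys() == finetuned.keys()
    return {
        name: [(1 - alpha) * p + alpha * f
               for p, f in zip(pretrained[name], finetuned[name])]
        for name in pretrained
    }

# Toy two-layer "model" with scalar parameters per layer.
pre = {"layer1": [0.0, 1.0], "layer2": [2.0, -1.0]}
fin = {"layer1": [1.0, 3.0], "layer2": [0.0, 1.0]}
ens = interpolate_weights(pre, fin, alpha=0.5)  # midpoint of each parameter
```

Sweeping `alpha` from 0 (pure pretrained) to 1 (pure fine-tuned) traces the bias-variance trade-off the analysis describes.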
Paperid:3264
Authors:Yogesh Verma · Amauri Souza · Vikas Garg
Abstract:
The local inductive bias of message-passing graph neural networks (GNNs) hampers their ability to exploit key structural information (e.g., connectivity and cycles). Positional encoding (PE) and Persistent Homology (PH) have emerged as two promising approaches to mitigate this issue. PE schemes endow GNNs with location-aware features, while PH methods enhance GNNs with multiresolution topological features. However, a rigorous theoretical characterization of the relative merits and shortcomings of PE and PH has remained elusive. We bridge this gap by establishing that neither paradigm is more expressive than the other, providing novel constructions where one approach fails but the other succeeds. Our insights inform the design of a novel learnable method, PiPE (short for ‘Persistence informed by PE’), that is provably more expressive than PH and PE. PiPE demonstrates strong performance across a variety of tasks (e.g., molecule property prediction, graph classification, and out-of-distribution generalization), thereby advancing the frontiers of graph representation learning.
Paperid:3265
Authors:David Fleischer · David A Stephens · Archer Yang
Abstract:
We propose a computationally efficient alternative to generalized random forests (GRFs) for estimating heterogeneous effects in large dimensions. While GRFs rely on a gradient-based splitting criterion, which in large dimensions is computationally expensive and unstable, our method introduces a fixed-point approximation that eliminates the need for Jacobian estimation. This gradient-free approach preserves GRF’s theoretical guarantees of consistency and asymptotic normality while significantly improving computational efficiency. We demonstrate that our method runs several times faster than standard GRFs without compromising statistical accuracy. Experiments on both simulated and real-world data validate our approach. Our findings suggest that the proposed method is a scalable alternative for localized effect estimation in machine learning and causal inference applications.
Paperid:3266
Authors:Yiting Liu · Zhi-Hong Deng
Abstract:
In-context learning has become a standard approach for utilizing language models. However, selecting and processing suitable demonstration examples can be challenging and time-consuming, especially when dealing with a large number of examples. We propose exploring the potential of activation space using Iterative Vectors (IVs), a technique designed to enhance in-context performance by simulating gradient updates during inference. IVs extract and iteratively refine the gradients within a language model, applying them during inference without requiring backpropagation at any stage. We evaluate IVs across various tasks using four popular models and observe significant improvements. Our findings suggest that in-context activation steering can be a promising direction, opening new avenues for future research.
Paperid:3267
Authors:Dan Busbridge · Amitis Shidani · Floris Weers · Jason Ramapuram · Etai Littwin · Russell Webb
Abstract:
We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings reduce the risks associated with using distillation at scale; compute allocation for both the teacher and student models can now be done to maximize student performance. We provide compute-optimal distillation recipes for when 1) a teacher exists, or 2) a teacher needs training. If many students are to be distilled, or a teacher already exists, distillation outperforms supervised pretraining until a compute level which grows predictably with student size. If one student is to be distilled and a teacher also needs training, supervised learning should be done instead. Additionally, we provide insights across our large-scale study of distillation, which can improve our understanding of distillation and inform experimental design.
Paperid:3268
Authors:Kevin Tan · Wei Fan · Yuting Wei
Abstract:
Actor-critic algorithms have become a cornerstone in reinforcement learning (RL), leveraging the strengths of both policy-based and value-based methods. Despite recent progress in understanding their statistical efficiency, no existing work has successfully learned an $\epsilon$-optimal policy with a sample complexity of $O(1/\epsilon^2)$ trajectories with general function approximation when strategic exploration is necessary. We address this open problem by introducing a novel actor-critic algorithm that attains a sample complexity of $O(dH^5 \log|\mathcal{A}|/\epsilon^2 + d H^4 \log|\mathcal{F}|/ \epsilon^2)$ trajectories, and accompanying $\sqrt{T}$ regret when the Bellman eluder dimension $d$ does not increase with $T$ at more than a $\log T$ rate. Here, $\mathcal{F}$ is the critic function class, and $\mathcal{A}$ is the action space. Our algorithm integrates optimism, off-policy critic estimation targeting the optimal Q-function, and rare-switching policy resets. We extend this to the setting of Hybrid RL, showing that initializing the critic with offline data yields sample efficiency gains compared to purely offline or online RL.
Paperid:3269
Authors:Xiaole Zhang · Peiyu Zhang · Xiongye Xiao · Shixuan Li · Vasileios Tzoumas · Vijay Gupta · Paul Bogdan
Abstract:
Integer-order calculus often falls short in capturing the long-range dependencies and memory effects found in many real-world processes. Fractional calculus addresses these gaps via fractional-order integrals and derivatives, but fractional-order dynamical systems pose substantial challenges in system identification and optimal control due to the lack of standard control methodologies. In this paper, we theoretically derive the optimal control via linear quadratic regulator (LQR) for fractional-order linear time-invariant (FOLTI) systems and develop an end-to-end deep learning framework based on this theoretical foundation. Our approach establishes a rigorous mathematical model, derives analytical solutions, and incorporates deep learning to achieve data-driven optimal control of FOLTI systems. Our key contributions include: (i) proposing an innovative system identification and control strategy for FOLTI systems, (ii) developing the first end-to-end data-driven learning framework, Fractional-Order Learning for Optimal Control (FOLOC), that learns control policies from observed trajectories, and (iii) deriving a theoretical analysis of sample complexity to quantify the number of samples required for accurate optimal control in complex real-world problems. Experimental results indicate that our method accurately approximates fractional-order system behaviors without relying on Gaussian noise assumptions, pointing to promising avenues for advanced optimal control.
Paperid:3270
Authors:Zhiyong Wang · Chen Yang · John C. S. Lui · Dongruo Zhou
Abstract:
In this work, we study offline reinforcement learning (RL) with the zero-shot generalization property (ZSG), where the agent has access to an offline dataset including experiences from different environments, and the goal of the agent is to train a policy over the training environments which performs well on test environments without further interaction. Existing work showed that classical offline RL fails to generalize to new, unseen environments. We propose pessimistic empirical risk minimization (PERM) and pessimistic proximal policy optimization (PPPO), which leverage pessimistic policy evaluation to guide policy learning and enhance generalization. We show that both PERM and PPPO are capable of finding a near-optimal policy with ZSG. Our result serves as a first step in understanding the foundation of the generalization phenomenon in offline reinforcement learning.
Paperid:3271
Authors:Siqi Guo · Ilgee Hong · Vicente Balmaseda · Changlong Yu · Liang Qiu · Xin Liu · Haoming Jiang · Tuo Zhao · Tianbao Yang
Abstract:
Supervised fine-tuning (SFT) followed by preference optimization (PO), denoted by SFT$\to$PO, has become the standard for improving pretrained large language models (LLMs), with PO demonstrating significant performance gains. However, PO methods rely on either human-labeled preference data or a strong reward model to generate preference data. *Can we fine-tune LLMs without preference data or reward models while achieving competitive performance to SFT$\to$PO?* We address this question by introducing Discriminative Fine-Tuning (DFT), a novel approach that eliminates the need for preference data. Unlike SFT, which employs a generative approach and overlooks negative data, DFT adopts a discriminative paradigm that pushes up the likelihood of positive answers while suppressing negative ones. Our contributions include: (i) a discriminative probabilistic framework for fine-tuning LLMs by explicitly modeling the discriminative likelihood of an answer among all possible outputs given an input; (ii) efficient algorithms to optimize this discriminative likelihood; and (iii) extensive experiments demonstrating DFT’s effectiveness, achieving performance comparable to, if not better than, SFT$\to$PO.
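The discriminative likelihood in contribution (i) amounts to a softmax over candidate answers, pushing up the positive while suppressing the rest. A minimal numerically stable sketch; the function name and toy scores are hypothetical, and a real implementation would score full token sequences:

```python
import math

def discriminative_log_likelihood(scores, positive_idx):
    """Log-probability of the positive answer under a softmax over all
    candidate answers' scores, computed with the log-sum-exp trick."""
    m = max(scores)  # shift for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[positive_idx] - log_z

# Toy scores for three candidate answers; index 0 is the positive one.
ll = discriminative_log_likelihood([2.0, 0.0, -1.0], 0)
```

Maximizing this objective raises the positive score relative to the negatives, which is the discriminative (rather than purely generative) behavior the abstract contrasts with SFT.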
Paperid:3272
Authors:Yabo Liu · Jinghua Wang · Waikeung Wong · Xiaoling Luo · Chengliang Liu · Yong Xu
Abstract:
Segment Anything Model (SAM) has demonstrated remarkable zero-shot segmentation capabilities across various visual tasks. However, its performance degrades significantly when deployed in new target domains with substantial distribution shifts. While existing self-training methods based on fixed teacher-student architectures have shown improvements, they struggle to ensure that the teacher network consistently outperforms the student under severe domain shifts. To address this limitation, we propose a novel Collaborative Mutual Learning Framework for source-free SAM adaptation, leveraging dual networks in a dynamic and cooperative manner. Unlike fixed teacher-student paradigms, our method dynamically assigns the "teacher" and "student" roles by evaluating the reliability of each collaborative network in each training iteration. Our framework incorporates a dynamic mutual learning mechanism with three key components: a direct alignment loss for knowledge transfer, a reverse distillation loss to encourage diversity, and a triplet relationship loss to refine feature representations. These components enhance the adaptation capabilities of the collaborative networks, enabling them to generalize effectively to target domains while preserving their pre-trained knowledge. Extensive experiments on diverse target domains demonstrate that our proposed framework achieves state-of-the-art adaptation performance. Our code is in the appendix and will be publicly available.
Paperid:3273
Authors:Yuwei Niu · Shuo He · Qi Wei · Zongyu Wu · Feng Liu · Lei Feng
Abstract:
While multimodal contrastive learning methods (e.g., CLIP) can achieve impressive zero-shot classification performance, recent research has revealed that these methods are vulnerable to backdoor attacks. To defend against backdoor attacks on CLIP, existing defense methods focus on either the pre-training stage or the fine-tuning stage, which would unfortunately cause high computational costs due to numerous parameter updates and are not applicable in black-box settings. In this paper, we provide the first attempt at a computationally efficient backdoor detection method to defend against backdoored CLIP in the inference stage. We empirically find that the visual representations of backdoored images are insensitive to both benign and malignant changes in class description texts. Motivated by this observation, we propose BDetCLIP, a novel test-time backdoor detection method based on contrastive prompting. Specifically, we first prompt a language model (e.g., GPT-4) to produce class-related description texts (benign) and class-perturbed random texts (malignant) by specially designed instructions. Then, the distribution difference in cosine similarity between images and the two types of class description texts can be used as the criterion to detect backdoor samples. Extensive experiments validate that our proposed BDetCLIP is superior to state-of-the-art backdoor detection methods, in terms of both effectiveness and efficiency.
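The detection criterion can be sketched as a gap in mean cosine similarity between the two prompt types. Everything below is a toy illustration with hand-picked 2-D "embeddings"; real BDetCLIP would use CLIP image/text encoders and GPT-generated prompts:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def detection_score(image_emb, benign_text_embs, malignant_text_embs):
    """Mean cosine similarity to class-related (benign) texts minus mean
    similarity to class-perturbed (malignant) texts. A small gap suggests
    the image representation is insensitive to the text change, i.e. a
    potential backdoor sample."""
    benign = sum(cosine(image_emb, t) for t in benign_text_embs) / len(benign_text_embs)
    malignant = sum(cosine(image_emb, t) for t in malignant_text_embs) / len(malignant_text_embs)
    return benign - malignant

# Toy case: a "clean" image aligned with the benign text shows a large gap;
# a "backdoored" image between the two text directions shows a small one.
clean_gap = detection_score([1.0, 0.0], [[1.0, 0.1]], [[0.0, 1.0]])
trigger_gap = detection_score([0.5, 0.5], [[1.0, 0.1]], [[0.0, 1.0]])
```

Thresholding this gap then separates clean from backdoored inputs at test time.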
Paperid:3274
Authors:Zherui Li · Houcheng Jiang · Hao Chen · Baolong Bi · Zhenhong Zhou · Fei Sun · Junfeng Fang · Xiang Wang
Abstract:
Large language models (LLMs) acquire information from pretraining corpora, but their stored knowledge can become inaccurate or outdated over time. Model editing addresses this challenge by modifying model parameters without retraining, and prevalent approaches leverage hypernetworks to generate these parameter updates. However, they face significant challenges in lifelong editing due to their incompatibility with LLM parameters that dynamically change during the editing process. To address this, we observed that hypernetwork-based lifelong editing aligns with reinforcement learning modeling and proposed RLEdit, an RL-based editing method. By treating editing losses as rewards and optimizing hypernetwork parameters at the full knowledge sequence level, we enable it to precisely capture LLM changes and generate appropriate parameter updates. Our extensive empirical evaluation across several LLMs demonstrates that RLEdit outperforms existing methods in lifelong editing with superior effectiveness and efficiency, achieving a 59.24% improvement while requiring only 2.11% of the time compared to most approaches.
Paperid:3275
Authors:Zhiyang Chen · Hailong Yao · Xia Yin
Abstract:
Multi-objective optimization problems arise widely in various fields. In practice, multi-objective optimization is generally solved by heuristics with tunable parameters that are highly application-specific. Tuning parameters based on real-world instances (a.k.a. algorithm configuration) is generally empirical, without theoretical guarantees. In this work, we establish the theoretical foundation of data-driven multi-objective optimization through the lens of machine learning theory. We provide generalization guarantees on selecting parameters for multi-objective optimization algorithms based on sampled problem instances. Moreover, if the performance metric of the algorithm is the Pareto volume, we can PAC-learn the approximately optimal configuration in polynomial time. We apply our framework to various algorithms, including approximation algorithms, local search, and linear programming. Experiments on multiple problems verify our theoretical findings.
Paperid:3276
Authors:Zeyu Fang · Ming Gu · Sheng Zhou · Jiawei Chen · Qiaoyu Tan · Haishuai Wang · Jiajun Bu
Abstract:
Unsupervised Anomaly Detection (UAD) plays a crucial role in identifying abnormal patterns within data without labeled examples, holding significant practical implications across various domains. Although the individual contributions of representation learning and clustering to anomaly detection are well-established, their interdependencies remain under-explored due to the absence of a unified theoretical framework. Consequently, their collective potential to enhance anomaly detection performance remains largely untapped. To bridge this gap, in this paper, we propose a novel probabilistic mixture model for anomaly detection to establish a theoretical connection among representation learning, clustering, and anomaly detection. By maximizing a novel anomaly-aware data likelihood, representation learning and clustering can effectively reduce the adverse impact of anomalous data and collaboratively benefit anomaly detection. Meanwhile, a theoretically substantiated anomaly score is naturally derived from this framework. Lastly, drawing inspiration from gravitational analysis in physics, we have devised an improved anomaly score that more effectively harnesses the combined power of representation learning and clustering. Extensive experiments, involving 17 baseline methods across 30 diverse datasets, validate the effectiveness and generalization capability of the proposed method, surpassing state-of-the-art methods.
Paperid:3277
Authors:Xingyu Miao · Haoran Duan · Yang Long · Jungong Han
Abstract:
Score Distillation Sampling (SDS) has emerged as a prominent method for text-to-3D generation by leveraging the strengths of 2D diffusion models. However, SDS is limited to generation tasks and lacks the capability to edit existing 3D assets. Conversely, variants of SDS that introduce editing capabilities often cannot generate new 3D assets effectively. In this work, we observe that the processes of generation and editing within SDS and its variants have unified underlying gradient terms. Building on this insight, we propose Unified Distillation Sampling (UDS), a method that seamlessly integrates both the generation and editing of 3D assets. Essentially, UDS refines the gradient terms used in vanilla SDS methods, unifying them to support both tasks. Extensive experiments demonstrate that UDS not only outperforms baseline methods in generating 3D assets with richer details but also excels in editing tasks, thereby bridging the gap between 3D generation and editing.
Paperid:3278
Authors:Leonard Papenmeier · Matthias Poloczek · Luigi Nardi
Abstract:
Recent work reported that simple Bayesian optimization methods perform well for high-dimensional real-world tasks, seemingly contradicting prior work and tribal knowledge. This paper investigates the ‘why’. We identify fundamental challenges that arise in high-dimensional Bayesian optimization and explain why recent methods succeed. Our analysis shows that vanishing gradients caused by Gaussian process initialization schemes play a major role in the failures of high-dimensional Bayesian optimization and that methods that promote local search behaviors are better suited for the task. We find that maximum likelihood estimation of Gaussian process length scales suffices for state-of-the-art performance. Based on this, we propose a simple variant of maximum likelihood estimation called MSR that takes these findings into account. We evaluate MSR on a comprehensive set of real-world applications and a series of targeted experiments that illustrate and confirm our findings.
Paperid:3279
Authors:Zhe Zhao · HaiBin Wen · Pengkun Wang · ShuangWang · Zhenkun Wang · Qingfu Zhang · Yang Wang
Abstract:
Long-tailed distribution datasets are prevalent in many machine learning tasks, yet existing neural network models still face significant challenges when handling such data. This paper proposes a novel adaptive pruning strategy, LTAP (Long-Tailed Adaptive Pruner), aimed at balancing model efficiency and performance to better address the challenges posed by long-tailed data distributions. LTAP introduces multi-dimensional importance scoring criteria and designs a dynamic weight adjustment mechanism to adaptively determine the pruning priority of parameters for different classes. By focusing on protecting parameters critical for tail classes, LTAP significantly enhances computational efficiency while maintaining model performance. This method combines the strengths of long-tailed learning and neural network pruning, overcoming the limitations of existing approaches in handling imbalanced data. Extensive experiments demonstrate that LTAP outperforms existing methods on various long-tailed datasets, achieving a good balance between model compression rate, computational efficiency, and classification accuracy. This research provides new insights into solving model optimization problems in long-tailed learning and is significant for improving the performance of neural networks on imbalanced datasets. The code is available at \url{https://anonymous.4open.science/r/AEFCDAISJ/README.md}.
Paperid:3280
Authors:Dixian Zhu · Tianbao Yang · Livnat Jerby
Abstract:
Regression is a fundamental task in machine learning that has garnered extensive attention over the past decades. The conventional approach for regression involves employing loss functions that primarily concentrate on aligning model prediction with the ground truth for each individual data sample. Recent research endeavors have introduced novel perspectives by incorporating label similarity into regression through the imposition of additional pairwise regularization or contrastive learning on the latent feature space, demonstrating their effectiveness. However, there are two drawbacks to these approaches: (i) their pairwise operations in the latent feature space are computationally more expensive than conventional regression losses; (ii) they lack theoretical insights behind these methods. In this work, we propose GAR (Gradient Aligned Regression) as a competitive alternative method in label space, which is constituted by a conventional regression loss and two pairwise label difference losses for gradient alignment, including magnitude and direction. GAR enjoys: i) the same level of efficiency as the conventional regression loss, because the quadratic complexity of the proposed pairwise losses can be reduced to linear complexity; ii) theoretical insights from learning the pairwise label difference to learning the gradient of the ground truth function. We limit our current scope to regression on clean data, without noise, outliers, or distributional shifts. We demonstrate the effectiveness of the proposed method practically on two synthetic datasets and on eight extensive real-world tasks from six benchmark datasets against eight competitive baselines. Running time experiments demonstrate the superior efficiency of the proposed GAR compared to existing methods with pairwise regularization or contrastive learning in the latent feature space. Additionally, ablation studies confirm the effectiveness of each component of GAR.
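The quadratic-to-linear reduction claimed in (i) can be illustrated on a squared pairwise difference loss. The identity used here is standard algebra applied to a generic pairwise loss, not necessarily the exact loss in the paper:

```python
def pairwise_diff_loss_quadratic(d):
    """O(n^2): sum over all ordered pairs (i, j) of (d_i - d_j)^2,
    where d could be per-sample residuals or label differences."""
    return sum((di - dj) ** 2 for di in d for dj in d)

def pairwise_diff_loss_linear(d):
    """Same quantity in O(n), via the expansion
    sum_{i,j} (d_i - d_j)^2 = 2n * sum_i d_i^2 - 2 * (sum_i d_i)^2."""
    n = len(d)
    return 2 * n * sum(x * x for x in d) - 2 * sum(d) ** 2
```

Both functions compute the same value, but only the second scales to large batches, which is the kind of reduction that lets a pairwise objective match the cost of a plain regression loss.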
Paperid:3281
Authors:Yuena Lin · Haichun Cai · Jun-Yi Hang · Haobo Wang · Zhen Yang · Gengyu Lyu
Abstract:
Graph contrastive learning (GCL) aims at narrowing positives while dispersing negatives, often causing a minority of samples with great similarities to gather as a small group. This results in two latent shortcomings in GCL: 1) local cohesion, where a class cluster contains numerous independent small groups, and 2) global sparseness, where these small groups (or isolated samples) are dispersedly distributed among all clusters. These shortcomings make the learned distribution focus only on local similarities among partial samples, which hinders the ability to capture the ideal global structural properties among real clusters, especially high intra-cluster compactness and inter-cluster separateness. Considering this, we design a novel fuzzy boundary by extending the original cluster boundary with fuzzy set theory, which involves fuzzy boundary construction and fuzzy boundary contraction to address these shortcomings. The fuzzy boundary construction dilates the original boundaries to bridge the local groups, and the fuzzy boundary contraction forces the dispersed samples or groups within the fuzzy boundary to gather tightly, jointly mitigating local cohesion and global sparseness while forming the ideal global structural distribution. Extensive experiments demonstrate that a graph auto-encoder with the fuzzy boundary significantly outperforms current state-of-the-art GCL models in both downstream tasks and quantitative analysis.
Paperid:3282
Authors:Kai Liu · bowen xu · Shaoyu Wu · Xin Chen · Hao Zhou · Yongliang Tao · lulu hu
Abstract:
Activation sparsity can reduce the computational overhead and memory transfers during the forward pass of Large Language Model (LLM) inference. Existing methods face limitations, either demanding time-consuming recovery training that hinders real-world adoption, or relying on empirical magnitude-based pruning, which causes fluctuating sparsity and unstable inference speed-up. This paper introduces LaRoSA (Layerwise Rotated Sparse Activation), a novel method for activation sparsification designed to improve LLM efficiency without requiring additional training or magnitude-based pruning. We leverage layerwise orthogonal rotations to transform input activations into rotated forms that are more suitable for sparsification. By employing a Top-K selection approach within the rotated activations, we achieve consistent model-level sparsity and reliable wall-clock time speed-up. LaRoSA is effective across various sizes and types of LLMs, demonstrating minimal performance degradation and robust inference acceleration. Specifically, for LLaMA2-7B at 40\% sparsity, LaRoSA achieves a mere 0.17 perplexity gap with a consistent 1.30× wall-clock time speed-up, and reduces the accuracy gap in zero-shot tasks compared to the dense model to just 0.54\%, while surpassing TEAL by 1.77\% and CATS by 17.14\%.
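The Top-K selection step can be sketched as follows; this toy function omits the orthogonal rotation (which would be applied to the activation vector first) and is an illustrative reading of the abstract, not the authors' code:

```python
def topk_sparsify(activations, k):
    """Keep the k largest-magnitude entries of an activation vector and
    zero the rest, yielding a fixed sparsity of 1 - k/len(activations)
    regardless of the activation magnitudes (unlike threshold pruning)."""
    if k >= len(activations):
        return list(activations)
    # k-th largest absolute value acts as the cutoff for this vector
    threshold = sorted((abs(a) for a in activations), reverse=True)[k - 1]
    out, kept = [], 0
    for a in activations:
        if abs(a) >= threshold and kept < k:  # kept counter breaks ties
            out.append(a)
            kept += 1
        else:
            out.append(0.0)
    return out

sparse = topk_sparsify([0.1, -2.0, 0.5, 3.0], k=2)  # 50% sparsity
```

Because K is fixed per layer, every forward pass zeroes the same fraction of activations, which is what makes the wall-clock speed-up consistent rather than input-dependent.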
Paperid:3283
Authors:Connor Lawless · Lily Weng · Berk Ustun · Madeleine Udell
Abstract:
Machine learning models are designed to predict outcomes using features about an individual, but fail to take into account how individuals can change those features. Consequently, models can assign fixed predictions that deny individuals recourse to change their outcome. This work develops a new paradigm to identify fixed predictions by finding confined regions in which all individuals receive fixed predictions. We introduce the first method, ReVer, for this task, using tools from mixed-integer quadratically constrained programming. Our approach certifies recourse for out-of-sample data, provides interpretable descriptions of confined regions, and runs in seconds on real-world datasets. We conduct a comprehensive empirical study of confined regions across diverse applications. Our results highlight that existing pointwise verification methods fail to discover confined regions, while ReVer provably succeeds.
Paperid:3284
Authors:Vivienne Huiling Wang · Tinghuai Wang · Joni Pajarinen
Abstract:
Hierarchical reinforcement learning (HRL) learns to make decisions on multiple levels of temporal abstraction. A key challenge in HRL is that the low-level policy changes over time, making it difficult for the high-level policy to generate effective subgoals. To address this issue, the high-level policy must capture a complex subgoal distribution while also accounting for uncertainty in its estimates. We propose an approach that trains a conditional diffusion model regularized by a Gaussian Process (GP) prior to generate a complex variety of subgoals while leveraging principled GP uncertainty quantification. Building on this framework, we develop a strategy that selects subgoals from both the diffusion policy and the GP's predictive mean. Our approach outperforms prior HRL methods in both sample efficiency and performance on challenging continuous control benchmarks.
Paperid:3285
Authors:Yihan Du · Anna Winnicki · Gal Dalal · Shie Mannor · R Srikant
Abstract:
Standard reinforcement learning (RL) assumes that an agent can observe a reward for each state-action pair. However, in practical applications, it is often difficult and costly to collect a reward for each state-action pair. While there have been several works considering RL with trajectory feedback, it is unclear if trajectory feedback is inefficient for learning when trajectories are long. In this work, we consider a model named RL with segment feedback, which offers a general paradigm filling the gap between per-state-action feedback and trajectory feedback. In this model, we consider an episodic Markov decision process (MDP), where each episode is divided into $m$ segments, and the agent observes reward feedback only at the end of each segment. Under this model, we study two popular feedback settings: binary feedback and sum feedback, where the agent observes a binary outcome and a reward sum according to the underlying reward function, respectively. To investigate the impact of the number of segments $m$ on learning performance, we design efficient algorithms and establish regret upper and lower bounds for both feedback settings. Our theoretical and experimental results show that: under binary feedback, increasing the number of segments $m$ decreases the regret at an exponential rate; in contrast, surprisingly, under sum feedback, increasing $m$ does not reduce the regret significantly.
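The segment-feedback observation model is simple to illustrate: per-step rewards stay hidden, and only per-segment reward sums are revealed. A minimal sketch with hypothetical rewards (the paper's MDP and its binary-feedback variant are not reproduced here):

```python
import numpy as np

def segment_sum_feedback(rewards, m):
    """Hide per-step rewards; reveal only the reward sum observed at the
    end of each of m (roughly equal-length) segments of the episode."""
    segments = np.array_split(np.asarray(rewards, dtype=float), m)
    return [float(seg.sum()) for seg in segments]

rewards = [1.0, 0.0, 2.0, 1.0, 0.5, 0.5]   # hypothetical per-step rewards
print(segment_sum_feedback(rewards, m=3))  # [1.0, 3.0, 1.0]
```

Setting `m = 1` recovers trajectory feedback, while `m` equal to the episode length recovers per-state-action feedback, which is exactly the interpolation the abstract describes.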
Paperid:3286
Authors:Yang Li · Yuchao Ma · Qi Qi
Abstract:
Online advertising is a vital revenue source for major internet platforms. Recently, joint advertising, which assigns a bundle of one store and one brand to an ad slot rather than allocating a single store, has emerged as an effective method for enhancing allocation efficiency and revenue. However, existing mechanisms for joint advertising fail to achieve optimality, as they tend to focus on individual advertisers and overlook bundle structures. This paper identifies an optimal mechanism for joint advertising in a single-slot setting. For multi-slot joint advertising, we propose BundleNet, a novel bundle-based neural network approach specifically designed for joint advertising. Our extensive experiments demonstrate that the mechanisms generated by BundleNet approximate the theoretical analysis results in the single-slot setting and achieve state-of-the-art performance in the multi-slot setting. This significantly increases platform revenue while ensuring approximate dominant strategy incentive compatibility and individual rationality.
Paperid:3287
Authors:Yuqin Dai · Zhouheng Yao · Chunfeng Song · Qihao Zheng · Weijian Mai · Kunyu Peng · Shuai Lu · Wanli Ouyang · Jian Yang · Jiamin Wu
Abstract:
Brain decoding aims to reconstruct the visual perception of a human subject from fMRI signals, which is crucial for understanding the brain's perception mechanisms. Existing methods are confined to the single-subject paradigm due to substantial brain variability, which leads to weak generalization across individuals and incurs high training costs, exacerbated by the limited availability of fMRI data. To address these challenges, we propose MindAligner, an explicit functional alignment framework for cross-subject brain decoding from limited fMRI data. The proposed MindAligner enjoys several merits. First, we learn a Brain Transfer Matrix (BTM) that projects the brain signals of an arbitrary new subject onto those of one of the known subjects, enabling seamless use of pre-trained decoding models. Second, to facilitate reliable BTM learning, a Brain Functional Alignment module is proposed to perform soft cross-subject brain alignment under different visual stimuli with a multi-level brain alignment loss, uncovering fine-grained functional correspondences with high interpretability. Experiments indicate that MindAligner not only outperforms existing methods in visual decoding under data-limited conditions, but also provides valuable neuroscience insights in cross-subject functional analysis. The code will be made publicly available.
Paperid:3288
Authors:Guopeng Lin · Ruisheng Zhou · Shuyu Chen · Weili Han · Jin Tan · Wenjing Fang · Lei Wang · Tao Wei
Abstract:
K-nearest neighbors (KNN) classification plays a significant role in various applications due to its interpretability. The accuracy of KNN classification relies heavily on large amounts of high-quality data, which are often distributed among different parties and contain sensitive information. Dozens of privacy-preserving frameworks have been proposed for performing KNN classification with data from different parties while preserving data privacy. However, existing privacy-preserving frameworks for KNN classification demonstrate communication inefficiency in the online phase due to two main issues: (1) They suffer from huge communication sizes for secure Euclidean square distance computations. (2) They require numerous communication rounds to select the $k$ nearest neighbors. These two issues significantly impede their practical deployment. In this paper, we present Kona, an efficient privacy-preserving framework for KNN classification. We resolve the above communication issues by (1) designing novel Euclidean triples, which eliminate the online communication for secure Euclidean square distance computations, and (2) proposing a divide-and-conquer bubble protocol, which significantly reduces communication rounds for selecting the $k$ nearest neighbors. Experimental results on eight widely used datasets demonstrate that Kona significantly outperforms the state-of-the-art framework by $1.1\times \sim 3121.2\times$ in communication size, $19.1\times \sim 5783.2\times$ in communication rounds, and $1.1\times \sim 232.6\times$ in runtime.
Paperid:3289
Authors:Mingyu Jin · Kai Mei · Wujiang Xu · Mingjie Sun · Ruixiang Tang · Mengnan Du · Zirui Liu · Yongfeng Zhang
Abstract:
Large language models (LLMs) have achieved remarkable success in contextual knowledge understanding. In this paper, we show for the first time that concentrated massive values consistently emerge in specific regions of attention queries (Q) and keys (K), while no such patterns appear in values (V), across various modern transformer-based LLMs. Through extensive experiments, we further demonstrate that these massive values play a critical role in interpreting contextual knowledge (i.e., knowledge obtained from the current context window) rather than in retrieving parametric knowledge stored within the model’s parameters. Our further investigation of quantization strategies reveals that ignoring these massive values leads to a pronounced drop in performance on tasks requiring rich contextual understanding, aligning with our analysis. Finally, we trace the emergence of concentrated massive values and find that such concentration is caused by Rotary Positional Encoding (RoPE) and that it appears from the very first layers. These findings shed new light on how Q and K operate in LLMs and offer practical insights for model design and optimization. The code is available at https://anonymous.4open.science/r/RopewithLLM-F102.
Paperid:3290
Authors:Ojash Neopane · Aarti Singh · Aaditya Ramdas
Abstract:
Estimation and inference for the Average Treatment Effect (ATE) is a cornerstone of causal inference and often serves as the foundation for more advanced procedures. While traditionally analyzed in a batch setting, recent advances in martingale theory have paved the way for adaptive methods that can enhance the power of downstream inference. Despite these advances, progress in understanding and developing adaptive algorithms remains in its early stages. Existing works either focus on asymptotic analyses that overlook exploration-exploitation trade-offs relevant in finite-sample regimes or rely on suboptimal estimation procedures. In this work, we address these limitations by studying adaptive sampling procedures leveraging the asymptotically optimal Augmented Inverse Probability Weighting (AIPW) estimator. Our analysis uncovers challenges obscured by asymptotic approaches and introduces a novel algorithmic design principle reminiscent of optimism in multi-armed bandits. This principled approach enables our algorithm to achieve significant theoretical and empirical gains compared to prior methods. Our findings mark a step forward in advancing adaptive causal inference methods in theory and practice.
Paperid:3291
Authors:Ziqian Zhang · Bohan Yang · Lihe Li · Yuqi Bian · Ruiqi Xue · Feng Chen · Yi-Chen Li · lei yuan · Yang Yu
Abstract:
The policy trained by reinforcement learning (RL) makes decisions based on state features collected from sensors. It is common for state features to evolve for reasons such as periodic sensor maintenance or the addition of new sensors for performance improvement. The deployed policy fails in the new state space when state features are unseen during training. Previous work tackles this challenge by training a sensor-invariant policy or generating multiple policies and selecting the appropriate one with limited samples. However, both directions struggle to guarantee performance when faced with unpredictable evolutions. In this paper, we formalize this problem as state evolvable reinforcement learning (SERL), where the agent is required to mitigate policy degradation after state evolutions without costly exploration. We propose **Lapse** by reusing policies learned from the old state space in two distinct aspects. On one hand, Lapse directly reuses the *robust* old policy by composing it with a learned state reconstruction model to handle vanishing sensors. On the other hand, the behavioral experience from the old policy is reused by Lapse to train a newly adaptive policy through offline learning, better utilizing new sensors. To leverage the advantages of both policies in different scenarios, we further propose *automatic ensemble weight adjustment* to effectively aggregate them. Theoretically, we justify that robust policy reuse helps mitigate uncertainty and error from both evolution and reconstruction. Empirically, Lapse achieves a significant performance improvement, outperforming the strongest baseline by about $2\times$ in benchmark environments.
Paperid:3292
Authors:Liming Shen · Liang Deng · Chongke Bi · Yu Wang · Xinhai Chen · Yueqing Wang · Jie Liu
Abstract:
Implicit neural representation (INR) has now been thrust into the limelight with its flexibility in high-fidelity flow field reconstruction tasks. However, the lack of standard benchmarking datasets and the grid independence assumption for INR-based methods hinder progress and adoption in real-world simulation scenarios. Moreover, naive adoptions of existing INR frameworks suffer from limited accuracy in capturing fine-scale structures and spatiotemporal dynamics. Tackling these issues, we first introduce HFR-Beach, a 5.4 TB public large-scale CFD dataset with 33,600 unsteady 2D and 3D vector fields for reconstructing high-fidelity flow fields. We further present PEINR, a physics-enhanced INR framework, to enrich the flow fields by concurrently enhancing numerical precision and grid resolution. Specifically, PEINR is mainly composed of physical encoding and a transformer-based spatiotemporal fuser (TransSTF). Physical encoding decouples temporal and spatial components, employing Gaussian coordinate encoding and localized encoding techniques to capture the nonlinear characteristics of spatiotemporal dynamics and the stencil discretization of spatial dimensions, respectively. TransSTF fuses spatial and temporal information via a transformer to capture long-range temporal dependencies. Qualitative and quantitative experiments demonstrate that PEINR outperforms state-of-the-art INR-based methods in reconstruction quality.
Paperid:3293
Authors:Disen Lan · Weigao Sun · Jiaxi Hu · Jusen Du · Yu Cheng
Abstract:
Transformers with linear recurrent modeling offer linear-time training and constant-memory inference. Despite their demonstrated efficiency and performance, pretraining such non-standard architectures from scratch remains costly and risky. The linearization of large language models (LLMs) transforms pretrained standard models into linear recurrent structures, enabling more efficient deployment. However, current linearization methods typically introduce additional feature map modules that require extensive fine-tuning and overlook the gating mechanisms used in state-of-the-art linear recurrent models. To address these issues, this paper presents \textbf{Liger}, short for \underline{\textbf{li}}nearizing LLMs with \underline{\textbf{g}}at\underline{\textbf{e}}d \underline{\textbf{r}}ecurrent structures. Liger is a novel approach for converting pretrained LLMs into gated linear recurrent models without adding extra parameters. It repurposes the pretrained key matrix weights to construct diverse gating mechanisms, facilitating the formation of various gated recurrent structures while avoiding the need to train additional components from scratch. Using lightweight fine-tuning with Low-Rank Adaptation (LoRA), Liger restores the performance of the linearized gated recurrent models to match that of the original LLMs. Additionally, we introduce Liger Attention, an intra-layer hybrid attention mechanism, which recovers 93\% of the Transformer-based LLM's performance using only 0.02\% of the pre-training tokens during the linearization process, achieving competitive results across multiple benchmarks, as validated on models ranging from 1B to 8B parameters. Code is available at: \url{https://anonymous.4open.science/r/Liger-anonymous-36A2}.
Paperid:3294
Authors:Alessandro Favero · Antonio Sclocchi · Francesco Cagnetta · Pascal Frossard · Matthieu Wyart
Abstract:
Natural data is often organized as a hierarchical combination of features. How many samples are needed for a generative model to learn the rules governing how these features are composed, so as to produce a combinatorial number of novel data? What signal in the data is exploited to learn? We investigate these questions both theoretically and empirically. Theoretically, we consider diffusion models trained on simple probabilistic context-free grammars - tree-like graphical models used to represent the structure of data such as language and images. We demonstrate that diffusion models learn compositional rules with the sample complexity required for clustering together features with statistically similar contexts. This clustering emerges hierarchically: higher-level, more abstract structures require more data to be identified. This mechanism leads to a sample complexity that scales polynomially, and not exponentially, with the dimension of the structure considered. We thus predict that diffusion models trained on an intermediate dataset size generate data that exhibit local coherence up to a certain scale, but lack global coherence. Finally, we design empirical methods to measure coherence systematically in data generated by diffusion models trained in different domains, including images and texts. We find remarkable agreement with our predictions: both generated text and images achieve progressively larger coherence lengths as the training time or dataset size grows.
Paperid:3295
Authors:Adam Breuer
Abstract:
In this paper, we provide the first practical algorithms with provable guarantees for the problem of inferring the topics assigned to each document in an LDA topic model. This is the primary inference problem for many applications of topic models in social science, data exploration, and causal inference settings. We obtain this result via a novel non-gradient-based, combinatorial approach to estimating topic models. This yields algorithms that converge to near-optimal posterior probability in logarithmic parallel computation time (adaptivity)---exponentially faster than any known LDA algorithm. We also show that our approach can provide interpretability guarantees such that each learned topic is formally associated with a known keyword. Finally, we show that unlike alternatives, our approach can maintain the independence assumptions necessary to use the learned topic model for downstream causal inference methods that allow researchers to study topics as treatments. In terms of practical performance, our approach consistently returns solutions of higher semantic quality than solutions from state-of-the-art LDA algorithms, neural topic models, and LLM-based topic models across a diverse range of text datasets and evaluation parameters.
Paperid:3296
Authors:Jiajun Chen · Jin Tian · Christopher J Quinn
Abstract:
Artificial intelligence plays a significant role in society and can impact people with legal or ethical consequences, such as in healthcare, law enforcement, education, and hiring. Despite the numerous fairness criteria that have been proposed in the machine learning community, there remains limited investigation into fairness as defined through sensitive attributes (e.g., gender, race, or age) in a sequential decision-making framework. In this paper, we focus on causal logistic bandit problems where the learner seeks to make fair decisions, under a notion of fairness that accounts for counterfactual reasoning. We propose and analyze an algorithm by leveraging primal-dual optimization for constrained causal logistic bandits, where the non-linear constraints are a priori unknown and must be learned over time. We obtain sub-linear regret guarantees with leading terms similar to the lower bound for the unconstrained problem, while guaranteeing sub-linear constraint violations. We show how to achieve zero cumulative constraint violations with a small increase in the regret bound.
Paperid:3297
Authors:Shivam Kumar · Yun Yang · Lizhen Lin
Abstract:
In this work, we explore the theoretical properties of conditional deep generative models under the statistical framework of distribution regression where the response variable lies in a high-dimensional ambient space but concentrates around a potentially lower-dimensional manifold. More specifically, we study the large-sample properties of a likelihood-based approach for estimating these models. Our results lead to the convergence rate of a sieve maximum likelihood estimator (MLE) for estimating the conditional distribution (and its devolved counterpart) of the response given predictors in the Hellinger (Wasserstein) metric. Our rates depend solely on the intrinsic dimension and smoothness of the true conditional distribution. These findings provide an explanation of why conditional deep generative models can circumvent the curse of dimensionality from the perspective of statistical foundations and demonstrate that they can learn a broader class of nearly singular conditional distributions. Our analysis also emphasizes the importance of introducing a small noise perturbation to the data when they are supported sufficiently close to a manifold. Finally, in our numerical studies, we demonstrate the effective implementation of the proposed approach using both synthetic and real-world datasets, which also provide complementary validation to our theoretical findings.
Paperid:3298
Authors:Seungyul Han · Sunwoo Lee · Jaebak Hwang · Yonghyeon Jo
Abstract:
Traditional robust methods in multi-agent reinforcement learning (MARL) often struggle against coordinated adversarial attacks in cooperative scenarios. To address this limitation, we propose the Wolfpack Adversarial Attack framework, inspired by wolf hunting strategies, which targets an initial agent and its assisting agents to disrupt cooperation. Additionally, we introduce the Wolfpack-Adversarial Learning for MARL (WALL) framework, which trains robust MARL policies to defend against the proposed Wolfpack attack by fostering system-wide collaboration. Experimental results underscore the devastating impact of the Wolfpack attack and the significant robustness improvements achieved by WALL.
Paperid:3299
Authors:Taesoo Kim · Jinju Kim · Dong Kim · Jong Hwan Ko · Gyeong-Moon Park
Abstract:
The rapid advancement of Zero-Shot Text-to-Speech (ZS-TTS) technology has enabled high-fidelity voice synthesis from minimal audio cues, raising significant privacy and ethical concerns. Despite the threats to voice privacy, research on selectively removing the knowledge needed to replicate unwanted individual voices from pre-trained model parameters has not been explored. In this paper, we address the new challenge of speaker identity unlearning for ZS-TTS systems. To meet this goal, we propose the first machine unlearning frameworks for ZS-TTS, especially Teacher-Guided Unlearning (TGU), designed to ensure the model forgets designated speaker identities while retaining its ability to generate accurate speech for other speakers. Our proposed methods incorporate randomness to prevent consistent replication of forget speakers' voices, assuring unlearned identities remain untraceable. Additionally, we propose a new evaluation metric, speaker-Zero Retrain Forgetting (spk-ZRF). This assesses the model's ability to disregard prompts associated with forgotten speakers, effectively neutralizing its knowledge of these voices. The experiments conducted on the state-of-the-art model demonstrate that TGU prevents the model from replicating forget speakers' voices while maintaining high quality for other speakers. The demo is available at https://speechunlearn.github.io/.
Paperid:3300
Authors:Zhongweiyang Xu · Xulin Fan · Zhong-Qiu Wang · Xilin Jiang · Romit Roy Choudhury
Abstract:
Blind Speech Separation (BSS) aims to separate multiple speech sources from audio mixtures recorded by a microphone array. The problem is challenging because it is a blind inverse problem, i.e., the microphone array geometry, the room impulse response (RIR), and the speech sources are all unknown. We propose ArrayDPS to solve the BSS problem in an unsupervised, array-agnostic, and generative manner. The core idea builds on diffusion posterior sampling (DPS), but unlike DPS where the likelihood is tractable, ArrayDPS must approximate the likelihood by formulating a separate optimization problem. The solution to the optimization approximates room acoustics and the relative transfer functions between microphones. These approximations, along with the diffusion priors, iterate through the ArrayDPS sampling process and ultimately yield separated voice sources. We only need a simple single-speaker speech diffusion model as a prior along with the mixtures recorded at the microphones; no microphone array information is necessary. Evaluation results show that ArrayDPS outperforms all baseline unsupervised methods while being comparable to supervised methods in terms of SDR. Audio demos are provided at: https://arraydps.github.io/ArrayDPSDemo/.
Paperid:3301
Authors:CHUANQI CHENG · Jian Guan · Wei Wu · Rui Yan
Abstract:
Long-form video processing fundamentally challenges vision-language models (VLMs) due to the high computational costs of handling extended temporal sequences. Existing token pruning and feature merging methods often sacrifice critical temporal dependencies or dilute semantic information. We introduce differential distillation, a principled approach that systematically preserves task-relevant information while suppressing redundancy. Based on this principle, we develop ViLaMP, a hierarchical video-language model that processes hour-long videos at ``mixed precision'' through two key mechanisms: (1) differential keyframe selection that maximizes query relevance while maintaining temporal distinctiveness at the frame level and (2) differential feature merging that preserves query-salient features in non-keyframes at the patch level. Hence, ViLaMP retains full information in keyframes while reducing non-keyframes to their most salient features, resembling mixed-precision training. Extensive experiments demonstrate ViLaMP's superior performance across four video understanding benchmarks, particularly on long-form content. Notably, ViLaMP can process ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU, achieving substantial computational efficiency while maintaining state-of-the-art performance.
Paperid:3302
Authors:Artavazd Maranjyan · El Mehdi Saad · Peter Richtarik · Francesco Orabona
Abstract:
Asynchronous methods are fundamental for parallelizing computations in distributed machine learning. They aim to accelerate training by fully utilizing all available resources. However, their greedy approach can lead to inefficiencies, using more computation than required, especially when computation times vary across devices. If the computation times were known in advance, training could be fast and resource-efficient by assigning more tasks to faster workers. The challenge lies in achieving this optimal allocation without prior knowledge of the computation time distributions. In this paper, we propose ATA (Adaptive Task Allocation), a method that adapts to heterogeneous and random distributions of worker computation times. Through rigorous theoretical analysis, we show that ATA identifies the optimal task allocation and performs comparably to methods with prior knowledge of computation times. Experimental results further demonstrate that ATA is resource-efficient, significantly reducing costs compared to the greedy approach, which can be arbitrarily expensive depending on the number of workers.
Paperid:3303
Authors:Cheol Kim · Jai Moondra · Shresth Verma · Madeleine Pollack · Lingkai Kong · Milind Tambe · Swati Gupta
Abstract:
In many real-world applications of Reinforcement Learning (RL), deployed policies have varied impacts on different stakeholders, creating challenges in reaching consensus on how to effectively aggregate their preferences. Generalized $p$-means form a widely used class of social welfare functions for this purpose, with broad applications in fair resource allocation, AI alignment, and decision-making. This class includes well-known welfare functions such as Egalitarian, Nash, and Utilitarian welfare. However, selecting the appropriate social welfare function is challenging for decision-makers, as the structure and outcomes of optimal policies can be highly sensitive to the choice of $p$. To address this challenge, we study the concept of an $\alpha$-approximate portfolio in RL, a set of policies that are approximately optimal across the family of generalized $p$-means for all $p \in [-\infty, 1]$. We propose algorithms to compute such portfolios and provide theoretical guarantees on the trade-offs among approximation factor, portfolio size, and computational efficiency. Experimental results on synthetic and real-world datasets demonstrate the effectiveness of our approach in summarizing the policy space induced by varying $p$ values, empowering decision-makers to navigate this landscape more effectively.
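The generalized $p$-means referenced in this abstract have simple closed forms at the named special cases. A standalone sketch of the welfare function itself (not the authors' portfolio algorithm):

```python
import math

def p_mean_welfare(utilities, p):
    """Generalized p-mean of positive utilities.

    p = 1      -> Utilitarian welfare (arithmetic mean)
    p -> 0     -> Nash welfare (geometric mean, taken as the limit)
    p = -inf   -> Egalitarian welfare (minimum utility)
    """
    n = len(utilities)
    if p == float("-inf"):
        return min(utilities)
    if p == 0:
        return math.exp(sum(math.log(u) for u in utilities) / n)
    return (sum(u ** p for u in utilities) / n) ** (1.0 / p)

u = [1.0, 4.0, 9.0]
print(p_mean_welfare(u, 1))             # arithmetic mean
print(p_mean_welfare(u, 0))             # geometric mean, 36**(1/3)
print(p_mean_welfare(u, float("-inf"))) # minimum, 1.0
```

The sensitivity the abstract mentions is easy to see here: the same utility profile scores 4.67 under $p=1$ but only 1.0 under $p=-\infty$, so the policy that maximizes one can be far from optimal for the other.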
Paperid:3304
Authors:Yuan Li · Zhengzhong Liu · Eric Xing
Abstract:
Optimizing data mixtures for supervised fine-tuning (SFT) of large language models (LLMs) is critical for developing general-purpose models, yet this area remains underexplored. In this paper, we frame data mixing as an optimization problem and introduce a novel method designed to minimize validation loss. Our approach parametrizes the loss by modeling effective data transferred and leveraging scaling laws for fine-tuning. By experimenting with various small-scale data mixtures, we fit these parameters and derive the optimal weights. We provide both mathematical proofs and empirical results demonstrating that our algorithm achieves excellent overall and individual performance across all domains. Through controlled experiments, we show that models trained with our optimized weights perform on par with those using optimal weights determined via grid search, with per-domain loss only 0.66\% higher than the best domain loss from grid search on average. Additionally, we show that reweighting popular SFT datasets using our method improves both validation loss and downstream performance. Finally, we discuss how our method can generalize to guide data selection for domain-specific models and provide insights into SFT.
Paperid:3305
Authors:Guan Zhong · Likang Wu · Hongke Zhao · Ming He · Jianping Fan
Abstract:
Attention mechanisms are critical to the success of large language models (LLMs), driving significant advancements in multiple fields. However, for graph-structured data, which requires emphasis on topological connections, they fall short compared to message-passing mechanisms on fixed links, such as those employed by Graph Neural Networks (GNNs). This raises a question: ``Does attention fail for graphs in natural language settings?'' Motivated by these observations, we embarked on an empirical study from the perspective of attention mechanisms to explore how LLMs process graph-structured data. The goal is to gain deeper insights into the attention behavior of LLMs over graph structures. Through a series of experiments, we uncovered unique phenomena regarding how LLMs apply attention to graph-structured data and analyzed these findings to improve the modeling of such data by LLMs. The primary findings of our research are: 1) While LLMs can recognize graph data and capture text-node interactions, they struggle to model inter-node relationships within graph structures due to inherent architectural constraints. 2) The attention distribution of LLMs across graph nodes does not align with ideal structural patterns, indicating a failure to adapt to graph topology nuances. 3) Neither fully connected attention (as in LLMs) nor fixed connectivity (as in GNNs) is optimal; each has specific limitations in its application scenarios. Instead, intermediate-state attention windows improve LLM training performance and seamlessly transition to fully connected windows during inference. Source code: \href{https://anonymous.4open.science/r/LLM_exploration-B21F}{anonymous.4open.science/LLM_exploration-B21F}
Paperid:3306
Authors:Ziyan Li · Naoki Hiratani
Abstract:
Continual learning of multiple tasks remains a major challenge for neural networks. Here, we investigate how task order influences continual learning and propose a strategy for optimizing it. Leveraging a linear teacher-student model with latent factors, we derive an analytical expression relating task similarity and ordering to learning performance. Our analysis reveals two principles that hold under a wide parameter range: (1) tasks should be arranged from the least representative to the most typical, and (2) adjacent tasks should be dissimilar. We validate these rules on both synthetic data and real-world image classification datasets (Fashion-MNIST, CIFAR-10, CIFAR-100), demonstrating consistent performance improvements in both multilayer perceptrons and convolutional neural networks. Our work thus presents a generalizable framework for task-order optimization in task-incremental continual learning.
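The two ordering principles can be turned into a simple greedy heuristic over a task-similarity matrix. This is an illustrative sketch only: the paper derives the principles analytically, and combining them greedily (least representative task first, then the task most dissimilar to the previous one) is an assumption of this toy, not the authors' procedure.

```python
import numpy as np

def order_tasks(S):
    """Greedy ordering from a (T, T) symmetric task-similarity matrix:
    start from the least representative task (lowest total similarity),
    then repeatedly append the task most dissimilar to the previous one."""
    remaining = set(range(S.shape[0]))
    first = min(remaining, key=lambda i: S[i].sum())       # principle 1
    order = [first]
    remaining.remove(first)
    while remaining:
        nxt = min(remaining, key=lambda i: S[order[-1], i])  # principle 2
        order.append(nxt)
        remaining.remove(nxt)
    return order

S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])   # hypothetical similarities among 3 tasks
print(order_tasks(S))             # [2, 0, 1]: the outlier task comes first
```

With only three tasks the two principles can conflict (the similar pair 0 and 1 must end up adjacent somewhere); the greedy rule simply defers that conflict as long as possible.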
Paperid:3307
Authors:Sunny Sanyal · Hayden Prairie · Rudrajit Das · Ali Kavis · Sujay Sanghavi
Abstract:
Fine-tuning a pre-trained model on a downstream task often degrades its original capabilities, a phenomenon known as "catastrophic forgetting". This is especially an issue when one does not have access to the data and recipe used to develop the pre-trained model. Under this constraint, most existing methods for mitigating forgetting are inapplicable. To address this challenge, we propose a *sample weighting scheme for the fine-tuning data* solely based on the pre-trained model's losses. Specifically, we upweight the easy samples on which the pre-trained model's loss is low and vice versa to limit the drift from the pre-trained model. Our approach is orthogonal and yet complementary to existing methods; while such methods mostly operate on parameter or gradient space, we concentrate on the sample space. We theoretically analyze the impact of fine-tuning with our method in a linear setting, showing that it stalls learning in a certain subspace which inhibits overfitting to the target task. We empirically demonstrate the efficacy of our method on both language and vision tasks. As an example, when fine-tuning Gemma 2 2B on MetaMathQA, our method results in only a $0.8$% drop in accuracy on GSM8K (another math dataset) compared to standard fine-tuning, while preserving $5.4$% more accuracy on the pre-training datasets.
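The sample-weighting idea (upweight fine-tuning examples the pre-trained model already finds easy, downweight the hard ones) can be sketched as follows. The exponential-with-temperature form is an illustrative choice for this sketch, not the paper's exact scheme:

```python
import numpy as np

def loss_based_weights(pretrained_losses, temperature=1.0):
    """Weight samples by exp(-loss/T): low pre-trained-model loss ("easy"
    samples) gets high weight. Weights are normalized to mean 1 so the
    overall scale of the fine-tuning objective is unchanged."""
    z = np.exp(-np.asarray(pretrained_losses, dtype=float) / temperature)
    return z / z.sum() * len(z)

losses = np.array([0.2, 0.5, 3.0])  # hypothetical per-sample pre-trained losses
w = loss_based_weights(losses)      # largest weight goes to the 0.2-loss sample
# weighted fine-tuning objective would then be (w * per_sample_ft_loss).mean()
```

The `temperature` knob controls how aggressively the weighting suppresses high-loss samples; as it grows, the weights flatten back toward uniform (standard fine-tuning).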
Paperid:3308
Authors:Xiaoyu Wang · Wenhao Yang · Chang Yao · Mingli Song · Yuanyu Wan
Abstract:
Although the differential privacy (DP) of decentralized online learning has garnered considerable attention recently, existing algorithms are unsatisfactory due to their inability to achieve $(\epsilon, 0)$-DP over $T$ rounds, recover the optimal regret in the non-private case, and maintain the lightweight computation under complex constraints. To address these issues, we first propose a new decentralized online learning algorithm satisfying $(\epsilon, 0)$-DP over $T$ rounds, and show that it can achieve $\widetilde{O}(n(\rho^{-1/4}+\epsilon^{-1}\rho^{1/4})\sqrt{T})$ and $\widetilde{O}(n (\rho^{-1/2}+\epsilon^{-1}))$ regret bounds for convex and strongly convex functions respectively, where $n$ is the number of local learners and $\rho$ is the spectral gap of the communication matrix. As long as $\epsilon=\Omega(\sqrt{\rho})$, these bounds match existing lower bounds in the non-private case up to polylogarithmic factors, which implies that $(\epsilon, 0)$-DP of decentralized online learning may be ensured nearly for free. Our key idea for these improvements is to design a block-decoupled accelerated gossip strategy that can be incorporated with the classical tree-based private aggregation, and also enjoys a faster average consensus among local learners. Furthermore, we develop a projection-free variant of our algorithm, which keeps the efficiency even under complex constraints. As a trade-off, the above regret bounds degrade to $\widetilde{O}(n(T^{3/4}+\epsilon^{-1}T^{1/4}))$ and $\widetilde{O}(n(T^{2/3}+\epsilon^{-1}))$ respectively, which however are even better than the existing private centralized projection-free online algorithm.
Paperid:3309
Authors:Yuli Chen · Bo Cheng · Jiale Han · Yingying Zhang · Yingting Li · Shuhao Zhang
Abstract:
Pruning has recently been widely adopted to reduce the parameter scale and improve the inference efficiency of Large Language Models (LLMs). Mainstream pruning techniques often rely on uniform layerwise pruning strategies, which can lead to severe performance degradation at high sparsity levels. Recognizing the varying contributions of different layers in LLMs, recent studies have shifted their focus toward non-uniform layerwise pruning. However, these approaches often rely on pre-defined values, which can result in suboptimal performance. To overcome these limitations, we propose a novel method called Dynamic Layerwise Pruning (DLP). This approach adaptively determines the relative importance of each layer by integrating model weights with input activation information, assigning pruning rates accordingly. Experimental results show that DLP effectively preserves model performance at high sparsity levels across multiple LLMs. Specifically, at 70% sparsity, DLP reduces the perplexity of LLaMA2-7B by 7.79 and improves the average accuracy by 2.7% compared to state-of-the-art methods. Moreover, DLP is compatible with various existing LLM compression techniques and can be seamlessly integrated into Parameter-Efficient Fine-Tuning (PEFT). We release the code\footnote{The code is available at: \url{https://anonymous.4open.science/r/DLP}.} to facilitate future research.
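A minimal sketch of non-uniform layerwise magnitude pruning in this spirit; the importance score (mean weight magnitude times mean activation magnitude) and the allocation rule below are simplified stand-ins for DLP's actual criteria.

```python
import numpy as np

rng = np.random.default_rng(0)
layers = [rng.standard_normal((16, 16)) for _ in range(4)]  # toy weight matrices
acts = [rng.standard_normal(16) for _ in range(4)]          # stand-in input activations

# Hypothetical layer importance from weights and activations, normalized
scores = np.array([np.abs(W).mean() * np.abs(a).mean() for W, a in zip(layers, acts)])
imp = scores / scores.sum()

# Allocate per-layer pruning rates: less important layers are pruned harder,
# rescaled so the average rate matches the global sparsity target.
target = 0.7
raw = 1.0 - imp
rates = raw * (target * len(layers) / raw.sum())

# Magnitude pruning within each layer at its assigned rate
for W, r in zip(layers, rates):
    W[np.abs(W) < np.quantile(np.abs(W), r)] = 0.0
```

By construction the average of `rates` equals the global target, while individual layers deviate from it according to their estimated importance.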
Paperid:3310
Authors:Dongyeop Lee · Jinseok Chung · Kwanhee Lee · Namhoon Lee
Abstract:
Sparsified neural networks often suffer from seemingly inevitable performance degradation, and it remains challenging to restore the original performance despite much recent progress. Motivated by recent studies in robust optimization, we aim to tackle this problem by finding subnetworks that are both sparse and flat at the same time. Specifically, we formulate pruning as a sparsity-constrained optimization problem where flatness is encouraged as an objective. We solve it explicitly via an augmented Lagrange dual approach and extend it further by proposing a generalized projection operation, resulting in novel pruning methods called SAFE and its extension, SAFE$^+$. Extensive evaluations on standard image classification and language modeling tasks reveal that SAFE consistently yields sparse networks with improved generalization performance, which compares competitively to strong alternatives. We further show that SAFE exhibits practically desirable properties, such as robustness to noisy data and insensitivity to target sparsity, which render it suitable for efficient and general use.
Paperid:3311
Authors:Jie Hu · Yi-Ting Ma · Do-Young Eun
Abstract:
We propose a *history-driven target (HDT)* framework in Markov Chain Monte Carlo (MCMC) to improve any random walk algorithm on discrete state spaces, such as general undirected graphs, for efficient sampling from a target distribution $\boldsymbol{\mu}$. With broad applications in network science and distributed optimization, recent innovations like the self-repellent random walk (SRRW) achieve near-zero variance by prioritizing under-sampled states through transition kernel modifications based on past visit frequencies. However, SRRW's reliance on explicit computation of transition probabilities for all neighbors at each step introduces substantial computational overhead, while its strict dependence on time-reversible Markov chains excludes advanced non-reversible MCMC methods. To overcome these limitations, instead of directly modifying the transition kernel, HDT introduces a history-dependent target distribution $\boldsymbol{\pi}[\mathbf{x}]$ to replace the original target $\boldsymbol{\mu}$ in any graph sampler, where $\mathbf{x}$ represents the empirical measure of past visits. This design preserves lightweight implementation by requiring only local information between the current and proposed states, achieves compatibility with both reversible and non-reversible MCMC samplers, and retains unbiased samples from the target distribution $\boldsymbol{\mu}$ with near-zero variance performance. Extensive experiments in graph sampling demonstrate consistent performance gains, and a memory-efficient Least Recently Used (LRU) cache ensures scalability to large general graphs.
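A toy sketch of the mechanism on a four-state space: a Metropolis-Hastings sampler whose target is reweighted by the empirical visit measure, penalizing over-visited states. The polynomial reweighting and its strength `alpha` are illustrative assumptions; the paper's actual form of pi[x] may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.1, 0.2, 0.3, 0.4])  # original target distribution
alpha = 2.0                          # repellence strength (hypothetical)
counts = np.ones(4)                  # smoothed visit counts
state = 0

for _ in range(20000):
    x = counts / counts.sum()            # empirical measure of past visits
    pi = mu * (x / mu) ** (-alpha)       # history-dependent target (sketch)
    prop = rng.integers(4)               # uniform proposal over states
    # acceptance uses only the current and proposed states, as in HDT
    if rng.random() < min(1.0, pi[prop] / pi[state]):
        state = prop
    counts[state] += 1

empirical = counts / counts.sum()
```

At the fixed point x = mu the reweighting vanishes, so the chain's empirical measure is driven toward the original target mu.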
Paperid:3312
Authors:Seungwook Han · Jinyeop Song · Jeff Gore · Pulkit Agrawal
Abstract:
Autoregressive transformers exhibit adaptive learning through in-context learning (ICL), which raises the question of how. Prior works have shown that transformers represent the ICL tasks as vectors in their representations. In this paper, we leverage the encoding-decoding framework to study how transformers form task vectors during pretraining and how their task encoding quality predicts ICL task performance. On synthetic ICL tasks, we analyze the training dynamics of a small transformer and report the coupled emergence of task encoding and decoding. As the model learns to encode different latent tasks (e.g., "Finding the first noun in a sentence.") into distinct, separable representations, it concurrently builds conditional decoding algorithms and improves its ICL performance. We validate this phenomenon across pretrained models of varying scales (Gemma-2 2B/9B/27B, Llama-3.1 8B/70B) and over the course of pretraining in OLMo-7B. Further, we demonstrate that the quality of task encoding inferred from representations predicts ICL performance, and that, surprisingly, fine-tuning the earlier layers can improve the task encoding and performance more than fine-tuning the later layers. Our empirical insights shed light on the success and failure modes of large language models via their representations.
Paperid:3313
Authors:Chi Gao · Lukai Li · YANCHENG ZHOU · Shangqi Guo
Abstract:
Link prediction is a fundamental problem for network-structured data. However, the prevalent research paradigm tends to assume abundant observed links, overlooking more challenging scenarios with a scarcity of observed links, resulting in insufficient sample sizes. In real-world networks, hierarchical modularity, characterized by structured and nested connections, remains robust even with sparsely observed links. To address the challenge of limited link samples, we propose leveraging hierarchical modularity as a prior structure. We introduce complete-tree (CT) space, a discrete metric space with latent complete-tree structures, to formalize hierarchical modularity with an emphasis on its hierarchical permutation symmetry. Utilizing group theory to quantify and compare the permutation symmetries of different spaces, we prove that the CT space provides a significantly lower bound on sample complexity than the commonly used Euclidean space. We develop leaf matching, a data-efficient network embedding that maps nodes onto the CT space and conducts discrete optimization by reducing it to decentralized search. Experiments verify the data efficiency of CT space over other spaces. Moreover, leaf matching outperforms the state-of-the-art graph transformer in data-scarce scenarios while exhibiting excellent scalability.
Paperid:3314
Authors:Liangze Jiang · Damien Teney
Abstract:
Out-of-distribution (OOD) generalization is challenging because distribution shifts come in many forms. Numerous algorithms exist to address specific settings, but choosing the right training algorithm for the right dataset without trial and error is difficult. Indeed, real-world applications often involve multiple types and combinations of shifts that are difficult to analyze theoretically. Method. This work explores the possibility of learning the selection of a training algorithm for OOD generalization. We propose a proof of concept (OOD-Chameleon) that formulates the selection as a multi-label classification over candidate algorithms, trained on a dataset of datasets representing a variety of shifts. We evaluate the ability of OOD-Chameleon to rank algorithms on unseen shifts and datasets based only on dataset characteristics, i.e., without training models first, unlike traditional model selection. Findings. Extensive experiments show that the learned selector identifies high-performing algorithms across synthetic, vision, and language tasks. Further inspection shows that it learns non-trivial decision rules, which provides new insights into the applicability of existing algorithms. Overall, this new approach opens the possibility of better exploiting and understanding the plethora of existing algorithms for OOD generalization.
Paperid:3315
Authors:Jaemoo Choi · Jaewoong Choi · Dohyun Kwon
Abstract:
We address the convergence problem in learning the Optimal Transport (OT) map, where the OT map refers to a map from one distribution to another while minimizing the transport cost. Semi-dual Neural OT, a widely used approach for learning OT maps with neural networks, often generates fake solutions that fail to transfer one distribution to another accurately. We identify a sufficient condition under which the max-min solution of Semi-dual Neural OT recovers the true OT map. Moreover, to address cases when this sufficient condition is not satisfied, we propose a novel method, OTP, which learns both the OT map and the Optimal Transport Plan, representing the optimal coupling between two distributions. Under sharp assumptions on the distributions, we prove that our model eliminates the fake solution issue and correctly solves the OT problem. Our experiments show that the OTP model recovers the optimal transport map where existing methods fail and outperforms current OT-based models in image-to-image translation tasks. Notably, the OTP model can learn stochastic transport maps when deterministic OT maps do not exist, such as one-to-many tasks like colorization.
Paperid:3316
Authors:Kuo Yang · Zhengyang Zhou · Qihe Huang · Wenjie Du · Limin Li · Wu Jiang · Yang Wang
Abstract:
The out-of-distribution (OOD) generalization challenge is a longstanding problem in graph learning. Through studying the fundamental cause of data distribution shift, i.e., the changes of environments, significant progress has been achieved in addressing this issue. However, we observe that existing works still fail to effectively address complex environment shifts. Existing practices place excessive attention on extracting causal subgraphs, inevitably treating spurious subgraphs as environment variables. While spurious subgraphs are controlled by environments, the space of environment changes encompasses more than the scale of spurious subgraphs. Therefore, existing efforts have a limited inference space for environments, leading to failure under severe environment changes. To tackle this issue, we propose a negative inference graph OOD framework (NeGo) to broaden the inference space for environment factors. Inspired by the successful practice of prompt learning in capturing underlying semantics and causal associations in large language models, we design a negative prompt environment inference to extract underlying environment information. We further introduce the environment-enhanced invariant subgraph learning to effectively exploit inferred environment embedding, ensuring the robust extraction of causal subgraphs under environment shifts. Lastly, we conduct a comprehensive evaluation of NeGo on real-world datasets and synthetic datasets across domains. NeGo outperforms baselines on nearly all datasets, which verifies the effectiveness of our framework.
Paperid:3317
Authors:Yihao Yang · Wenke Huang · Guancheng Wan · Bin Yang · Mang Ye
Abstract:
Federated Parameter-Efficient Fine-Tuning aims to adapt Vision-Language Models for downstream tasks in distributed environments. However, data heterogeneity across participants hinders collaborative effectiveness, necessitating personalized adaptation to cover distinct data distributions. Current personalized methods suffer from two limitations. 1) Textual Property Loss: Existing methods facilitate the collaboration between decoupled prompts at the feature level, which potentially undermines the textual properties of the prompts. 2) Visual Feature Diversity: The diversity of visual features makes it challenging to leverage naive image features directly for image-text alignment in downstream tasks. In this work, we propose Federated Disentangled Tuning with Textual Prior Decoupling and Visual Dynamic Adaptation (FedDDA) to overcome the above limitations. Specifically, we encourage decoupling prompts in a way that maximizes the efficacy of prior knowledge, which is essential for maintaining a coherent linguistic context. Furthermore, we design a visual adaption model to reshape visual space to optimally align with the textual space. Extensive experiments on various image classification tasks show the effectiveness of our work in addressing data heterogeneity.
Paperid:3318
Authors:Jungtaek Kim
Abstract:
Bayesian optimization has attracted huge attention from diverse research areas in science and engineering, since it is capable of efficiently finding a global optimum of an expensive-to-evaluate black-box function. In general, a probabilistic regression model is widely used as a surrogate function to model an explicit distribution over function evaluations given an input to estimate and a training dataset. Beyond the probabilistic regression-based methods, density ratio estimation-based Bayesian optimization has been suggested in order to estimate a density ratio of the groups relatively close and relatively far to a global optimum. Developing this line of research further, supervised classifiers are employed to estimate a class probability for the two groups instead of a density ratio. However, the supervised classifiers used in this strategy are prone to be overconfident for known knowledge on global solution candidates. Supposing that we have access to unlabeled points, e.g., predefined fixed-size pools, we propose density ratio estimation-based Bayesian optimization with semi-supervised learning to solve this challenge. Finally, we show the empirical results of our methods and several baseline methods in two distinct scenarios with unlabeled point sampling and a fixed-size pool and analyze the validity of our methods in diverse experiments.
Paperid:3319
Authors:Like Jian · Dong Liu
Abstract:
Federated learning (FL) enables decentralized clients to train a model collaboratively without sharing local data. A key distinction between FL and centralized learning is that clients' data are not independent and identically distributed, which poses significant challenges in training a global model that generalizes well across heterogeneous local data distributions. In this paper, we analyze the convergence of overparameterized FedAvg with gradient descent (GD). We prove that the impact of data heterogeneity diminishes as the width of neural networks increases, ultimately vanishing when the width approaches infinity. In the infinite-width regime, we further prove that both the global and local models in FedAvg behave as linear models, and that FedAvg achieves the same generalization performance as centralized learning with the same number of GD iterations. Extensive experiments validate our theoretical findings across various network architectures, loss functions, and optimization methods.
Paperid:3320
Authors:Haochen Zhang · Zhong Zheng · Lingzhou Xue
Abstract:
We present the first gap-dependent analysis of regret and communication cost for on-policy federated $Q$-Learning in tabular episodic finite-horizon Markov decision processes (MDPs). Existing FRL methods focus on worst-case scenarios, leading to $\sqrt{T}$-type regret bounds and communication cost bounds with a $\log T$ term scaling with the number of agents $M$, states $S$, and actions $A$, where $T$ is the average total number of steps per agent. In contrast, our novel framework leverages the benign structures of MDPs, such as a strictly positive suboptimality gap, to achieve a $\log T$-type regret bound and a refined communication cost bound that disentangles exploration and exploitation. Our gap-dependent regret bound reveals a distinct multi-agent speedup pattern, and our gap-dependent communication cost bound removes the dependence on $MSA$ from the $\log T$ term. Notably, our gap-dependent communication cost bound also yields a better global switching cost when $M=1$, removing $SA$ from the $\log T$ term.
Paperid:3321
Authors:Hongru Hu · Shuwen Zhang · Yongin Choi · Venkat Malladi · Gerald Quon
Abstract:
Single-cell RNA sequencing (scRNA-seq) enables high-resolution exploration of cellular diversity and gene regulation, yet analyzing such data remains challenging due to technical and methodological limitations. Existing task-specific deep generative models like Variational Auto-Encoder (VAE) and its variants struggle to incorporate external biological knowledge, while transformer-based foundational large Language Models (LLMs or large LaMs) face limitations in computational cost and applicability to tabular gene expression data. Here, we introduce sciLaMA (single-cell interpretable Language Model Adapter), a novel representation learning framework that bridges these gaps by integrating static gene embeddings from multimodal LaMs with scRNA-seq tabular data through a paired-VAE architecture. Our approach generates context-aware representations for both cells and genes and outperforms state-of-the-art methods in key single-cell downstream tasks, including batch effect correction, cell clustering, and cell-state-specific gene marker and module identification, while maintaining computational efficiency. sciLaMA offers a computationally efficient, unified framework for comprehensive single-cell data analysis and biologically interpretable gene module discovery.
Paperid:3322
Authors:Omer Ronen · Ahmed Imtiaz Humayun · Richard Baraniuk · Randall Balestriero · Bin Yu
Abstract:
We develop Latent Exploration Score (LES) to mitigate over-exploration in Latent Space Optimization (LSO), a popular method for solving black-box discrete optimization problems. LSO utilizes continuous optimization within the latent space of a Variational Autoencoder (VAE) and is known to be susceptible to over-exploration, which manifests in unrealistic solutions that reduce its practicality. LES leverages the trained decoder’s approximation of the data distribution, and can be employed with any VAE decoder, including pretrained ones, without additional training, architectural changes, or access to the training data. Our evaluation across five LSO benchmark tasks and twenty-two VAE models demonstrates that LES always enhances the quality of the solutions while maintaining high objective values, leading to improvements over existing solutions in most cases. We believe that LES’ ability to identify out-of-distribution areas, together with its differentiability and computational tractability, will open new avenues for LSO.
Paperid:3323
Authors:Rahul Choudhary · Hanbaek Lyu
Abstract:
The classical static Schrödinger Bridge (SSB) problem, which seeks the most likely stochastic evolution between two marginal probability measures, has been studied extensively in the optimal transport and statistical physics communities, and more recently in machine learning communities in the surge of generative models. The standard approach to solve SSB is to first identify its Kantorovich dual and use Sinkhorn's algorithm to find the optimal potential functions. While the original SSB is only a strictly convex minimization problem, this approach is known to guarantee linear convergence under mild assumptions. In this work, we consider a generalized SSB allowing any strictly increasing divergence functional, far generalizing the entropy functional $x\log x$ in the standard SSB. This problem naturally arises in a wide range of seemingly unrelated problems in entropic optimal transport, random graphs/matrices, and combinatorics. We establish Kantorovich duality and linear convergence of Sinkhorn's algorithm for the generalized SSB problem under mild conditions. Our results provide a new rigorous foundation for understanding Sinkhorn-type iterative methods in the context of large-scale generalized Schrödinger bridges.
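For orientation, the classical Sinkhorn iteration for the standard entropic case with divergence $x\log x$ can be sketched as below; the generalized divergences studied in the paper require correspondingly modified updates.

```python
import numpy as np

def sinkhorn(mu, nu, C, eps=0.5, iters=300):
    """Alternate scaling updates for the standard entropic SSB / entropic OT."""
    K = np.exp(-C / eps)                # Gibbs kernel from the cost matrix
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)              # match column marginals
        u = mu / (K @ v)                # match row marginals
    return u[:, None] * K * v[None, :]  # coupling / transport plan

rng = np.random.default_rng(0)
mu = np.full(5, 0.2)                    # uniform marginals on 5 points
nu = np.full(5, 0.2)
C = rng.random((5, 5))                  # random cost matrix
P = sinkhorn(mu, nu, C)
```

After convergence the returned coupling has (approximately) the prescribed row and column marginals.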
Paperid:3324
Authors:Yixin Zha · Chuxin Wang · Wenfei Yang · Tianzhu Zhang · Feng Wu
Abstract:
A series of pre-trained models have demonstrated promising results in point cloud understanding tasks and are widely applied to downstream tasks through fine-tuning. However, full fine-tuning leads to the forgetting of pre-trained knowledge and substantial storage costs on edge devices. To address these issues, Parameter-Efficient Transfer Learning (PETL) methods have been proposed. According to our analysis, we find that existing 3D PETL methods cannot adequately align with semantic relationships of features required by downstream tasks, resulting in suboptimal performance. To ensure parameter efficiency while introducing rich semantic cues, we propose a novel fine-tuning paradigm for 3D pre-trained models. We utilize frozen 2D pre-trained models to provide vision semantic prompts and design a new Hybrid Attention Adapter to efficiently fuse 2D semantic cues into 3D representations with minimal trainable parameters (1.8M). Extensive experiments conducted on datasets including ScanObjectNN, ModelNet40, and ShapeNetPart demonstrate the effectiveness of our proposed paradigm. In particular, our method achieves 95.6% accuracy on ModelNet40 and attains 90.09% performance on the most challenging classification split ScanObjectNN (PB-T50-RS).
Paperid:3325
Authors:Shizhan Gong · Yankai Jiang · DOU QI · Farzan Farnia
Abstract:
Vision-language models, such as CLIP, have achieved significant success in aligning visual and textual representations, becoming essential components of many multi-modal large language models (MLLMs) like LLaVA and OpenFlamingo. However, numerous studies have identified CLIP's limited fine-grained perception as a critical drawback, leading to substantial failures in downstream MLLMs. In contrast, vision-centric foundation models like DINOv2 demonstrate remarkable capabilities in capturing fine details from images. In this work, we propose a novel kernel-based method to align CLIP's visual representation with that of DINOv2, ensuring that the resulting embeddings maintain compatibility with text embeddings while enhancing perceptual capabilities. Our alignment objective is designed for efficient stochastic optimization. Following this image-only alignment fine-tuning, the visual encoder retains compatibility with the frozen text encoder and exhibits significant improvements in zero-shot object recognition, fine-grained spatial reasoning, and localization. By integrating the aligned visual encoder, downstream MLLMs also demonstrate enhanced performance. The code and models will be made publicly available.
Paperid:3326
Authors:Zhixuan Chen · Xing Hu · Dawei Yang · Zukang Xu · JiangyongYu · XUCHEN · Zhihang Yuan · Sifan Zhou
Abstract:
Mixture-of-Experts (MoE) large language models (LLMs), which leverage dynamic routing and sparse activation to enhance efficiency and scalability, have achieved higher performance while reducing computational costs. However, these models face significant memory overheads, limiting their practical deployment and broader adoption. Post-training quantization (PTQ), a widely used method for compressing LLMs, encounters severe accuracy degradation and diminished generalization performance when applied to MoE models. This paper investigates the impact of MoE’s sparse and dynamic characteristics on quantization and identifies two primary challenges: (1) Inter-expert imbalance, referring to the uneven distribution of samples across experts, which leads to insufficient and biased calibration for less frequently utilized experts; (2) Intra-expert imbalance, arising from MoE's unique aggregation mechanism, which leads to varying degrees of correlation between different samples and their assigned experts. To address these challenges, we propose MoEQuant, a novel quantization framework tailored for MoE LLMs. MoEQuant includes two novel techniques: 1) Expert-Balanced Self-Sampling (EBSS) is an efficient sampling method that efficiently constructs a calibration set with balanced expert distributions by leveraging the cumulative probabilities of tokens and expert balance metrics as guiding factors. 2) Affinity-Guided Quantization (AGQ), which incorporates affinities between experts and samples into the quantization process, thereby accurately assessing the impact of individual samples on different experts within the MoE layer. Experiments demonstrate that MoEQuant achieves substantial performance gains (more than 10 points accuracy gain in the HumanEval for DeepSeekMoE-16B under 4-bit quantization) and boosts efficiency.
Paperid:3327
Authors:Linjing You · Jiabao Lu · Xiayuan Huang
Abstract:
Deep neural networks often experience performance drops due to distribution shifts between training and test data. Although domain adaptation offers a solution, privacy concerns restrict access to training data in many real-world scenarios. This restriction has spurred interest in Test-Time Adaptation (TTA), which adapts models using only unlabeled test data. However, current TTA methods still face practical challenges: (1) a primary focus on instance-wise alignment, overlooking CORrelation ALignment (CORAL) due to missing source correlations; (2) complex backpropagation operations for model updating, resulting in computational overhead; and (3) domain forgetting. To address these challenges, we provide a theoretical analysis to investigate the feasibility of Test-time Correlation Alignment (TCA), demonstrating that correlation alignment between high-certainty instances and test instances can enhance test performance with a theoretical guarantee. Based on this, we propose two simple yet effective algorithms: LinearTCA and LinearTCA+. LinearTCA applies a simple linear transformation to achieve both instance and correlation alignment without additional model updates, while LinearTCA+ serves as a plug-and-play module that can easily boost existing TTA methods. Extensive experiments validate our theoretical insights and show that TCA methods significantly outperform baselines across various tasks, benchmarks, and backbones. Notably, LinearTCA improves adaptation accuracy by 5.88% on the OfficeHome dataset, while using only 4% maximum GPU memory usage and 0.6% computation time compared to the best baseline TTA method.
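The correlation-alignment step can be sketched with a classical CORAL-style linear map: whiten the test features, then recolor them with the covariance of high-certainty (anchor) features. This illustrates the principle only; LinearTCA's exact transformation, anchor selection, and regularization are assumptions here.

```python
import numpy as np

def coral_align(Z_test, Z_anchor, reg=1e-5):
    """Linearly map test features so their covariance matches the anchors'."""
    def cov_sqrt(X, inv=False):
        C = np.cov(X, rowvar=False) + reg * np.eye(X.shape[1])
        vals, vecs = np.linalg.eigh(C)
        vals = 1.0 / np.sqrt(vals) if inv else np.sqrt(vals)
        return (vecs * vals) @ vecs.T
    Zc = Z_test - Z_test.mean(axis=0)
    # whiten with the test covariance, recolor with the anchor covariance
    return Zc @ cov_sqrt(Z_test, inv=True) @ cov_sqrt(Z_anchor) + Z_anchor.mean(axis=0)

rng = np.random.default_rng(0)
Z_anchor = rng.standard_normal((200, 3)) @ np.diag([1.0, 2.0, 0.5])
Z_test = rng.standard_normal((300, 3)) @ np.diag([3.0, 0.3, 1.0]) + 5.0
Z_aligned = coral_align(Z_test, Z_anchor)
```

Because the map is a closed-form linear transform, no gradients or model updates are needed, matching the motivation of avoiding backpropagation at test time.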
Paperid:3328
Authors:Yadong Sun · Xiaofeng Cao · Ivor Tsang · Heng Tao Shen
Abstract:
Static systems exhibit diverse structural properties, such as hierarchical, scale-free, and isotropic patterns, where different geometric spaces offer unique advantages. Methods combining multiple geometries have proven effective in capturing these characteristics. However, real-world systems often evolve dynamically, introducing significant challenges in modeling their temporal changes. To overcome this limitation, we propose a unified cross-geometric learning framework for dynamic systems, which synergistically integrates Euclidean and hyperbolic spaces, aligning embedding spaces with structural properties through fine-grained substructure modeling. Our framework further incorporates a temporal state aggregation mechanism and an evolution-driven optimization objective, enabling comprehensive and adaptive modeling of both nodal and relational dynamics over time. Extensive experiments on diverse real-world dynamic graph datasets highlight the superiority of our approach in capturing complex structural evolution, surpassing existing methods across multiple metrics.
Paperid:3329
Authors:Gaotang Li · Yuzhong Chen · Hanghang Tong
Abstract:
Language Models (LMs) often encounter knowledge conflicts when parametric memory contradicts contextual knowledge. Previous works attribute this conflict to the interplay between "memory heads" and "context heads", attention heads assumed to promote either memory or context exclusively. In this study, we go beyond this fundamental assumption by uncovering a critical phenomenon we term the “superposition of contextual information and parametric memory”, where highly influential attention heads could simultaneously contribute to both memory and context. Building upon this insight, we propose Just Run Twice (JUICE), a test-time attention intervention method that steers LMs toward either parametric beliefs or contextual knowledge without requiring fine-tuning. JUICE identifies a set of reliable attention heads and leverages a dual-run approach to mitigate the superposition effects. Extensive experiments across 11 datasets and 6 model architectures demonstrate that JUICE sets the new state-of-the-art performance and robust generalization, achieving significant and consistent improvement across different domains under various conflict types. Finally, we theoretically analyze knowledge conflict and the superposition of contextual information and parametric memory in attention heads, which further elucidates the effectiveness of JUICE in these settings.
Paperid:3330
Authors:Moritz Piening · Florian Beier · Robert Beinert · Gabriele Steidl
Abstract:
We propose a new approach for unsupervised alignment of heterogeneous datasets, which maps data from two different domains without any known correspondences to a common metric space. Our method is based on an unbalanced optimal transport problem with Gromov-Wasserstein marginal penalization. It can be seen as a counterpart to the recently introduced joint multidimensional scaling method. We prove that there exists a minimizer of our functional and that for penalization parameters going to infinity, the corresponding sequence of minimizers converges to a minimizer of the so-called embedded Wasserstein distance. Our model can be reformulated as a quadratic, multi-marginal, unbalanced optimal transport problem, for which a bi-convex relaxation admits a numerical solver via block-coordinate descent. We provide numerical examples for joint embeddings in Euclidean as well as non-Euclidean spaces.
Paperid:3331
Authors:Yuang Zhang · Jiaxi Gu · Li-Wen Wang · Han Wang · JunqiCheng · Yuefeng Zhu · FangYuan Zou
Abstract:
In recent years, while generative AI has advanced significantly in image generation, video generation continues to face challenges in controllability, length, and detail quality, which hinder its application. We present MimicMotion, a framework for generating high-quality human videos of arbitrary length using motion guidance. Our approach has several highlights. Firstly, we introduce confidence-aware pose guidance that ensures high frame quality and temporal smoothness. Secondly, we introduce regional loss amplification based on pose confidence, which reduces image distortion in key regions. Lastly, we propose a progressive latent fusion strategy to generate long and smooth videos. Experiments demonstrate the effectiveness of our approach in producing high-quality human motion videos. Videos and comparisons are available at mimicmotion.github.io.
Paperid:3332
Authors:Xinjie Yao · Yu Wang · Pengfei Zhu · Wanyu LIN · Ruipu Zhao · Zhoupeng Guo · Weihao Li · Qinghua Hu
Abstract:
Traditional machine societies rely on data-driven learning, overlooking interactions and limiting knowledge acquisition from model interplay. To address these issues, we revisit the development of machine societies by drawing inspiration from the evolutionary processes of human societies. Motivated by Social Learning (SL), this paper introduces a practical paradigm of Socialized Coevolution (SC). Compared to most existing methods focused on knowledge distillation and multi-task learning, our work addresses a more challenging problem: not only enhancing the capacity to solve new downstream tasks but also improving the performance of existing tasks through inter-model interactions. Inspired by cognitive science, we propose Dynamic Information Socialized Collaboration (DISC), which achieves SC through interactions between models specialized in different downstream tasks. Specifically, we introduce the dynamic hierarchical collaboration and dynamic selective collaboration modules to enable dynamic and effective interactions among models, allowing them to acquire knowledge from these interactions. Finally, we explore potential future applications of combining SL and SC, discuss open questions, and propose directions for future research, aiming to spark interest in this emerging and exciting interdisciplinary field.
Paperid:3333
Authors:Jingang QU · David Holzmüller · Gael Varoquaux · Marine Le Morvan
Abstract:
The longstanding dominance of gradient-boosted decision trees on tabular data is currently challenged by tabular foundation models using In-Context Learning (ICL): setting the training data as context for the test data and predicting in a single forward pass without parameter updates. While the very recent TabPFNv2 foundation model (2025) excels on tables with up to 10K samples, its alternating column- and row-wise attentions make handling large training sets computationally prohibitive. So, can ICL be effectively scaled and deliver a benefit for larger tables? We introduce TabICL, a tabular foundation model pretrained on datasets with up to 60K samples and handling 500K samples on affordable resources. This is enabled by a novel two-stage architecture: a column-then-row attention mechanism to build fixed-dimensional embeddings of rows, followed by a transformer for efficient ICL. On the TALENT benchmark with 200 datasets, TabICL is on par with TabPFNv2 while being systematically faster (up to 10 times), and significantly outperforms all other approaches. On the 56 datasets with over 10K samples, TabICL surpasses both TabPFNv2 and CatBoost, demonstrating the potential of ICL for large data.
Paperid:3334
Authors:Aaditya Naik · Jason Liu · Claire Wang · Amish Sethi · Saikat Dutta · Mayur Naik · Eric Wong
Abstract:
Neurosymbolic learning enables the integration of symbolic reasoning with deep learning but faces significant challenges in scaling to complex symbolic programs, large datasets, or both. We introduce DOLPHIN, a framework that tackles these challenges by supporting neurosymbolic programs in Python, executing complex symbolic reasoning on the CPU while vectorizing probabilistic computations and gradient propagation on the GPU. Across 13 benchmarks spanning tasks over text, image, and video data, with symbolic reasoning features like recursion and black-box functions, DOLPHIN converges to state-of-the-art accuracies on the more complex benchmarks while existing frameworks such as Scallop, ISED, and IndeCateR+ fail to converge within the time limit. On simpler benchmarks, DOLPHIN matches their performance, while achieving these results 1.71x to 62x faster than the baselines. Overall, DOLPHIN advances the scalability of neurosymbolic frameworks, achieving state-of-the-art efficiency and convergence on difficult benchmarks where existing frameworks struggle.
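The vectorization idea, evaluating probabilistic semantics for a whole batch with array operations rather than per-sample Python loops, can be illustrated with a toy sketch. NumPy stands in for GPU tensors here; this is our own illustration, not DOLPHIN's actual API:

```python
import numpy as np

# Per-sample probabilities that two symbolic predicates hold,
# for a batch of three inputs.
p_a = np.array([0.9, 0.2, 0.6])
p_b = np.array([0.8, 0.5, 0.1])

# Probabilistic conjunction and (inclusive) disjunction, assuming
# independence, computed for the whole batch in one vectorized step.
p_and = p_a * p_b
p_or = p_a + p_b - p_a * p_b

print(p_and)  # -> [0.72 0.1  0.06]
print(p_or)   # -> [0.98 0.6  0.64]
```

The same pattern extends to gradients: because the combination is a differentiable array expression, backpropagation through the probabilistic program comes for free.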
Paperid:3335
Authors:Hyeonah Kim · Minsu Kim · Taeyoung Yun · Sanghyeok Choi · Emmanuel Bengio · Alex Hernandez-Garcia · Jinkyoo Park
Abstract:
Designing biological sequences with desired properties is challenging due to vast search spaces and limited evaluation budgets. Although reinforcement learning methods use proxy models for rapid reward evaluation, insufficient training data can cause proxy misspecification on out-of-distribution inputs. To address this, we propose a novel off-policy search, $\delta$-Conservative Search, that enhances robustness by restricting policy exploration to reliable regions. Starting from high-score offline sequences, we inject noise by randomly masking tokens with probability $\delta$, then denoise them using our policy. We further adapt $\delta$ based on proxy uncertainty on each data point, aligning the level of conservativeness with model confidence. Experimental results show that our conservative search consistently enhances off-policy training, outperforming existing machine learning methods in discovering high-score sequences across diverse tasks, including DNA, RNA, protein, and peptide design.
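The noise-and-denoise loop can be sketched in a few lines. Everything here is a toy stand-in with our own names: `copy_denoiser` simply restores masked tokens from the offline sequence, whereas the paper's policy would generate them from a learned model.

```python
import random

def conservative_search(seq, delta, denoise, rng):
    """One round of delta-Conservative Search (sketch): mask each token
    with probability delta, then let the policy fill in the masked slots."""
    MASK = None
    noised = [MASK if rng.random() < delta else tok for tok in seq]
    return denoise(noised, original=seq)

def copy_denoiser(noised, original):
    # Toy denoiser: restore masked positions from the offline sequence.
    return [o if n is None else n for n, o in zip(noised, original)]

rng = random.Random(0)
seq = list("ACGTACGT")
out = conservative_search(seq, delta=0.5, denoise=copy_denoiser, rng=rng)
print("".join(out))  # -> ACGTACGT (the toy denoiser reconstructs exactly)
```

Adapting $\delta$ per data point, as the abstract describes, would amount to passing a proxy-uncertainty-dependent `delta` for each starting sequence.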
Paperid:3336
Authors:Han Jiang · Xiaoyuan Yi · Zhihua Wei · Ziang Xiao · Shu Wang · Xing Xie
Abstract:
Warning: Contains harmful model outputs. Despite significant advancements, the propensity of Large Language Models (LLMs) to generate harmful and unethical content poses critical challenges. Measuring value alignment of LLMs becomes crucial for their regulation and responsible deployment. Although numerous benchmarks have been constructed to assess social bias, toxicity, and ethical issues in LLMs, these static benchmarks suffer from the evaluation chronoeffect, in which, as models rapidly evolve, existing benchmarks may leak into training data or become saturated, overestimating ever-developing LLMs. To tackle this problem, we propose GETA, a novel generative evolving testing approach based on adaptive testing methods in measurement theory. Unlike traditional adaptive testing methods that rely on a static test item pool, GETA probes the underlying moral boundaries of LLMs by dynamically generating test items tailored to model capability. GETA co-evolves with LLMs by learning a joint distribution of item difficulty and model value conformity, thus effectively addressing the evaluation chronoeffect. We evaluated various popular LLMs with GETA and demonstrated that 1) GETA can dynamically create difficulty-tailored test items and 2) GETA's evaluation results are more consistent with models' performance on unseen OOD and i.i.d. items, laying the groundwork for future evaluation paradigms.
Paperid:3337
Authors:Ali Modarressi · Hanieh Deilamsalehy · Franck Dernoncourt · Trung Bui · Ryan A Rossi · Seunghyun Yoon · Hinrich Schuetze
Abstract:
Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information.
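The minimal-lexical-overlap criterion can be made concrete with a simple Jaccard score over word sets, a rough sketch of the kind of filter the benchmark's design implies, not the authors' actual pipeline; the example pair is invented:

```python
import re

def lexical_overlap(question, needle):
    """Jaccard similarity between the word sets of a question and a needle."""
    q = set(re.findall(r"\w+", question.lower()))
    n = set(re.findall(r"\w+", needle.lower()))
    return len(q & n) / len(q | n)

# A literal-match pair scores high; a latent-association pair (the model
# must know Dresden is in Saxony) shares no content words at all.
literal = lexical_overlap("Which character visited Dresden?",
                          "Yuki visited Dresden in 2013.")
latent = lexical_overlap("Which character has been to Saxony?",
                         "Yuki visited Dresden in 2013.")
print(literal > latent, latent)  # -> True 0.0
```

A needle set built this way forces retrieval to rely on latent associations rather than string matching.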
Paperid:3338
Authors:Jinze Li · Yixing Xu · Haiduo Huang · Xuanwu Yin · Dong Li · Edith Ngai · Emad Barsoum
Abstract:
Speculative decoding (SPD) aims to accelerate the autoregressive token generation process of a target Large Language Model (LLM). Some approaches employ a draft model with multiple heads to predict a sequence of future tokens, where each head handles a token in the sequence. The target LLM verifies the predicted sequence and accepts aligned tokens, enabling efficient multi-token generation. However, existing methods assume that all tokens within a sequence are equally important, employing identical head structures and relying on a single-generation paradigm, either serial or parallel. To address this, we theoretically demonstrate that initial tokens in the draft sequence are more important than later ones. Building on this insight, we propose Gumiho, a hybrid model combining serial and parallel heads. Specifically, given the critical importance of early tokens, we employ a sophisticated Transformer architecture for the early draft heads in a serial configuration to improve accuracy. For later tokens, we utilize multiple lightweight MLP heads operating in parallel to enhance efficiency. By allocating more advanced model structures and longer running times to the early heads, Gumiho achieves improved overall performance. The experimental results demonstrate that our method outperforms existing approaches, fully validating its effectiveness.
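The verification step common to speculative decoding, accepting the longest draft prefix the target model agrees with, can be sketched as follows. This is a generic illustration of SPD verification, not Gumiho's specific head architecture, and the token IDs are arbitrary:

```python
def accept_prefix(draft_tokens, target_tokens):
    """Speculative decoding verification (sketch): the target model accepts
    the longest prefix of the draft that matches its own predictions."""
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        n += 1
    return draft_tokens[:n]

print(accept_prefix([5, 7, 9, 2], [5, 7, 1, 2]))  # -> [5, 7]
```

Because acceptance stops at the first mismatch, errors in early draft tokens waste the entire remaining draft, which is consistent with the abstract's argument for spending more capacity on the early heads.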
Paperid:3339
Authors:Arnab Bhattacharyya · Davin Choo · Philips George John · Themis Gouleakis
Abstract:
We revisit the problem of distribution learning within the framework of learning-augmented algorithms. In this setting, we explore the scenario where a probability distribution is provided as potentially inaccurate advice on the true, unknown distribution. Our objective is to develop learning algorithms whose sample complexity decreases as the quality of the advice improves, thereby surpassing standard learning lower bounds when the advice is sufficiently accurate. Specifically, we demonstrate that this outcome is achievable for the problem of learning a multivariate Gaussian distribution $N(\mu, \Sigma)$ in the PAC learning setting. Classically, in the advice-free setting, $\widetilde{\Theta}(d^2/\varepsilon^2)$ samples are sufficient and worst-case necessary to learn $d$-dimensional Gaussians up to TV distance $\varepsilon$ with constant probability. When we are additionally given a parameter $\widetilde{\Sigma}$ as advice, we show that $\widetilde{\mathcal{O}}(d^{2-\beta}/\varepsilon^2)$ samples suffice whenever $|| \widetilde{\Sigma}^{-1/2} \Sigma \widetilde{\Sigma}^{-1/2} - I_d ||_1 \leq \varepsilon d^{1-\beta}$ (where $||\cdot||_1$ denotes the entrywise $\ell_1$ norm) for any $\beta > 0$, yielding a polynomial improvement over the advice-free setting.
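The advice-quality condition is a concrete matrix inequality and can be checked numerically. The helper below is our own sketch (names are ours): it tests whether a given advice matrix satisfies $||\widetilde{\Sigma}^{-1/2} \Sigma \widetilde{\Sigma}^{-1/2} - I_d||_1 \leq \varepsilon d^{1-\beta}$, the regime in which the improved sample complexity applies.

```python
import numpy as np

def advice_quality_ok(Sigma, Sigma_tilde, eps, beta):
    """Check the entrywise-l1 advice condition from the abstract."""
    d = Sigma.shape[0]
    # Symmetric inverse square root of the advice matrix.
    w, V = np.linalg.eigh(Sigma_tilde)
    S_inv_half = V @ np.diag(w ** -0.5) @ V.T
    M = S_inv_half @ Sigma @ S_inv_half - np.eye(d)
    return np.abs(M).sum() <= eps * d ** (1 - beta)

d = 4
# Exact advice (Sigma_tilde == Sigma) always satisfies the condition.
print(advice_quality_ok(np.eye(d), np.eye(d), eps=0.1, beta=0.5))   # -> True
# Badly scaled advice fails it.
print(advice_quality_ok(3 * np.eye(d), np.eye(d), eps=0.1, beta=0.5))  # -> False
```

Intuitively, the closer the whitened covariance is to the identity, the less work the learner must do, which is what the $d^{2-\beta}$ savings formalizes.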
Paperid:3340
Authors:Lanyun Zhu · Deyi Ji · Tianrun Chen · Haiyang Wu · De Wen Soh · Jun Liu
Abstract:
Referring MLLMs extend conventional multimodal large language models by allowing them to receive referring visual prompts and generate responses tailored to the indicated regions. However, these models often suffer from suboptimal performance due to incorrect responses tailored to misleading areas adjacent to or similar to the target region. This work introduces CPCF, a novel framework to address this issue and achieve superior results. CPCF contrasts outputs generated from the indicated visual prompt with those from contrastive prompts sampled from misleading regions, effectively suppressing the influence of erroneous information outside the target region on response generation. To further enhance the effectiveness and efficiency of our framework, several novel designs are proposed, including a prompt extraction network to automatically identify suitable contrastive prompts, a self-training method that leverages unlabeled data to improve training quality, and a distillation approach to reduce the additional computational overhead associated with contrastive decoding. Incorporating these novel designs, CPCF achieves state-of-the-art performance, as demonstrated by extensive experiments across multiple benchmarks.
Paperid:3341
Authors:Reyhane Askari Hemmat · Mohammad Pezeshki · Elvis Dohmatob · Florian Bordes · Pietro Astolfi · Melissa Hall · Jakob Verbeek · Michal Drozdzal · Adriana Romero-Soriano
Abstract:
Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a novel framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4$\times$ fewer samples and requires six times fewer iterations, while on ImageNet-1k, it generates 8$\times$ fewer samples with a 30% reduction in iterations, all while achieving superior performance compared to prior work.
Paperid:3342
Authors:Jingwei Li · Jing Xu · Zifan Wang · Huishuai Zhang · Jingzhao Zhang
Abstract:
One explanation for the strong generalization ability of neural networks is implicit bias. Yet, the definition and mechanism of implicit bias in nonlinear contexts remain poorly understood. In this work, we propose to characterize implicit bias by the count of connected regions in the input space with the same predicted label. Compared with parameter-dependent metrics (e.g., norm or normalized margin), region count can be better adapted to nonlinear, overparameterized models, because it is determined by the function mapping and is invariant to reparametrization. Empirically, we find that small region counts align with geometrically simple decision boundaries and correlate well with good generalization performance. We also observe that good hyper-parameter choices such as larger learning rates and smaller batch sizes can induce small region counts. We further establish theoretical connections and explain how larger learning rates can induce small region counts in neural networks.
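For a two-dimensional toy model, region count is straightforward to estimate by flood-filling a grid of predictions. The sketch below (our own construction, not the paper's code) counts 4-connected components of constant predicted label:

```python
import numpy as np

def region_count(predict, lo=-2.0, hi=2.0, n=64):
    """Estimate the number of connected same-label regions of a 2-D
    classifier by flood-filling an n-by-n grid of predictions."""
    xs = np.linspace(lo, hi, n)
    labels = np.array([[predict(x, y) for y in xs] for x in xs])
    seen = np.zeros_like(labels, dtype=bool)
    count = 0
    for i in range(n):
        for j in range(n):
            if seen[i, j]:
                continue
            count += 1  # new region found; flood-fill it
            stack = [(i, j)]
            seen[i, j] = True
            while stack:
                a, b = stack.pop()
                for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    u, v = a + da, b + db
                    if 0 <= u < n and 0 <= v < n and not seen[u, v] \
                            and labels[u, v] == labels[a, b]:
                        seen[u, v] = True
                        stack.append((u, v))
    return count

# A linear decision boundary yields exactly two regions.
print(region_count(lambda x, y: int(x + y > 0)))  # -> 2
```

A more convoluted boundary (e.g. a checkerboard-like predictor) would drive the count up, matching the intuition that small counts indicate geometrically simple classifiers.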
Paperid:3343
Authors:Artur Back de Luca · George Giapitzakis · Shenghao Yang · Petar Veličković · Kimon Fountoulakis
Abstract:
There is a growing interest in the ability of neural networks to execute algorithmic tasks (e.g., arithmetic, summary statistics, and sorting). The goal of this work is to better understand the role of attention in Transformers for algorithmic execution. Its importance for algorithmic execution has been studied theoretically and empirically using parallel computational models. Notably, many parallel algorithms communicate between processors solely using positional information. Inspired by this observation, we investigate how Transformers can execute algorithms using positional attention, where attention weights depend exclusively on positional encodings. We prove that Transformers with positional attention (positional Transformers) maintain the same expressivity as parallel computational models, incurring a logarithmic depth cost relative to the input length. We analyze their in-distribution learnability and explore how parameter norms in positional attention affect sample complexity. Our results show that positional Transformers introduce a learning trade-off: while they exhibit better theoretical dependence on parameter norms, certain tasks may require more layers, which can, in turn, increase sample complexity. Finally, we empirically explore the out-of-distribution performance of positional Transformers and find that they perform well in tasks where their underlying algorithmic solution relies on positional information.
Paperid:3344
Authors:Grigory Bartosh · Dmitry Vetrov · Christian Andersson Naesseth
Abstract:
The Latent Stochastic Differential Equation (SDE) is a powerful tool for time series and sequence modeling. However, training Latent SDEs typically relies on adjoint sensitivity methods, which depend on simulation and backpropagation through approximate SDE solutions, limiting scalability. In this work, we propose SDE Matching, a new simulation-free method for training Latent SDEs. Inspired by modern Score- and Flow Matching algorithms for learning generative dynamics, we extend these ideas to the domain of stochastic dynamics for time series modeling, eliminating the need for costly numerical simulations. Our results demonstrate that SDE Matching achieves performance comparable to adjoint sensitivity methods while drastically reducing computational complexity.
Paperid:3345
Authors:Zishun Yu · Tengyu Xu · Di Jin · Karthik Abinav Sankararaman · Yun He · Wenxuan Zhou · Zhouhao Zeng · Eryk Helenowski · Chen Zhu · Sinong Wang · Hao Ma · Han Fang
Abstract:
Solving mathematics problems has been an intriguing capability of large language models, and many efforts have been made to improve reasoning by extending reasoning length, such as through self-correction and extensive long chain-of-thoughts. While promising in problem-solving, advanced long reasoning chain models exhibit an undesired single-modal behavior, where trivial questions require unnecessarily tedious long chains of thought. In this work, we propose a way to allow models to be aware of inference budgets by formulating reasoning as utility maximization with respect to an inference budget constraint, hence naming our algorithm Inference Budget-Constrained Policy Optimization (IBPO). In a nutshell, models fine-tuned through IBPO learn to ``understand'' the difficulty of queries and allocate inference budgets to harder ones. With different inference budgets, our best models are able to have a $4.14$\% and $5.74$\% absolute improvement ($8.08$\% and $11.2$\% relative improvement) on MATH500 using $2.16$x and $4.32$x inference budgets respectively, relative to LLaMA3.1 8B Instruct. These improvements are approximately $2$x those of self-consistency under the same budgets.
Paperid:3346
Authors:Eric Wang · Zhichao Chen · Haotian Wang · Yanchao Tan · Licheng Pan · Tianqiao Liu · Xu Chen · Haoxuan Li · Zhouchen Lin
Abstract:
Implicit feedback recommendation is challenged by the absence of the negative feedback essential for effective model training. Existing methods often resort to negative sampling, a technique that treats unlabeled interactions as negative samples. This assumption risks misclassifying potential positive samples within the unlabeled data, thereby undermining model performance. To address this issue, we introduce PURL, a model-agnostic framework that reframes implicit feedback recommendation as a weakly supervised learning task, eliminating the need for negative samples. However, its unbiasedness hinges on the accurate estimation of the class prior. To address this challenge, we propose Progressive Proximal Transport (PPT), which estimates the class prior by minimizing the proximal transport cost between positive and unlabeled samples. Experiments on three real-world datasets validate the efficacy of PURL in terms of improved recommendation quality. Code is available at \url{https://anonymous.4open.science/r/PURL-1AF3/}.
Paperid:3347
Authors:Weixuan Liang · Xinwang Liu · KE LIANG · Jiyuan Liu · En Zhu
Abstract:
Inspired by the well-known coreset in clustering algorithms, we introduce the definition of the core kernel for multiple kernel clustering (MKC) algorithms. The core kernel refers to running MKC algorithms on smaller-scale base kernel matrices to obtain kernel weights similar to those derived using the standard base kernel matrices as input. Specifically, the core kernel is a set of kernel matrices of size $\widetilde{\mathcal{O}} (1/\varepsilon^2)$ such that running MKC algorithms on them achieves a $(1+\varepsilon)$-approximation of the kernel weights. Subsequently, we can leverage the approximated kernel weights to obtain a theoretically guaranteed large-scale extension of MKC algorithms. In this paper, we propose a core kernel construction method based on singular value decomposition and prove that it satisfies the definition of the core kernel for three mainstream MKC algorithms. Finally, we conduct experiments on several benchmark datasets to verify the correctness of the theoretical results and the efficiency of the proposed method.
Paperid:3348
Authors:Jaeho Kim · Seulki Lee
Abstract:
Unsupervised domain adaptation (UDA) for time series data remains a critical challenge in deep learning, with traditional pseudo-labeling strategies failing to capture temporal patterns and channel-wise shifts between domains, producing sub-optimal pseudo-labels. As such, we introduce TransPL, a novel approach that addresses these limitations by modeling the joint distribution $P(X,y)$ of the source domain through code transition matrices, where the codes are derived from vector quantization (VQ) of time series patches. Our method constructs class- and channel-wise code transition matrices from the source domain and employs Bayes' rule for target domain adaptation, generating pseudo-labels based on channel-wise weighted class-conditional likelihoods. TransPL offers three key advantages: explicit modeling of temporal transitions and channel-wise shifts between different domains, versatility towards different UDA scenarios (e.g., weakly-supervised UDA), and explainable pseudo-label generation. We validate TransPL's effectiveness through extensive analysis on four time series UDA benchmarks and confirm that it consistently outperforms state-of-the-art pseudo-labeling methods by a strong margin (6.1\% accuracy improvement, 4.9\% F1 improvement), while providing interpretable insights into the domain adaptation process through its learned code transition matrices.
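The core objects, class-wise code transition matrices and class-conditional likelihoods for pseudo-labeling, can be sketched on toy VQ code sequences. This is a bare-bones, single-channel illustration with our own function names and invented data, not the authors' implementation:

```python
import numpy as np

def transition_matrix(code_seqs, n_codes):
    """Row-normalised first-order transition counts of VQ code sequences,
    with add-one smoothing so unseen transitions keep nonzero mass."""
    T = np.ones((n_codes, n_codes))
    for seq in code_seqs:
        for a, b in zip(seq[:-1], seq[1:]):
            T[a, b] += 1
    return T / T.sum(axis=1, keepdims=True)

def class_loglik(seq, T):
    """Log-likelihood of a code sequence under a class's transition matrix."""
    return sum(np.log(T[a, b]) for a, b in zip(seq[:-1], seq[1:]))

# Two toy source classes with distinct code dynamics.
T0 = transition_matrix([[0, 1, 0, 1, 0, 1]], n_codes=2)  # alternating
T1 = transition_matrix([[0, 0, 0, 1, 1, 1]], n_codes=2)  # block-like
target = [0, 1, 0, 1]  # unlabeled target-domain sequence
pseudo = int(class_loglik(target, T1) > class_loglik(target, T0))
print(pseudo)  # -> 0 (alternating codes match class 0's dynamics)
```

The full method additionally weights these likelihoods channel-wise via Bayes' rule; the principle of scoring a target sequence against each class's learned dynamics is the same.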
Paperid:3349
Authors:Naz Sepahvand · Anvith Thudi · Berivan Isik · Ashmita Bhattacharyya · Nicolas Papernot · Eleni Triantafillou · Daniel Roy · Gintare Karolina Dziugaite
Abstract:
We present a principled, per-instance approach to quantifying the difficulty of unlearning via fine-tuning. We begin by sharpening an analysis of noisy gradient descent for unlearning (Chien et al., 2024), obtaining a better utility–unlearning tradeoff by replacing worst-case privacy loss bounds with per-instance privacy losses (Thudi et al., 2024), each of which bounds the (Renyi) divergence to retraining without an individual data point. To demonstrate the practical applicability of our theory, we present empirical results showing that our theoretical predictions are borne out both for Stochastic Gradient Langevin Dynamics (SGLD) as well as for standard fine-tuning without explicit noise. We further demonstrate that per-instance privacy losses correlate well with several existing data difficulty metrics, while also identifying harder groups of data points, and introduce novel evaluation methods based on loss barriers. Altogether, our findings provide a foundation for more efficient and adaptive unlearning strategies tailored to the unique properties of individual data points.
Paperid:3350
Authors:Elizabeth Baker · Jakiw Pidstrigach · Carles Domingo i Enrich · George Deligiannidis · Nikolas Nüsken
Abstract:
In stochastic optimal control and conditional generative modelling, a central computational task is to modify a reference diffusion process to maximise a given terminal-time reward. Most existing methods require this reward to be differentiable, using gradients to steer the diffusion towards favourable outcomes. However, in many practical settings, like diffusion bridges, the reward is singular, taking an infinite value if the target is hit and zero otherwise. We introduce a novel framework, based on Malliavin calculus and path-space integration by parts, that enables the development of methods robust to such singular rewards. We demonstrate that our approach offers stable and reliable training, outperforming existing techniques.
Paperid:3351
Authors:Shixi Qin · Zhiyong Yang · Shilong Bao · Shi Wang · Qianqian Xu · Qingming Huang
Abstract:
This paper focuses on embedding multiple heterogeneous backdoor triggers in bridge-based diffusion models designed for complex and arbitrary input distributions. Existing backdoor formulations mainly address single-attack scenarios and are limited to models initialized with Gaussian noise. To bridge this gap, we propose \texttt{MixBridge}, a novel diffusion Schrödinger bridge (DSB) framework. Specifically, we first demonstrate that backdoor triggers can be injected into \texttt{MixBridge} by directly training with poisoned image pairs. This eliminates the need for the cumbersome modifications to stochastic differential equations required in previous studies. However, we theoretically show that a single DSB is insufficient for heterogeneous attacks, as the model will follow the geometric mean of the benign and multiple backdoored distributions, leading to performance conflict between different tasks. To overcome this, we propose a Divide-and-Merge strategy, where models are independently pre-trained for each specific objective (\texttt{Divide}) and then integrated into a unified model (\texttt{Merge}). On top of this, a Weight Reallocation Scheme (WRS) is further designed to enhance the stealthiness of \texttt{MixBridge}. Empirical studies across diverse generation tasks speak to the efficacy of \texttt{MixBridge}.
Paperid:3352
Authors:Muzhou Yu · Shuyun Lin · Lei Ma · Bo Lei · Kaisheng Ma
Abstract:
Advancements in generative models have promoted text- and image-based multi-context image generation. Brain signals, offering a direct representation of user intent, present new opportunities for image customization. However, it faces challenges in brain interpretation, cross-modal context fusion and retention. In this paper, we propose MindCustomer to explore the blending of visual brain signals in multi-context image generation. We introduce the Image-Brain Translator (IBT) to augment shared data for cross-subject brain embedding. Further, IBT facilitates the alignment from image latent space to brain representation, enabling natural cross-modal context fusion. By incorporating diffusion fine-tuning with brain embedding optimization, our method resolves semantic conflicts effectively, achieving a more harmonious and cohesive context integration. MindCustomer enables cross-subject generation, delivering unified, high-quality, and natural image outputs. Moreover, it exhibits strong generalization for new subjects via few-shot learning, showcasing the potential for practical application. As the first work for brain-blended image customization, MindCustomer lays a foundational exploration and inspiration for future brain-controlled generative technologies.
Paperid:3353
Authors:Yan Zhong · Chenxi Yang · Suyuan Zhao · Tingting Jiang
Abstract:
This paper presents CPL-IQA, a novel semi-supervised blind image quality assessment (BIQA) framework for authentic distortion scenarios. To address the challenge of limited labeled data in the IQA field, our approach leverages confidence-quantifiable pseudo-label learning to effectively utilize unlabeled authentically distorted images. The framework operates through a preprocessing stage and two training phases: first converting MOS labels to vector labels via entropy minimization, followed by an iterative process that alternates between model training and label optimization. The key innovations of CPL-IQA include a manifold assumption-based label optimization strategy and a confidence learning method for pseudo-labels, which enhance reliability and mitigate outlier effects. Experimental results demonstrate the framework's superior performance on real-world distorted image datasets, offering a more standardized semi-supervised learning paradigm without requiring additional supervision or network complexity.
Paperid:3354
Authors:Cameron Jakub · Mihai Nica
Abstract:
Despite remarkable performance on a variety of tasks, many properties of deep neural networks are not yet theoretically understood. One such mystery is the depth degeneracy phenomenon: the deeper you make your network, the closer your network is to a constant function on initialization. In this paper, we examine the evolution of the angle between two inputs to a ReLU neural network as a function of the number of layers. By using combinatorial expansions, we find precise formulas for how fast this angle goes to zero as depth increases. These formulas capture microscopic fluctuations that are not visible in the popular framework of infinite width limits, and lead to qualitatively different predictions. We validate our theoretical results with Monte Carlo experiments and show that our results accurately approximate finite network behaviour. We also empirically investigate how the depth degeneracy phenomenon can negatively impact training of real networks. The formulas are given in terms of the mixed moments of correlated Gaussians passed through the ReLU function. We also find a surprising combinatorial connection between these mixed moments and the Bessel numbers that allows us to explicitly evaluate these moments.
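The phenomenon is easy to reproduce with a quick Monte Carlo run: track the angle between two fixed inputs as they pass through successive random ReLU layers (He-scaled Gaussian weights). This is a generic illustration of depth degeneracy, not the paper's exact experimental setup:

```python
import numpy as np

def angle(u, v):
    """Angle in radians between two vectors."""
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

rng = np.random.default_rng(0)
width, depth = 512, 20
x = rng.standard_normal(width)
y = rng.standard_normal(width)
angles = [angle(x, y)]
for _ in range(depth):
    # He-scaled random weights keep activation magnitudes stable,
    # isolating the angle dynamics from trivial blow-up/decay.
    W = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
    x, y = np.maximum(W @ x, 0), np.maximum(W @ y, 0)
    angles.append(angle(x, y))
print(f"initial angle {angles[0]:.3f} rad, after {depth} layers {angles[-1]:.3f} rad")
```

Two nearly orthogonal inputs start at roughly π/2 and the angle shrinks layer by layer, so with enough depth all inputs map to nearly the same direction: the network looks constant at initialization.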
Paperid:3355
Authors:Chunlok Lo · Kevin Roice · Parham Mohammad Panahi · Scott Jordan · Adam White · Gabor Mihucz · Farzane Aminmansour · Martha White
Abstract:
This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated many steps. In this paper, we avoid this limitation by constraining background planning to a given set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning, and avoids learning the transition dynamics entirely. We show that our GSP algorithm can propagate value from an abstract space in a manner that helps a variety of base learners learn significantly faster in different domains.
Paperid:3356
Authors:Andrew Patterson · Samuel F Neumann · Martha White · Adam White
Abstract:
Empirical design in reinforcement learning is no small task. Running good experiments requires attention to detail and at times significant computational resources. While compute resources available per dollar have continued to grow rapidly, so has the scale of typical experiments in reinforcement learning. It is now common to benchmark agents with millions of parameters against dozens of tasks, each using the equivalent of 30 days of experience. The scale of these experiments often conflicts with the need for statistical evidence, especially when comparing algorithms. Recent studies have highlighted how popular algorithms are sensitive to hyperparameter settings and implementation details, and that common empirical practice leads to weak statistical evidence (Machado et al., 2018; Henderson et al., 2018). This manuscript represents both a call to action and a comprehensive resource for how to do good experiments in reinforcement learning. In particular, we cover: the statistical assumptions underlying common performance measures, how to properly characterize performance variation and stability, hypothesis testing, special considerations for comparing multiple agents, baseline and illustrative example construction, and how to deal with hyperparameters and experimenter bias. Throughout, we highlight common mistakes found in the literature and their statistical consequences in example experiments. The objective of this document is to provide answers on how we can use our unprecedented compute to do good science in reinforcement learning, as well as stay alert to potential pitfalls in our empirical design.
Paperid:3357
Authors:Hengrui Luo · Jeremy E. Purvis · Didong Li
Abstract:
Modern datasets often exhibit high dimensionality, yet the data reside in low-dimensional manifolds that can reveal underlying geometric structures critical for data analysis. A prime example of such a dataset is a collection of cell cycle measurements, where the inherently cyclical nature of the process can be represented as a circle or sphere. Motivated by the need to analyze these types of datasets, we propose a nonlinear dimension reduction method, Spherical Rotation Component Analysis (SRCA), that incorporates geometric information to better approximate low-dimensional manifolds. SRCA is a versatile method designed to work in both high-dimensional and small sample size settings. By employing spheres or ellipsoids, SRCA provides a low-rank spherical representation of the data with general theoretical guarantees, effectively retaining the geometric structure of the dataset during dimensionality reduction. A comprehensive simulation study, along with a successful application to human cell cycle data, further highlights the advantages of SRCA compared to state-of-the-art alternatives, demonstrating its superior performance in approximating the manifold while preserving inherent geometric structures.
Paperid:3358
Authors:Constantin Philippenko · Aymeric Dieuleveut
Abstract:
In this paper, we investigate the impact of compression on stochastic gradient algorithms for machine learning, a technique widely used in distributed and federated learning. We underline differences in terms of convergence rates between several unbiased compression operators that all satisfy the same condition on their variance, thus going beyond the classical worst-case analysis. To do so, we focus on the case of least-squares regression (LSR) and analyze a general stochastic approximation algorithm for minimizing quadratic functions relying on a random field. We consider weak assumptions on the random field, tailored to the analysis (specifically, expected Hölder regularity), and on the noise covariance, enabling the analysis of various randomizing mechanisms, including compression. We then extend our results to the case of federated learning. More formally, we highlight the impact on the convergence of the covariance $\mathfrak{C}_{\mathrm{ania}}$ of the additive noise induced by the algorithm. We demonstrate that, despite the non-regularity of the stochastic field, the limit variance term scales with $\mathrm{Tr}(\mathfrak{C}_{\mathrm{ania}} H^{-1})/K$ (where $H$ is the Hessian of the optimization problem and $K$ the number of iterations), generalizing the rate for the vanilla LSR case where it is $\sigma^2 \mathrm{Tr}(H H^{-1}) / K = \sigma^2 d / K$ (Bach and Moulines, 2013). Then, we analyze the dependency of $\mathfrak{C}_{\mathrm{ania}}$ on the compression strategy and ultimately its impact on convergence, first in the centralized case, then in two heterogeneous FL frameworks.
Paperid:3359
Authors:Chen Liu · Zhichao Huang · Mathieu Salzmann · Tong Zhang · Sabine Süsstrunk
Abstract:
Adversarial training is a popular method to robustify models against adversarial attacks. However, it exhibits much more severe overfitting than training on clean inputs. In this work, we investigate this phenomenon from the perspective of training instances, i.e., training input-target pairs. Based on a quantitative metric measuring the relative difficulty of an instance in the training set, we analyze the model's behavior on training instances of different difficulty levels. This lets us demonstrate that the decay in generalization performance of adversarial training is a result of fitting hard adversarial instances. We theoretically verify our observations for both linear and general nonlinear models, proving that models trained on hard instances have worse generalization performance than ones trained on easy instances, and that this generalization gap increases with the size of the adversarial budget. Finally, we investigate solutions to mitigate adversarial overfitting in several scenarios, including fast adversarial training and fine-tuning a pretrained model with additional data. Our results demonstrate that using training data adaptively improves the model's robustness.
Paperid:3360
Authors:Mark Rowland · Remi Munos · Mohammad Gheshlaghi Azar · Yunhao Tang · Georg Ostrovski · Anna Harutyunyan · Karl Tuyls · Marc G. Bellemare · Will Dabney
Abstract:
We analyse quantile temporal-difference learning (QTD), a distributional reinforcement learning algorithm that has proven to be a key component in several successful large-scale applications of reinforcement learning. Despite these empirical successes, a theoretical understanding of QTD has proven elusive until now. Unlike classical TD learning, which can be analysed with standard stochastic approximation tools, QTD updates do not approximate contraction mappings, are highly non-linear, and may have multiple fixed points. The core result of this paper is a proof of convergence to the fixed points of a related family of dynamic programming procedures with probability 1, putting QTD on firm theoretical footing. The proof establishes connections between QTD and non-linear differential inclusions through stochastic approximation theory and non-smooth analysis.
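The non-smooth update at the heart of QTD can be illustrated on a static distribution. The sketch below is a toy illustration only, not the paper's algorithm or analysis: it runs the plain quantile-regression stochastic update, whose indicator term is exactly what prevents the contraction-mapping arguments available for classical TD.

```python
import random

def quantile_sgd(samples, taus, alpha=0.005):
    """Stochastic quantile-regression update: each estimate theta_i moves up
    by alpha*tau_i when the sample exceeds it, and down by alpha*(1 - tau_i)
    otherwise. The indicator makes the update non-smooth."""
    thetas = [0.0] * len(taus)
    for g in samples:
        for i, tau in enumerate(taus):
            thetas[i] += alpha * (tau - (1.0 if g < thetas[i] else 0.0))
    return thetas
```

On i.i.d. Uniform(0, 1) samples, each estimate drifts toward the corresponding quantile of the sampling distribution.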
Paperid:3361
Authors:Amir R. Asadi
Abstract:
Machine learning, the predominant approach in the field of artificial intelligence, enables computers to learn from data and experience. In the supervised learning framework, accurate and efficient learning of dependencies between data instances and their corresponding labels requires auxiliary information about the data distribution and the target function. This central concept aligns with the notion of regularization in statistical learning theory. Real-world datasets are often characterized by multiscale data instance distributions and well-behaved, smooth target functions. Scale-invariant probability distributions, such as power-law distributions, provide notable examples of multiscale data instance distributions in various contexts. This paper introduces a hierarchical learning model that leverages such a multiscale data structure with a multiscale entropy-based training procedure and explores its statistical and computational advantages. The hierarchical learning model is inspired by the logical progression in human learning from easy to complex tasks and features interpretable levels. In this model, the logarithm of any data instance’s norm can be construed as the data instance's complexity, and the allocation of computational resources is tailored to this complexity, resulting in benefits such as increased inference speed. Furthermore, our multiscale analysis of the statistical risk yields stronger guarantees compared to conventional uniform convergence bounds.
Paperid:3362
Authors:
Abstract:
Paperid:3363
Authors:Victor Castillo · Fernando Soto · Jesús Cabello · Alfonso Gomez-Espinosa · Jose Antonio Cantoral-Ceballos
Abstract:
We present a novel real-time system for detecting and counting Fire Blight (FB) occurrences in apple orchards using YOLOv10 and YOLO11. Unlike previous studies focused on static images, our system operates directly on live video captured from a moving tractor. A GoPro Hero 12 Black mounted on the vehicle streams video to an onboard computer, which performs real-time processing using OpenCV and Python. With a dataset of 2,726 annotated images, YOLO11 outperformed YOLOv10 in detection accuracy. The proposed system enables efficient, on-the-go disease monitoring and supports precision agriculture through automated logging and analysis.
Paperid:3364
Authors:Hugo Massaroli · Mauricio Farez · Hernan Chaves · Emmanuel Iarussi · Viviana Siless
Abstract:
Deep learning models have shown strong performance in diagnosing Alzheimer’s disease (AD) using neuroimaging data, particularly 18F-FDG PET scans, with training datasets largely composed of North American cohorts such as those in the Alzheimer’s Disease Neuroimaging Initiative (ADNI). However, their generalization to underrepresented populations remains underexplored. In this study, we benchmark convolutional and Transformer-based models on the ADNI dataset and assess their generalization performance on a novel Latin American clinical cohort from the FLENI Institute in Buenos Aires, Argentina. We show that while all models achieve high AUCs on ADNI (up to .96 and .97), their performance drops substantially on FLENI (down to .82 and .80, respectively), revealing a significant domain shift. The tested architectures demonstrated similar performance, calling into question the supposed advantages of transformers for this specific task. Through ablation studies, we identify per-image normalization and correct sample selection as key factors for generalization. Occlusion sensitivity analysis further reveals that models trained on ADNI generally attend to canonical hypometabolic regions for the AD class, but focus becomes unclear for the other classes and for FLENI scans. These findings highlight the need for population-aware validation of diagnostic AI models and motivate future work on domain adaptation and cohort diversification.
Paperid:3365
Authors:Luiz Facury de Souza · Jose Fernandes · Pedro Dutenhefner · Turi Rezende · Gisele Pappa · Gabriela Paixão · Antonio Ribeiro · Wagner Jr.
Abstract:
Clinical trust in classification models remains a critical barrier to deploying automated ECG analysis, despite the transformative potential of deep learning. Although recent models have achieved significant progress in classifying arrhythmias, their reliance on opaque "black-box" reasoning limits practical utility: clinicians often dismiss classic explainability methods such as attention maps as unconvincing evidence for diagnoses. To address this, we present a multimodal system that generates human-understandable explanations by aligning ECG features with textual diagnostic criteria from expert annotations. For example, the Left Bundle Branch Block (LBBB) abnormality is classified not just as a label but through explicit links to descriptors such as "widened QRS complex (>120 ms)". Trained on paired ECG-text data, our approach performs zero-shot classification, matching supervised performance using a Vision Transformer (ViT) without task-specific fine-tuning. By extracting sub-features directly from clinical narratives, such as rhythm irregularities or morphological anomalies, the model grounds its predictions in domain-specific knowledge, mirroring clinician reasoning. This fusion of interpretable feature mapping and robust performance advances a paradigm where AI-driven diagnostics complement, rather than conflict with, the need for transparency in medical decision-making.
Paperid:3366
Authors:Walter Mayor · Johan Obando Ceron · Aaron Courville · Pablo Samuel Castro
Abstract:
The use of parallel actors for data collection has been an effective technique used in reinforcement learning (RL) algorithms. The manner in which data is collected in these algorithms, controlled via the number of parallel environments and the rollout length, induces a form of bias-variance trade-off; the number of training passes over the collected data, on the other hand, must strike a balance between sample efficiency and overfitting. We conduct an empirical analysis of these trade-offs on PPO, one of the most popular RL algorithms that uses parallel actors, and establish connections to network plasticity and, more generally, optimization stability. We examine its impact on network architectures, as well as the hyperparameter sensitivity when scaling data. Our analyses indicate that larger dataset sizes can increase final performance across a variety of settings, and that scaling parallel environments is more effective than increasing rollout lengths. These findings highlight the critical role of data collection strategies in improving agent performance.
Paperid:3367
Authors:Brigitte Mora Reyes · Jennifer Drewyor
Abstract:
Artificial intelligence (AI) systems often reflect biases from economically advanced regions, marginalizing contexts in economically developing regions like Latin America due to imbalanced datasets. This paper examines AI representations of diverse Latin American contexts, revealing disparities between data from economically advanced and developing regions. We highlight how the dominance of English over Spanish, Portuguese, and indigenous languages such as Quechua and Nahuatl perpetuates biases, framing Latin American perspectives through a Western lens. To address this, we introduce a culturally aware dataset rooted in Latin American history and sociopolitical contexts, challenging Eurocentric models. We evaluate six language models on questions testing cultural context awareness, using a novel Cultural Expressiveness metric, statistical tests, and linguistic analyses. Our findings show that some models better capture Latin American perspectives, while others exhibit significant sentiment misalignment (p < 0.001). Fine-tuning Mistral-7B with our dataset improves its cultural expressiveness by 42.9%, advancing equitable AI development. We advocate for equitable AI by prioritizing datasets that reflect Latin American history, indigenous knowledge, and diverse languages, while emphasizing community-centered approaches to amplify marginalized voices. Code and datasets will be available in a public repository.
Paperid:3368
Authors:Maynara de Souza · Cleber Zanchettin
Abstract:
Data scarcity and quality issues remain significant barriers to developing robust predictive models in medical research. Traditional reliance on real-world data often leads to biased models with poor generalizability across diverse patient populations. Synthetic data generation has emerged as a promising solution, yet challenges related to these samples' representativeness and effective utilization persist. This paper introduces Profile2Gen, a novel data-centric framework designed to guide the generation and refinement of synthetic data, focusing on addressing hard-to-learn samples in regression tasks. We conducted approximately 18,000 experiments to validate its effectiveness across six medical datasets, utilizing seven state-of-the-art generative models. Results demonstrate that refined synthetic samples can reduce predictive errors and enhance model reliability. Additionally, we generalize the DataIQ framework to support regression tasks, enabling its application in broader contexts. Statistical analyses confirm that our approach achieves equal or superior performance compared to models trained exclusively on real data.
Paperid:3369
Authors:Joao Dantas
Abstract:
We present an offline language-based support tool to assist tactical decision-making and training in air combat operations. The system applies retrieval-augmented generation (RAG) to answer natural language questions using content extracted from doctrinal and tactical manuals. Built entirely with open-source components, it performs text preprocessing, embedding generation, and semantic search locally, ensuring full offline functionality. We demonstrate its current capabilities through use cases involving air combat manuals and discuss potential applications in onboard mission systems, flight simulators, and training environments. The source code is modular and intended for public release to support future extensions and integration.
Paperid:3370
Authors:Alejandro Mildiner · Michael Moreno · Viviana Siless
Abstract:
Access to formal credit remains a significant barrier for microentrepreneurs in Latin America. This work introduces a novel credit scoring methodology that integrates image-based cluster features extracted via Vision Transformer (ViT) models into a tabular classification pipeline. Using images submitted by loan applicants, we generate high-dimensional embeddings with ViT (Dosovitskiy et al., 2021), CLIP (Radford et al., 2021), and I-JEPA (Assran et al., 2023), and apply unsupervised clustering to discover latent visual patterns correlated with creditworthiness. Each applicant is then encoded based on their proximity to learned clusters, yielding categorical features that represent visual similarity to known payer profiles. These features are incorporated into XGBoost models alongside financial and demographic data. Our results show that visual-cluster-based features improve predictive performance: they outperform a baseline model utilizing traditional indicators from credit bureaus and alternative data (AUC .79), reaching AUC .843 in the case of I-JEPA. This approach demonstrates how computer vision can provide interpretable, transferable insights from visual content, offering a new pathway toward fairer, more inclusive credit evaluation in underserved economies (Salcedo-Perez & Patino, 2018).
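The cluster-proximity encoding described above can be sketched in a few lines. This is an illustrative toy, not the authors' pipeline: a tiny hand-rolled k-means (with a deterministic spread-out initialization chosen for self-containment) learns cluster centers from embeddings, and each applicant is then encoded by the index of the nearest center, a categorical feature a tabular model such as XGBoost can consume.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Tiny k-means for the sketch; initialization takes k evenly spaced rows."""
    centers = X[:: max(1, len(X) // k)][:k].astype(float)
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute the means.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def cluster_feature(embeddings, centers):
    """Categorical feature: index of the nearest learned cluster per applicant."""
    return np.argmin(((embeddings[:, None] - centers[None]) ** 2).sum(-1), axis=1)
```

In the paper's setting the embeddings would come from ViT, CLIP, or I-JEPA rather than synthetic data.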
Paperid:3371
Authors:Joaquín H. Monti · Gastón Schlotthauer · Marcelo Colominas
Abstract:
Non-invasive fetal electrocardiogram (NI-FECG) is used to monitor fetal heart health. Single-channel FECG extraction is a critical challenge for the development of long-term monitoring devices. Many approaches based on deep learning (DL) have been proposed, but they impose prohibitive computational and memory requirements that limit the development of embedded devices. In this work, we propose a two-stage framework based on denoising autoencoder (DAE) models. Our approach sequentially isolates maternal and fetal components from abdominal signals. The models were trained and tested on simulated data and evaluated on real recordings. They achieved a median F1-score of 1.00 and 0.93 in QRS detection for simulated and real signals, respectively, and a median Pearson correlation coefficient of 0.89 for simulated data. Each model requires only 80,144 parameters (32-bit precision), occupying a total of 0.62 MB of memory -- representing a 5× reduction compared to the state-of-the-art approach while maintaining comparable detection accuracy.
Paperid:3372
Authors:Hugo Massaroli · Leonardo Iara · Emmanuel Iarussi · Viviana Siless
Abstract:
Large language models (LLMs) are increasingly deployed in real-world applications, yet concerns about their fairness persist, especially in high-stakes domains like criminal justice, education, healthcare, and finance. This paper introduces a transparent evaluation protocol for benchmarking the fairness of open-source LLMs using smart contracts on the Internet Computer Protocol (ICP) blockchain. Our method ensures verifiable, immutable, and reproducible evaluations by executing on-chain HTTP requests to hosted Hugging Face endpoints and storing datasets, prompts, and metrics directly on-chain. We benchmark Llama, DeepSeek, and Mistral models on two fairness-sensitive datasets: COMPAS for recidivism prediction and PISA for academic performance forecasting. Fairness is assessed using statistical parity, equal opportunity, and structured Context Association Metrics (CAT). We further extend our analysis with a multilingual evaluation across English, Spanish, and Portuguese using the Kaleidoscope benchmark, revealing cross-linguistic disparities. All code and results are open source, enabling community audits and longitudinal fairness tracking across model versions.
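The two group-fairness metrics named above have simple definitions. The sketch below is illustrative only (stdlib-only, not tied to the paper's on-chain protocol): it computes the statistical-parity gap and the equal-opportunity gap from binary predictions, labels, and a binary group attribute.

```python
def statistical_parity_gap(y_pred, group):
    """|P(yhat = 1 | g = 0) - P(yhat = 1 | g = 1)| for a binary group attribute."""
    def rate(g):
        preds = [p for p, gr in zip(y_pred, group) if gr == g]
        return sum(preds) / len(preds)
    return abs(rate(0) - rate(1))

def equal_opportunity_gap(y_true, y_pred, group):
    """Absolute difference in true-positive rates between the two groups."""
    def tpr(g):
        preds = [p for t, p, gr in zip(y_true, y_pred, group) if gr == g and t == 1]
        return sum(preds) / len(preds)
    return abs(tpr(0) - tpr(1))
```

Statistical parity compares positive-prediction rates unconditionally, while equal opportunity conditions on the true label being positive.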
Paperid:3373
Authors:Vladimir Araujo · Marie-Francine Moens · Tinne Tuytelaars
Abstract:
Parameter-efficient fine-tuning (PEFT) techniques are increasingly applied to pre-trained language models (PLMs) for continual learning (CL). Typically, these methods train a PEFT module for each new task and use similarity-based selection to route modules during inference. However, they suffer from two key issues: interference with previously trained modules and suboptimal module composition. We introduce L2R, a strategy that isolates the training of new PEFT modules to secure task-specific specialization. L2R then learns to combine these modules by training a router network that leverages a small memory of prior task examples. We evaluate our approach in two CL setups across various benchmarks. Our findings show that L2R effectively composes PEFT modules, resulting in improved generalization and performance compared to other methods.
Paperid:3374
Authors:Paula Leandra Loeblein Pinto · Gustavo Silva · Rafael Sousa
Abstract:
This study explores the use of Machine Learning to track anxiety in pregnant women through voice analysis of data collected in a maternal health initiative, processed with MFCC, OpenSMILE, and PyAudio in 8-second windows. Ten classification algorithms were initially tested, with five thoroughly evaluated using AUC, F1-score, and accuracy. Logistic Regression using PyAudio features yielded the best results (AUC=0.705, F1-score=0.51), and demographic data improved performance, while ensemble methods may enhance stability, despite the limited dataset. Although promising, these findings require validation on a larger dataset, and the modeling will be refined for robustness and clinical applicability. These results suggest voice analysis represents a promising approach for anxiety screening in pregnancy, offering a potentially accessible and non-invasive tool for early identification and support.
Paperid:3375
Authors:Itzel Tlelo-Coyotecatl · Hugo Jair Escalante
Abstract:
The massive rise of multimodal content demands automatic methods capable of handling such complex information. Tasks oriented to analyzing video content can benefit from multimodal representation learning; however, challenges related to high-dimensional data and the capture of cross-modal relationships remain open to research. Taking inspiration from attention mechanisms and the properties of the locally linear embedding technique, we propose two cross-modal variants that reconstruct one modality using information from another, aiming to leverage local neighborhood information and preserve geometric information across modalities. We evaluate our generated cross-modal representations on a video-based sarcasm detection task, showing competitive performance considering the dimensionality reduction. Our results and analysis indicate the potential of incorporating geometry awareness and dimensionality reduction into multimodal approaches.
Paperid:3376
Authors:Rosa Encinas · Felipe Almeida
Abstract:
Environmental monitoring of contaminated urban zones is fundamental in risk management and decision-making. While the use of Machine Learning in these contexts is on the rise, its application to underground gas and vapor monitoring data remains limited. This study evaluates Supervised ML models for analyzing multivariate time series of gas emissions---including CH$_4$, CO$_2$, O$_2$, H$_2$S, and CO---from contaminated soil and subsoil environments. The dataset comprises observations from 131 gas monitoring wells across 14 buildings on the USP Leste campus in São Paulo, Brazil, collected between 2014 and 2022. Five regression models were tested: Linear Regression, k-Nearest Neighbors, Decision Tree, Random Forest, and XGBoost. Model performance was assessed using $R^2$, MAE, and RMSE. Random Forest consistently achieved superior performance in the per-well modeling scenario (Experiment 1), demonstrating lower error rates and effective pattern recognition across individual monitoring wells. In the all-well configuration (Experiment 2), the best-performing model varied depending on feature composition, with DT, RF, and XGBoost excelling in different settings.
Paperid:3377
Authors:Agatha Mattos · Gavin McArdle · Michela Bertolotto
Abstract:
In this paper, we introduce SlumMapVisionBR, a new dataset composed of freely available medium-resolution images of over one hundred municipalities in Brazil and ground-truth data with slum areas for each location. To allow comparison with future studies, we benchmark this dataset utilising two machine learning methods: multispectral data paired with a random forests classifier and deep learning convolution neural networks. We report our results using common metrics used in slum mapping: accuracy and mean intersection over union. Our experiments show that deep learning performed, on average, better than multispectral data classification when both were evaluated using the same experimental setting. The average accuracy and mean intersection over union were 0.89 and 0.47 for the deep learning method and 0.77 and 0.42 for multispectral data classification. Our dataset and the code to reproduce results have been included in this paper submission and will be made available online upon acceptance of the paper.
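Mean intersection over union, the second metric reported above, can be computed from flattened label maps as follows. This is a minimal illustrative sketch (the function name is invented; real slum-mapping evaluations operate on per-pixel segmentation masks): for each class, intersection is the count of positions where prediction and ground truth agree on that class, and union is the count of positions where either assigns it.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean intersection-over-union across classes, from flat label lists."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious)
```

For binary slum/non-slum maps, num_classes would be 2 and the lists would be the flattened pixel labels.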
Paperid:3378
Authors:Mariana Risco Cosavalente · Claudio Jung · Sharon C Quispe
Abstract:
This paper compares the performance of pretrained and fine-tuned Vision-Language Models (SigLIP2 Naflex and SigLIP Multilingual) for classifying disasters in images scraped from social media in Peru and Brazil. While zero-shot inference showed limited results, fine-tuning both models improved performance, with SigLIP2 slightly outperforming the others. The work highlights areas for further improvement in disaster detection models for Latin America.
Paperid:3379
Authors:Mateo Espinosa Zarlenga · Gabriele Dominici · Pietro Barbiero · Zohreh Shams · Mateja Jamnik
Abstract:
In this paper, we investigate how concept-based models (CMs) respond to out-of-distribution (OOD) inputs. CMs are interpretable neural architectures that first predict a set of high-level concepts (e.g., "stripes", "black") and then predict a task label from those concepts. In particular, we study the impact of concept interventions (i.e., operations where a human expert corrects a CM’s mispredicted concepts at test time) on CMs' task predictions when inputs are OOD. Our analysis reveals a weakness in current state-of-the-art CMs, which we term leakage poisoning, that prevents them from properly improving their accuracy when intervened on for OOD inputs. To address this, we introduce MixCEM, a new CM that learns to dynamically exploit leaked information missing from its concepts only when this information is in-distribution. Our results across tasks with and without complete sets of concept annotations show that MixCEMs outperform strong baselines, significantly improving their accuracy for both in-distribution and OOD samples in the presence and absence of interventions.
Paperid:3380
Authors:Angie Marquina · Fidel Cruz · Juan Ochoa
Abstract: Accurate mass flow estimation of sugarcane under mechanized harvesting conditions is critical for operational efficiency in agriculture. Current approaches rely on sensor-based methods, but they involve high repair costs, environmental sensitivity, and uncertain benefits. In this paper, we introduce MassFlow, a vision-based model that combines a 3D ResNet (R3D) backbone with optical flow-guided attention to capture spatiotemporal patterns under guided conditions. We propose a two-phase training approach: the first phase optimizes the alignment between flow and feature maps, and the second phase focuses on mass prediction and gradient regularization. Trained on real-world videos from harvesting operations, our experiments show that MassFlow achieves a mean absolute error (MAE) of 8.6% in per-window evaluation and a mean absolute percentage error (MAPE) of 8.96 in per-video evaluation, outperforming a prior 2D-CNN baseline. These results demonstrate that combining spatiotemporal features with flow-guided attention improves mass estimation and reduces dependence on sensor-based systems.